Open asfimport opened 1 year ago
Abhishek Jain: Could one of the Parquet contributors take a look at this?
Abhishek Jain: Very sorry for tagging @gszadovszky and @theosib-amazon. I just want to get this noticed.
Gabor Szadovszky / @gszadovszky:
[~abhiSumo304], I agree that eagerly storing the toString value is not a good idea. I don't think it has a proper use case either. toString should be used for debugging purposes anyway, so eagerly storing the value does not really make sense. Unfortunately, I don't work on the Parquet code base actively anymore. Feel free to put up a PR to fix this and I'll try to review it in time.
Each instance of ColumnFilterPredicate eagerly stores its string representation, including the filter values, in the toString variable. This is not useful.
If the filter predicate is very long or deeply nested, this can take a lot of memory when the filter is created. In production we have seen this grow to as much as 4 GB while opening multiple Parquet readers.
The same pattern is repeated in BinaryLogicalFilterPredicate, where toString is eagerly computed and stored as a string, so a lot of duplication happens when building and/or filters.
I did not find a use case for storing it so eagerly.
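To illustrate the difference, here is a minimal sketch of the eager and lazy patterns. It does not use the Parquet code: `SketchPredicate`, `Leaf`, `EagerAnd`, `LazyAnd`, and `ToStringDemo` are hypothetical stand-ins written only for this example. In the eager variant every nested node keeps its own copy of the string for its whole subtree, so a deeply nested chain retains many overlapping copies. The lazy variant keeps nothing and builds the string only when toString is called.

```java
// Hypothetical stand-ins for illustration only; these are NOT the Parquet classes.
interface SketchPredicate {}

// A leaf comparison such as eq(column, value).
final class Leaf implements SketchPredicate {
    private final String column;
    private final Object value;

    Leaf(String column, Object value) {
        this.column = column;
        this.value = value;
    }

    @Override
    public String toString() {
        return "eq(" + column + ", " + value + ")";
    }
}

// Eager variant: the string is built in the constructor and retained
// for the lifetime of the predicate, even if it is never printed.
final class EagerAnd implements SketchPredicate {
    private final SketchPredicate left;
    private final SketchPredicate right;
    private final String toString;

    EagerAnd(SketchPredicate left, SketchPredicate right) {
        this.left = left;
        this.right = right;
        // Copies the children's strings into a new, longer string.
        this.toString = "and(" + left + ", " + right + ")";
    }

    @Override
    public String toString() {
        return toString;
    }
}

// Lazy variant: no string field; the text is built only on demand.
final class LazyAnd implements SketchPredicate {
    private final SketchPredicate left;
    private final SketchPredicate right;

    LazyAnd(SketchPredicate left, SketchPredicate right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public String toString() {
        return "and(" + left + ", " + right + ")";
    }
}

final class ToStringDemo {
    // Builds a left-nested chain of `leaves` comparisons joined by "and".
    static SketchPredicate chain(int leaves, boolean eager) {
        SketchPredicate acc = new Leaf("c0", 0);
        for (int i = 1; i < leaves; i++) {
            SketchPredicate next = new Leaf("c" + i, i);
            acc = eager ? new EagerAnd(acc, next) : new LazyAnd(acc, next);
        }
        return acc;
    }
}
```

Both variants print the same text, so switching to the lazy form should not change debugging output. The trade-off is that the lazy form rebuilds the string on every toString call, which seems acceptable if toString is only used for debugging.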
Reporter: Abhishek Jain
Note: This issue was originally created as PARQUET-2220. Please see the migration documentation for further details.