apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0
1.81k stars 431 forks source link

Parquet Filter predicate storing nested string causing OOM's #403

Open asfimport opened 1 year ago

asfimport commented 1 year ago

Each Instance of ColumnFilterPredicate stores the filter values in toString variable eagerly. Which is not useful


static abstract class ColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable  {
  private final Column<T> column;
  private final T value;
  private final String toString; 

protected ColumnFilterPredicate(Column<T> column, T value) {
  this.column = Objects.requireNonNull(column, "column cannot be null");

  // Eq and NotEq allow value to be null, Lt, Gt, LtEq, GtEq however do not, so they guard against
  // null in their own constructors.
  this.value = value;

  String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
  this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + value + ")";
}

 

 

If your filter predicate is too long/nested this can take a lot of memory while creating Filter. We have seen in our productions this can go upto 4gbs of space while opening multiple parquet readers

Same thing is replicated in BinaryLogicalFilterPredicate. Where toString is eagerly calculated and stored in string and lot of duplication is happening while making And/or filter.

I did not find use case of storing it so eagerly

Reporter: Abhishek Jain

Note: This issue was originally created as PARQUET-2220. Please see the migration documentation for further details.

asfimport commented 1 year ago

Abhishek Jain: Can anyone from parquet contributors take a look on this ? 

asfimport commented 1 year ago

Abhishek Jain: very sorry for tagging @gszadovszky @theosib-amazon . Just want to get this noticed

asfimport commented 1 year ago

Gabor Szadovszky / @gszadovszky: [~abhiSumo304], I agree eagerly storing the toString value is not a good idea. I don't think it has proper use case either. toString should be used for debugging purposes anyway so eagerly storing the value does not really make sense. Unfortunately, I don't work on the Parquet code base actively anymore. Feel free to put up a PR to fix this and I'll try to review it in time.