apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0
2.63k stars 1.41k forks source link

Add filter null check for ColumnIndex #2517

Open asfimport opened 4 years ago

asfimport commented 4 years ago

This Jira is opened for discussion that should we add null checking for the filter when ColumnIndex is enabled.

In the ColumnIndexFilter#calculateRowRanges() method, the input parameter 'filter' is assumed to be non-null without checking. It throws NPE when ColumnIndex is enabled(by default) but there is no filter set in the ParquetReadOptions. The call stack is as below. java.lang.NullPointerException at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961) at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891)

If we don't add, the user might need to choose to call readNextRowGroup() or readFilteredNextRowGroup() accordingly based on filter existence.

Thoughts?

Reporter: Xinli Shang / @shangxinli Assignee: Xinli Shang / @shangxinli

Note: This issue was originally created as PARQUET-1901. Please see the migration documentation for further details.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: It is clear we shall handle this case properly. I've quickly checked the other filters (DictionaryFilter, StatisticsFilter and BloomFilterImpl) and neither handles the case of the filter being null (meaning they all throw NPE). So, I would vote on not checking for the filter being null in ColumnIndexFilter. Instead, the places where it is invoked shall handle the case of a null filter like here.

asfimport commented 4 years ago

Xinli Shang / @shangxinli: Hi @rdblue, please comment on this if you have different opinions. This is found during the ColumnIndex integration to Iceberg. We would need to handle the null checking in Iceberg anyway before Parquet 1.12.0.

asfimport commented 4 years ago

Ryan Blue / @rdblue: It isn't clear to me how a filter implementation would handle the filter itself being null. It could return a default value to accept/read, but that runs into issues when filters like not(null) are passed in. So I agree with Gabor that it makes sense for a null filter to be an exceptional case in the filter implementations themselves.

But I would expect a method like calculateRowRanges to correctly return the default RowRanges.createSingle(rowCount) if that method were passed a null value, since it is not actually processing the filter.

For Iceberg, I'm wondering if it wouldn't be easier to implement our own filter implementation that produced row ranges and passed them in. That's how we filter row groups and I think it has been much easier not needing to convert to Parquet filters, which are difficult to work with.

asfimport commented 4 years ago

Xinli Shang / @shangxinli: I have the initial version of Iceberg integration working in my private repo https://github.com/shangxinli/iceberg/commit/4cc9351f8a511a3179cb3ac857541f9116dd8661. It can skip the pages now based on the column index. But it is very initial version and I didn't finalize it yet, also no tests are added. I also didn't get time to address your feedback to idtoAlias comments yet. But I hope it can give you an idea ON what the integration looks like.

asfimport commented 3 years ago

Gabor Szadovszky / @gszadovszky: Still want to work on this for 1.12.0?

asfimport commented 3 years ago

Xinli Shang / @shangxinli: For now, I think we can move it to the next release.