apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Empty projection returns the wrong number of rows when column index is enabled #2702

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Discovered in Spark: when reading an empty projection from a Parquet file with filter pushdown enabled (typically a filter followed by a count), parquet-mr returns the wrong number of rows if the column index is enabled. When the column index feature is disabled, the result is correct.

This happens due to the following:

  1. ParquetFileReader::getFilteredRowCount() (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L851) selects row ranges to calculate the row count when column index is enabled.
  2. In ColumnIndexFilter (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80) we filter row ranges and pass the set of paths, which in this case is empty.
  3. When evaluating the filter, if the column path is not in the set, we return an empty list of rows (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L178). This is always the case for an empty projection, because no column path can be in an empty set.
  4. This results in the incorrect number of records reported by the library.
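
The steps above can be illustrated with a small stand-alone model. This is only a sketch of the described behaviour, not parquet-mr code: the class `EmptyProjectionSketch`, the method `filteredRowCount`, and the row counts are hypothetical and chosen for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the behaviour described in steps 1-4.
// None of these names are parquet-mr APIs.
class EmptyProjectionSketch {

    // Returns the number of rows kept by a predicate on filterColumn.
    // rowsMatchingIndex stands in for what the column index would select.
    static long filteredRowCount(String filterColumn,
                                 Set<String> projectedColumns,
                                 long rowsMatchingIndex) {
        // Step 3: a column path that is not in the set yields no rows.
        if (!projectedColumns.contains(filterColumn)) {
            return 0L;
        }
        return rowsMatchingIndex;
    }

    public static void main(String[] args) {
        Set<String> emptyProjection = Collections.emptySet();
        Set<String> withColumn = new HashSet<>(Collections.singletonList("id"));

        // Empty projection: the filter column is never in the set.
        System.out.println(filteredRowCount("id", emptyProjection, 40L)); // prints 0
        // Projection that includes the filter column.
        System.out.println(filteredRowCount("id", withColumn, 40L));      // prints 40
    }
}
```

In this model, a count over an empty projection reports 0 rows even though 40 rows match the filter, which is the kind of mismatch described in step 4.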

I will provide the full repro later.

Reporter: Ivan Sadikov

Related issues:

Note: This issue was originally created as PARQUET-2170. Please see the migration documentation for further details.

asfimport commented 2 years ago

Ivan Sadikov: I will update the description later, and I would like to open a PR to fix the issue. I think we just need to check whether the column set is empty when checking paths in ColumnIndexFilter, but I need to confirm this.
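
One possible shape of that check, as a stand-alone sketch rather than a patch: the class `EmptyProjectionFixSketch` and the method `filteredRowCount` are hypothetical, and treating an empty column set as "do not skip any column" is an assumption that still needs to be confirmed against ColumnIndexFilter.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch of the proposed guard; not parquet-mr code.
class EmptyProjectionFixSketch {

    // Returns the number of rows kept by a predicate on filterColumn.
    // rowsMatchingIndex stands in for what the column index would select.
    static long filteredRowCount(String filterColumn,
                                 Set<String> projectedColumns,
                                 long rowsMatchingIndex) {
        // Assumed guard: only treat the column as missing when a
        // non-empty projection actually excludes it.
        boolean missing = !projectedColumns.isEmpty()
                && !projectedColumns.contains(filterColumn);
        if (missing) {
            return 0L;
        }
        return rowsMatchingIndex;
    }

    public static void main(String[] args) {
        Set<String> emptyProjection = Collections.emptySet();
        Set<String> otherColumn = Collections.singleton("name");

        // Empty projection: the filter is still evaluated.
        System.out.println(filteredRowCount("id", emptyProjection, 40L)); // prints 40
        // Non-empty projection without the filter column: unchanged behaviour.
        System.out.println(filteredRowCount("id", otherColumn, 40L));     // prints 0
    }
}
```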