Closed rommelDB closed 3 years ago
If I had to guess: when you run the query with the optimized relational algebra, the filter is inside the table scan, so the query is pre-processed against the metadata. When the plan is not optimized, the filter is a separate relational algebra step, so the query is not pre-processed against the metadata. That leads me to believe the issue is related to how the metadata is processed. We could confirm this by looking at the beginning of the execution log RAL.0.log to see how many files or rowgroups the query is operating on. I would guess that when the relational algebra is optimized it operates on fewer files.
If so, let's start by investigating the metadata itself.
For this we want to first make sure that the metadata is being captured correctly. So after we create the table:
bc.create_table('nyc_taxi', "small-*.parquet")
we can look at the metadata:
print(bc.tables['nyc_taxi'].metadata)
In the metadata there should be a min and max for total_amount for every rowgroup and file.
We should see if that data makes sense. We can go a step further and validate it: if we query the min and max of total_amount for every file individually, the results should match what is in the metadata. If they do not, we are likely not parsing the metadata correctly, and we can investigate in parquet_metadata.cpp.
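As a rough sketch of that validation outside of BlazingSQL, something like the following could compare the min/max recorded in each Parquet footer against the min/max computed from the actual values. It assumes pyarrow is installed and that the files match the same small-*.parquet pattern; the helper names (recorded_stats, actual_stats, find_mismatches) are made up for illustration and are not part of BlazingSQL.

```python
import glob


def recorded_stats(path, column):
    """Return [(row_group, min, max)] as stored in the Parquet footer."""
    import pyarrow.parquet as pq  # assumption: pyarrow is available

    meta = pq.ParquetFile(path).metadata
    out = []
    for rg in range(meta.num_row_groups):
        group = meta.row_group(rg)
        for ci in range(group.num_columns):
            col = group.column(ci)
            if col.path_in_schema != column:
                continue
            st = col.statistics
            if st is not None and st.has_min_max:
                out.append((rg, st.min, st.max))
            else:
                # No statistics written for this row group.
                out.append((rg, None, None))
    return out


def actual_stats(path, column):
    """Return [(row_group, min, max)] computed from the column values."""
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(path)
    out = []
    for rg in range(pf.num_row_groups):
        values = pf.read_row_group(rg, columns=[column]).column(column)
        mm = pc.min_max(values)
        out.append((rg, mm["min"].as_py(), mm["max"].as_py()))
    return out


def find_mismatches(recorded, actual):
    """List row groups whose recorded (min, max) differs from the data."""
    mismatches = []
    for (rg, rmin, rmax), (_, amin, amax) in zip(recorded, actual):
        if rmin is None:
            continue  # nothing recorded, so nothing to compare
        if (rmin, rmax) != (amin, amax):
            mismatches.append((rg, (rmin, rmax), (amin, amax)))
    return mismatches


if __name__ == "__main__":
    for path in sorted(glob.glob("small-*.parquet")):
        bad = find_mismatches(
            recorded_stats(path, "total_amount"),
            actual_stats(path, "total_amount"),
        )
        print(path, "OK" if not bad else bad)
```

If every file prints OK here, the footers themselves are consistent with the data, which would point at how the metadata is parsed or applied rather than at the files. Comparing these footer values against what bc.tables['nyc_taxi'].metadata reports would be the next step.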
Describe the bug
Count(*) gives wrong results for some parquet files with metadata. Curiously, if we call bc.sql() with the non-optimized logical plan as input, the output is correct.
Steps/Code to reproduce bug
Reproducer script:
Sample data: https://drive.google.com/drive/folders/1cjeLkqDTkcMFKY6aNplD_OnXuJeLFCWW?usp=sharing
Output:
Expected behavior
Output should be the same for both optimized and non-optimized logical plans.
Environment overview