BlazingDB / blazingsql

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
https://blazingsql.com
Apache License 2.0
1.93k stars 183 forks source link

[BUG] Count(*) gives wrong results for some parquet files with metadata #1121

Closed rommelDB closed 3 years ago

rommelDB commented 4 years ago

Describe the bug Count(*) gives wrong results for some parquet files with metadata. Curiously, if we call bc.sql() with the non-optimized logical plan as input, the output is right.

Steps/Code to reproduce bug Reproducer script:

from blazingsql import BlazingContext
bc = BlazingContext()
bc.create_table('nyc_taxi', "small-*.parquet")

print(" -- blz")
res = bc.sql('select count(*) from nyc_taxi where total_amount > 60')
print(res)

plan = '''LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
  LogicalProject($f0=[0])
    BindableTableScan(table=[[main, nyc_taxi]], filters=[[>($0, 60)]], projects=[[0]], aliases=[[$f0]])\n'''
res = bc.sql(algebra=plan, query="")
print(" -- blz-optimized")
print(res)

plan = '''LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
  LogicalProject($f0=[0])
    LogicalFilter(condition=[>($0, 60)])
      LogicalTableScan(table=[[main, nyc_taxi]])\n'''
res = bc.sql(algebra=plan, query="")
print(" -- blz-non-optimized")
print(res)

Sample data: https://drive.google.com/drive/folders/1cjeLkqDTkcMFKY6aNplD_OnXuJeLFCWW?usp=sharing

Output:

listening: tcp://*:28624
BlazingContext ready
 -- blz
   count(*)
0         6     <------- WRONG!
 -- blz-optimized
   count(*)
0         6     <------- WRONG!
 -- blz-non-optimized
   count(*)
0         9     <------- OK!

Expected behavior Output should be the same for both optimized and non-optimized logical plans.

Environment overview

wmalpica commented 4 years ago

If I had to guess, when you run the query with optimized relational algebra, the filter is in the table scan which will mean that the query will be preprocessed on the metadata, while if its not optimized the filter is in a different relational algebra step and therefore the query will not be pre-processed on the metadata, which leads me to believe that the issue is somehow related to the processing of the metadata. We could confirm this by looking at the beginning of the execution logs RAL.0.log and see how many files or rowgroups the query is operating on. I would guess that when the relational algebra is optimized it is operating on less files.

If so lets start investigating the metadata itself first.

For this we want to first make sure that the metadata is being captured correctly. So after we create the table: bc.create_table('nyc_taxi', "small-*.parquet") we can look at the metadata: print(bc.tables['nyc_taxi'].metadata) In the metadata there should be a min and max for total_amount for every rowgroup and file. We should see if that data makes sense. We can go a step further and validate that if we queried the min and max of total_amount for every file individually, it should match what is in the metadata. If it does not, then we know that we are likely not parsing the metadata correctly and we can investigate in parquet_metadata.cpp