Open johnnystargazer opened 2 years ago
Hello,
Is there any news on this issue? We are seeing something similar on our side. As we add more Parquet files to our data lake, planning time increases because Drill opens every single Parquet file under 'selectionRoot', even when dir columns are specified in the query.
NOTE: the problem seems to only appear with JOIN.
I have the same issue using partitioned directories containing Parquet files (testing with CSV files gives the same results). The more files there are, the slower the query gets. See the attached extracts of drillbits.log at DEBUG level, which show that all files are scanned (attachment: DrillBits.log).
Describe the bug Drill scans all the Parquet files under the query root for metadata if the query contains an "inner join".
To Reproduce Steps to reproduce the behavior:
Expected behavior As we only query t.dir2 >= '2021-11-23' AND t.dir2 <= '2021-11-30', and the invalid file is under dir2="2010-01-01", the expected behavior is that Drill performs the query without any error. Instead, it returns "data.parquet is not a Parquet file", which proves that Drill scans all the Parquet files under the query root directory.
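To make the expected behavior concrete, here is a minimal Python sketch of directory-based pruning. It is only an illustration of the idea, not Drill's planner code: the function name `prune_by_dir2`, the root `/lake/events`, and the file paths are all hypothetical. It assumes dir0, dir1, dir2 are the first three directory levels under the query root and that dir2 holds ISO dates, so plain string comparison orders them correctly.

```python
from pathlib import PurePosixPath


def prune_by_dir2(paths, selection_root, low, high):
    """Keep only files whose dir2 level (third directory under the root) is in [low, high].

    Only the path strings are inspected; no file is opened.
    """
    root = PurePosixPath(selection_root)
    kept = []
    for p in paths:
        # parts looks like (dir0, dir1, dir2, ..., filename)
        parts = PurePosixPath(p).relative_to(root).parts
        if len(parts) < 4:
            continue  # not nested deeply enough to have a dir2 value
        if low <= parts[2] <= high:
            kept.append(p)
    return kept


# Hypothetical layout: one file inside the queried range, one broken file outside it.
files = [
    "/lake/events/a/b/2021-11-25/data.parquet",
    "/lake/events/a/b/2010-01-01/data.parquet",  # the invalid file
]
selected = prune_by_dir2(files, "/lake/events", "2021-11-23", "2021-11-30")
print(selected)
```

With pruning applied before any footer is read, the invalid file under dir2="2010-01-01" would never be touched, which is the behavior we see when the query has no inner join.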
Screenshots
Additional context Drill returns successfully if there is no inner join in the query.