apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.94k stars 980 forks source link

Drill scan all the parquet file from query root for metadata if there is a "inner join " in query #2402

Open johnnystargazer opened 2 years ago

johnnystargazer commented 2 years ago

Describe the bug Drill scan all the parquet file from query root for metadata if there is a "inner join " in query.

To Reproduce Steps to reproduce the behavior:

# /data/01/ is the query root in this case
# prepare the directory for drill to use in query
mkdir -p /data/01/2021/11/2021-11-23/
# download a parquet file for drill to query
curl https://raw.githubusercontent.com/apache/drill/master/sample-data/nation.parquet -o /data/01/2021/11/2021-11-23/data.parquet

# prepre inner join directory ,
mkdir -p /data/PRO/item
# we prepare a invalid parquet file, this file is not supposed to be scan in query
mkdir -p /data/01/2010/01/2010-01-01/
echo "abc" > /data/01/2010/01/2010-01-01/data.parquet

# query drill endpoint by curl
json="{\"queryType\":\"SQL\", \"query\": \"SELECT COUNT(*) FROM  dfs.\`/data/01\` as t INNER JOIN dfs.\`/data/PRO/item\` item  ON t.N_REGIONKEY = item.ID WHERE t.dir2 >='2021-11-23' AND t.dir2<='2021-11-30' AND (REPEATED_CONTAINS(item.CATEGORIES,1031) OR REPEATED_CONTAINS(item.CATEGORIES,1047))\", \"autoLimit\":1}"
drill_host="localhost:8047"
curl -XPOST  -H "Content-Type: application/json" "$drill_host/query.json" -d "$json"

Expected behavior As we only query t.dir2 >='2021-11-23' AND t.dir2<='2021-11-30' , and invalite file is under dir2="2010-01-01" , the expected behavior is drill perform query without any error, but it it return data.parquet is not a Parquet file, the result approve that drill scan all the parquet file from query root directory.

Screenshots

{
  "errorMessage" : "SYSTEM ERROR: RuntimeException: file:/data/01/2010/01/2010-01-01/data.parquet is not a Parquet file (too small length: 4)\n\n\nPlease, refer to logs for more information.\n\n[Error Id: ce4e61af-5df8-440e-81d2-673c89106e5f on drill-0.drill:31010]"
}

Additional context Drill return successfully if no inner join in query

# query drill endpoint by curl
json="{\"queryType\":\"SQL\", \"query\": \"SELECT COUNT(*) FROM  dfs.\`/data/01\` as t WHERE t.dir2 >='2021-11-23' AND t.dir2<='2021-11-30'\", \"autoLimit\":1}"
drill_host="localhost:8047"
curl -XPOST  -H "Content-Type: application/json" "$drill_host/query.json" -d "$json"
 {
  "queryId" : "1e4ce295-3052-a66c-b68f-96cf4a97806d",
  "columns" : [ "EXPR$0" ],
  "rows" : [ {
    "EXPR$0" : "25"
  } ],
  "metadata" : [ "BIGINT" ],
  "queryState" : "COMPLETED",
  "attemptedAutoLimit" : 1
}
philou93 commented 1 year ago

Hello,

Is there any news on this issue? We have a similar issue on our side. As we add more parquet files in our data lake, the planning time increases due to Drill opening every single parquet under 'selectionRoot' even if dir columns are specified.

NOTE: the problem seems to only appear with JOIN.

Loy2 commented 1 year ago

I have the same issue using partitioned directories containing parquet files (tested with csv files give the same results) The more files the slower the query gets ... see extracts of drillbits.log with DEBUG level showing that all files are scanned... DrillBits.log