NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Delta Lake metadata query detection can trigger extra file listing jobs #9604

Open jlowe opened 9 months ago

jlowe commented 9 months ago

isDeltaLakeMetadataQuery can invoke inputFiles on a FileSourceScanExec's relation. On highly partitioned data sources, this often triggers a Spark job to list the files in the table. Users have seen extra file-listing stages appear that were triggered by isDeltaLakeMetadataQuery. Setting spark.rapids.sql.detectDeltaLogQueries to false makes these extra stages disappear.

jlowe commented 2 weeks ago

We may be able to do the metadata detection much more cheaply by checking rootPaths on the FileIndex rather than inputFiles, which would probably avoid any expensive work. I suspect the special metadata directories will show up in the rootPaths results for metadata queries without needing a full file listing, but this needs to be verified.
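
To make the proposed check concrete, here is a minimal sketch of the path test alone. It is written in Python for illustration and is not the plugin's Scala code. It assumes the root paths are already available as plain strings, and that the "special metadata directory" is Delta Lake's `_delta_log` transaction log directory. The function names are hypothetical. Whether rootPaths really contains `_delta_log` for metadata queries is the unverified premise from the comment above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Delta Lake keeps its transaction log under this directory name.
DELTA_LOG_DIR = "_delta_log"


def is_delta_log_path(path: str) -> bool:
    """Return True if any component of the path is exactly `_delta_log`.

    Accepts bare paths and URI-style paths (s3://, hdfs://, ...).
    """
    parts = PurePosixPath(urlparse(path).path).parts
    return DELTA_LOG_DIR in parts


def looks_like_delta_metadata_query(root_paths) -> bool:
    """Hypothetical helper: inspect only the root paths, never a file listing."""
    return any(is_delta_log_path(p) for p in root_paths)


if __name__ == "__main__":
    print(looks_like_delta_metadata_query(["s3://bucket/table/_delta_log"]))
    print(looks_like_delta_metadata_query(["hdfs://nn:8020/warehouse/t"]))
```

The sketch compares whole path components rather than substrings, so a directory such as `my_delta_log_backup` is not mistaken for the log directory. The cost is proportional to the number of root paths, not to the number of files or partitions in the table.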