Does this require specific configuration, like using CONVERT_TIME, or does it also happen with TASK_TIME?
I suspect that this logic (in AlluxioUtils.scala) is failing for some reason on Databricks 10.4. It should be filtering out the DynamicPruningExpression in this case. Since Databricks evaluates DynamicPruningExpression differently than Apache Spark, there might be something missing in the Databricks case at this point:
```scala
// With the base Spark FileIndex type we don't know how to modify it to
// just replace the paths so we have to try to recompute.
def isDynamicPruningFilter(e: Expression): Boolean =
  e.find(_.isInstanceOf[PlanExpression[_]]).isDefined

val partitionDirs = relation.location.listFiles(
  partitionFilters.filterNot(isDynamicPruningFilter), dataFilters)
```
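If the Databricks shim produces a DPP filter whose child no longer contains a PlanExpression (for example, a DynamicPruningExpression whose subquery has already been evaluated), the predicate above would miss it and the filter would not be removed. A minimal sketch of a broader check, assuming that is the failure mode; this is not a confirmed fix:

```scala
import org.apache.spark.sql.catalyst.expressions.{
  DynamicPruningExpression, Expression, PlanExpression}

// Hypothetical broader predicate: also match DynamicPruningExpression
// directly, in case Databricks rewrites its child so that it no longer
// contains a PlanExpression.
def isDynamicPruningFilter(e: Expression): Boolean =
  e.find {
    case _: DynamicPruningExpression => true
    case _: PlanExpression[_]        => true
    case _                           => false
  }.isDefined
```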
Figured out how to reproduce this: first set spark.rapids.alluxio.replacement.algo to CONVERT_TIME. Then either tune spark.rapids.alluxio.large.file.threshold or just set spark.rapids.alluxio.slow.disk to false. Basically you need to avoid reading directly from S3 during the scan that includes the DynamicPruningExpression.
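For example, a minimal sketch of those settings (key names taken from the steps above; assuming they can be set on the session rather than only in the cluster config):

```scala
// Repro settings from the steps above (a sketch; put them in the
// cluster's Spark config instead if they are not runtime-settable).
spark.conf.set("spark.rapids.alluxio.replacement.algo", "CONVERT_TIME")
spark.conf.set("spark.rapids.alluxio.slow.disk", "false")
```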
It seems we never tested the CONVERT_TIME algorithm together with the large-file feature before this issue was filed.
@res-life can you confirm you needed to use CONVERT_TIME? One of our customers reported failures (though we don't have logs), and they were not using CONVERT_TIME.
Just tested: the CONVERT_TIME algorithm runs into the error, while the TASK_TIME algorithm (the default) is OK.
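So a likely workaround, sketched below, is to pin the replacement algorithm to the default explicitly (assuming the config is accepted at the session level):

```scala
// Workaround sketch: keep the default TASK_TIME replacement algorithm,
// which did not hit the DPP error in the test above.
spark.conf.set("spark.rapids.alluxio.replacement.algo", "TASK_TIME")
```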
Details: Spark config:

```
spark.rapids.alluxio.slow.disk false
spark.task.resource.gpu.amount 0.125
spark.shuffle.manager com.nvidia.spark.rapids.spark321db.RapidsShuffleManager
spark.hadoop.fs.s3a.access.key {{secrets/chongg-s3/access_key}}
spark.plugins com.nvidia.spark.SQLPlugin
spark.locality.wait 3s
spark.rapids.alluxio.automount.enabled true
spark.rapids.memory.pinnedPool.size 4G
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.secret.key {{secrets/chongg-s3/secret_access_key}}
spark.sql.files.maxPartitionBytes 1G
spark.rapids.sql.multiThreadedRead.numThreads 100
spark.rapids.sql.concurrentGpuTasks 2
spark.rapids.alluxio.home /opt/alluxio-2.9.0
```
Environment variables:

```
ENABLE_ALLUXIO=1
```
Describe the bug
NDS runs hit a DPP error on Databricks 10.4 when the Alluxio cache is enabled.
Steps/Code to reproduce bug
Create a Databricks cluster. Run the NDS test against the cluster.
Environment details
Databricks Runtime Version: 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
How to set up a Databricks cluster: https://github.com/NVIDIA/spark-rapids-container/tree/dev/Databricks
Additional context
It seems Databricks upgraded the runtime recently, and now the tests fail on DPP. Nothing changed on our side; we only restarted the cluster, so the Databricks update is the suspect.
Detailed log:
Query7 is:
If you can't reproduce it, reach out to me.