NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[Audit][SPARK-41509][SQL] Only execute `Murmur3Hash` on aggregate expressions for semi-join runtime filter #7525

Open wbo4958 opened 1 year ago

wbo4958 commented 1 year ago

This PR https://github.com/apache/spark/commit/739aae1554 changed the semi-join runtime filter in the logical plan, which may affect spark-rapids. We should verify it.

revans2 commented 1 year ago

This change is just a few lines that operate on the logical plan. It should have no impact on our ability to run the resulting command.

It would be nice to see how the configs that they set operate in practice on the GPU, and whether we can implement something to improve NDS performance using them. But they are both off by default, so we would need to understand why. One of them uses a BloomFilter aggregation and a membership check with xxhash64. We don't currently support either of those, and we are not likely to support them any time soon.

The other one uses a sub-query to try and do something similar, so it might be interesting to see what it does by itself.
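As a rough illustration of the two ideas (not Spark's or spark-rapids' actual implementation), the plain-Python sketch below filters the large side of a join using keys from the small side, once through a toy Bloom filter and once through an exact key set standing in for the sub-query. Every name here is hypothetical, and SHA-256 is used only as a stand-in hash; Spark's Bloom filter path uses xxhash64.

```python
import hashlib

NUM_BITS = 1024   # illustrative size, not a Spark default
NUM_HASHES = 3    # illustrative count, not a Spark default


def _positions(value):
    # Stand-in hashing: SHA-256 with a seed prefix (Spark uses xxhash64).
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def build_bloom(keys):
    # Aggregate all small-side keys into one bit set (held in a Python int).
    bits = 0
    for key in keys:
        for pos in _positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits, value):
    # May return false positives, never false negatives.
    return all((bits >> pos) & 1 for pos in _positions(value))


small_side_keys = [3, 7, 42]
large_side = [(k, f"row-{k}") for k in range(100)]

# Bloom filter path: approximate pre-filter of the large side.
bloom = build_bloom(small_side_keys)
bloom_filtered = [row for row in large_side if might_contain(bloom, row[0])]

# Sub-query path: exact membership test against the small side's keys.
key_set = set(small_side_keys)
semi_filtered = [row for row in large_side if row[0] in key_set]
```

In both cases the join afterwards sees far fewer large-side rows. The Bloom filter may let a few extra rows through, while the exact set does not.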

spark.sql.optimizer.runtime.bloomFilter.enabled=false
spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled=true
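To try that combination on a test query, one option is to pass the two settings as `--conf` flags to `spark-submit`. The helper below is a small hypothetical sketch that only builds the flag list; the function name is made up for illustration, and it assumes the settings above are the combination worth testing (Bloom filter off, semi-join reduction on).

```python
def runtime_filter_confs(bloom_filter, semi_join_reduction):
    # Hypothetical helper: maps the two booleans to spark-submit --conf flags.
    settings = {
        "spark.sql.optimizer.runtime.bloomFilter.enabled": bloom_filter,
        "spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled": semi_join_reduction,
    }
    args = []
    for key, value in settings.items():
        # Spark expects lowercase "true"/"false" for boolean configs.
        args += ["--conf", f"{key}={str(value).lower()}"]
    return args


flags = runtime_filter_confs(bloom_filter=False, semi_join_reduction=True)
print(" ".join(flags))
```

Flipping the two arguments gives the other combinations, which makes it easy to compare plans and timings with each filter enabled by itself.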