NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
771 stars 227 forks source link

[FEA][Audit] [SPARK-37629][SQL] Speed up Expression.canonicalized #4496

Open nartal1 opened 2 years ago

nartal1 commented 2 years ago

Is your feature request related to a problem? Please describe. We have copied some of the code from Object Canonicalize . Need to evaluate if this change should be pulled in https://github.com/apache/spark/commit/0a6be8cf59 .

HaoYang670 commented 2 years ago

I can work on it

HaoYang670 commented 2 years ago

Should we put the changes in Shim layers, or just modify the code in place?

jlowe commented 2 years ago

If I understand this correctly, I don't think it makes sense to do this change unless we really need any performance gain from it, as it would . The original PR modified Expression and relies on a new method to be implemented by expressions. Until Spark 3.3 is the minimum Spark spec we support, we cannot rely on Expression objects providing or having this method, because they will derive from Spark's Expression and old Spark versions don't have the required method definition. We could override it in our ShimExpression wrappers, but this doesn't handle Expression objects coming from the original CPU plan.

I think we would need to see a demonstrated need for this (i.e.: some performance metrics showing how much can be gained on real-world queries) before prioritizing this work.