NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Support Databricks ephemeralsubstring #4041

Closed viadea closed 1 year ago

viadea commented 2 years ago

Is your feature request related to a problem? Please describe.

On Databricks, the function substr is replaced by a new expression named ephemeralsubstring, so the query falls back to the CPU with the following driver log message:

!NOT_FOUND <EphemeralSubstring> ephemeralsubstring(name#48, 1, 1) cannot run on GPU because no GPU enabled version of expression class com.databricks.sql.optimizer.EphemeralSubstring could be found

Mini repro:

Seq("a", "b").toDF("name").write.format("parquet").mode("overwrite").save("/tmp/testparquet")
spark.read.parquet("/tmp/testparquet").createTempView("df")
spark.sql("select * from df where substr(name,1,1)='a'").explain()

This is a feature request to support this on GPU.

Note: This is for tracking purposes, since I know the Databricks expressions are a black box to us right now.

viadea commented 2 years ago

It seems https://github.com/NVIDIA/spark-rapids/issues/1563 reported a similar issue before.

viadea commented 2 years ago

Workaround: this conversion can be worked around by setting:

spark.databricks.optimizer.reduceSubstringMaterialization false
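
For reference, a minimal sketch of applying the setting from a notebook or spark-shell session where `spark` is already defined. It assumes the property is honored when set at session level with `spark.conf.set`; if it is not, it may need to go into the cluster's Spark config instead. The query is the one from the mini repro and relies on the `df` temp view created there.

```scala
// Assumption: this Databricks property takes effect when set on the running session.
spark.conf.set("spark.databricks.optimizer.reduceSubstringMaterialization", "false")

// Re-run the repro query and inspect the plan: if the workaround applied,
// the filter should use substring rather than ephemeralsubstring.
spark.sql("select * from df where substr(name,1,1)='a'").explain()
```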

mattahrens commented 1 year ago

Fixed by https://github.com/NVIDIA/spark-rapids/pull/7797