NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] split support "Base expression cannot start with quantifier near index 1" #11460

Open viadea opened 1 month ago

viadea commented 1 month ago

I would like the split function to support patterns that currently fall back to the CPU with "Base expression cannot start with quantifier near index 1".

To reproduce:

sc.makeRDD(1 to 10000, 6).toDF.createOrReplaceTempView("df")
spark.sql("select split(value, '(xxx)') from df").show

The fallback error is:

        !Expression <StringSplit> split(cast(value#2 as string), (xxx), -1) cannot run on GPU because Base expression cannot start with quantifier near index 1

Note: xxx here stands in for the actual pattern, which I can only share internally since it comes from user code.

revans2 commented 1 month ago

@viadea the error message is coming from

https://github.com/NVIDIA/spark-rapids/blob/502f5a3cd96e458c8471794af9d2e209d9f0b42f/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala#L160-L162

This is essentially saying that an expression cannot start with a *, +, or ? character. That check appears to be totally valid, except when the character is at the start of a group. Our group parsing code appears to support only non-capture groups.

https://github.com/NVIDIA/spark-rapids/blob/502f5a3cd96e458c8471794af9d2e209d9f0b42f/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala#L170-L173

That also appears to be all that cuDF supports: https://docs.rapids.ai/api/cudf/stable/libcudf_docs/md_regex/#groups
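To make the behavior described above concrete, here is a minimal sketch of such a check. It is a simplified, hypothetical model and not the actual RegexParser code: the object name, the `check` function, and the error wording are assumptions for illustration, and it ignores character classes such as `[*]`.

```scala
import scala.annotation.tailrec

// Hypothetical, simplified model of the check discussed in this issue.
// This is NOT the spark-rapids RegexParser implementation.
object QuantifierCheckSketch {
  private val Quantifiers = Set('*', '+', '?')

  /** Left(message) if the pattern would be rejected, Right(()) otherwise. */
  def check(pattern: String): Either[String, Unit] = {
    // atStart is true at the start of the pattern and right after '(' or '|'
    @tailrec
    def loop(i: Int, atStart: Boolean): Either[String, Unit] =
      if (i >= pattern.length) Right(())
      else pattern.charAt(i) match {
        // an escaped character is a literal, so skip it together with the backslash
        case '\\' => loop(i + 2, atStart = false)
        // "(?:" opens a non-capture group, the only "(?" form recognized here
        case '(' if pattern.startsWith("?:", i + 1) => loop(i + 3, atStart = true)
        // a plain group or an alternation starts a new base expression
        case '(' | '|' => loop(i + 1, atStart = true)
        // any other "(?" form ends up here, because '?' is read as a quantifier
        case c if atStart && Quantifiers(c) =>
          Left(s"Base expression cannot start with quantifier near index $i")
        case _ => loop(i + 1, atStart = false)
      }
    loop(0, atStart = true)
  }
}
```

Under this model, `QuantifierCheckSketch.check("(?:a)")` is accepted, while a pattern such as `(?=a)` is rejected at index 1, which is where the `?` sits directly after the opening parenthesis.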

But Java patterns https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html appear to support many other types of groups, which result in this error.
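For reference, the snippet below lists the group constructs from the java.util.regex.Pattern Javadoc that begin with `(?` and confirms that the JVM compiles each one. The sample patterns and the object name are made up for illustration and are unrelated to the user's pattern; the snippet only exercises the CPU regex engine, not the plugin.

```scala
import java.util.regex.Pattern

object JavaGroupConstructs {
  // Constructs documented for java.util.regex.Pattern that start with "(?".
  // The sample regexes are illustrative only.
  val samples: Seq[(String, String)] = Seq(
    "non-capturing group"             -> "(?:ab)+",
    "named capturing group"           -> "(?<word>ab)",
    "inline flags"                    -> "(?i)ab",
    "non-capturing group with flags"  -> "(?i:ab)",
    "positive lookahead"              -> "a(?=b)",
    "negative lookahead"              -> "a(?!b)",
    "positive lookbehind"             -> "(?<=a)b",
    "negative lookbehind"             -> "(?<!a)b",
    "independent non-capturing group" -> "(?>ab)"
  )

  def main(args: Array[String]): Unit =
    samples.foreach { case (name, regex) =>
      // Pattern.compile throws PatternSyntaxException if Java rejects the regex
      Pattern.compile(regex)
      println(s"$name: $regex")
    }
}
```

Knowing which of these forms the user's pattern relies on would narrow down what needs to be supported.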

Since each of these is rather complex to test and implement, could you clarify which of them is needed?