NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Support single '$' or '^' on right side of regexp choice #10764

Open NVnavkumar opened 3 months ago

NVnavkumar commented 3 months ago

Is your feature request related to a problem? Please describe.

I wish the RAPIDS Accelerator for Apache Spark would support a single '$' or '^' on the right side of a regexp choice.

Simple reproduction:

scala> List("B=A:", "B=bA:").toDF("a").write.mode("overwrite").parquet("/tmp/test_re_dollar_choice.parquet")
24/05/03 20:33:52 WARN GpuOverrides:
    ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
      @Expression <AttributeReference> a#25 could run on GPU

scala> spark.read.parquet("/tmp/test_re_dollar_choice.parquet").selectExpr("regexp_extract(a, 'B\=(.*?)(\\:|$)', 1)").collect()
24/05/03 20:34:14 WARN GpuOverrides:
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
  @Expression <Alias> regexp_extract(a#28, B=(.*?)(:|$), 1) AS regexp_extract(a, B=(.*?)(:|$), 1)#30 could run on GPU
    !Expression <RegExpExtract> regexp_extract(a#28, B=(.*?)(:|$), 1) cannot run on GPU because regex group count is 0, but the specified group index is 1; Sequences that only contain '^' or '$' are not supported near index 10
      @Expression <AttributeReference> a#28 could run on GPU
      @Expression <Literal> B=(.*?)(:|$) could run on GPU
      @Expression <Literal> 1 could run on GPU

res5: Array[org.apache.spark.sql.Row] = Array([bA], [A])

Describe the solution you'd like

The above regular expression query should run on the GPU.
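
For reference, the expected CPU results for this pattern can be checked outside Spark. This is only a minimal sketch, assuming that Spark's CPU regexp_extract follows java.util.regex semantics and returns an empty string when nothing matches. The names DollarChoiceCpuReference and extractGroup are hypothetical and not part of Spark or the plugin.

```scala
import java.util.regex.Pattern

// Hypothetical helper, not part of Spark or spark-rapids: approximates
// what regexp_extract is expected to return on the CPU for one string.
object DollarChoiceCpuReference {
  def extractGroup(input: String, regex: String, group: Int): String = {
    val m = Pattern.compile(regex).matcher(input)
    // Assumption: an empty string stands in for "no match" or an unmatched group.
    if (m.find()) Option(m.group(group)).getOrElse("") else ""
  }

  def main(args: Array[String]): Unit = {
    // Same pattern as in the GpuOverrides output above.
    val regex = "B=(.*?)(:|$)"
    Seq("B=A:", "B=bA:").foreach { s =>
      println(s"$s -> ${extractGroup(s, regex, 1)}")
    }
  }
}
```

A GPU implementation would need to return the same values (A and bA) for the two input rows.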

gerashegalov commented 3 months ago

For some reason the number of capturing groups in selectExpr regexp_extract(a, '(\\:|$)', 1) and the GpuOverrides output regexp_extract(a#28, B=(.*?)(:|$), 1) are different

NVnavkumar commented 3 months ago

> For some reason the number of capturing groups in selectExpr regexp_extract(a, '(\\:|$)', 1) and the GpuOverrides output regexp_extract(a#28, B=(.*?)(:|$), 1) are different

Good catch, fixed. I had copy/pasted the wrong line.

NVnavkumar commented 3 months ago

This will involve incorporating the existing $ transpilation into the choice expression.
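
As background on why '$' needs special handling inside a choice: in java.util.regex, without the MULTILINE flag, '$' matches at the end of input and also just before a final line terminator. The sketch below only illustrates that CPU behavior for the pattern in this issue. It does not show the plugin's actual transpilation rules, and the names DollarInChoiceSemantics and firstMatch are hypothetical.

```scala
import java.util.regex.Pattern

// Hypothetical illustration of java.util.regex behavior; not plugin code.
object DollarInChoiceSemantics {
  // Returns group 1 and the end offset of the first match, if any.
  def firstMatch(input: String, regex: String): Option[(String, Int)] = {
    val m = Pattern.compile(regex).matcher(input)
    if (m.find()) Some((m.group(1), m.end())) else None
  }

  def main(args: Array[String]): Unit = {
    val regex = "B=(.*?)(:|$)"
    // The ':' branch fails, so the '$' branch matches at end of input.
    println(firstMatch("B=A", regex))
    // The '$' branch matches before the trailing newline; the match ends at offset 3.
    println(firstMatch("B=A\n", regex))
    // The newline is not final here and '.' does not cross it, so nothing matches.
    println(firstMatch("B=A\nC", regex))
  }
}
```

Any rewrite of the '$' branch for the GPU would have to preserve these three outcomes.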

NVnavkumar commented 3 months ago

Might require the implementation of https://github.com/rapidsai/cudf/issues/15746