NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FOLLOW ON] Combining regex parsing in transpiling and regex rewrite in `rlike` #10817

Closed: thirtiseven closed this issue 3 weeks ago

thirtiseven commented 1 month ago
> nit: Could we have a follow-on issue to figure out how to parse the regexp once instead of multiple times?

Originally posted by @revans2 in https://github.com/NVIDIA/spark-rapids/pull/10715#discussion_r1600186097

For regex operations like `rlike`, the input regex is first parsed into an AST, which is then transpiled into another regex string that cuDF supports. If the regex can't be transpiled, the operation falls back to the CPU. Transpiling is performed in `tagExprForGpu`.

PR #10715 introduced a regex rewrite for `rlike`. It also needs to parse the regex string into an AST, and it does so in `convertToGpu`. This parse can be combined with the one done during transpiling, which would save time and make the code cleaner.

We can refactor the transpiler code to split it into two steps: regex string to AST, and AST to new regex string. We can then move the regex rewrite into `tagExprForGpu` and save the chosen optimization type in the Meta, so that `convertToGpu` no longer needs to parse the regex itself.
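
To make the proposal concrete, here is a rough, self-contained sketch of the parse-once flow. It is not the plugin's code: every name in it (`RegexNode`, `RewriteKind`, `TaggedRegex`, `RegexPipeline`) is hypothetical, and the toy parser only understands a leading `^`, plain literals, and `.*`. The point is only the shape: one `parse` call produces the AST, and both the transpile step and the rewrite decision consume that same AST during tagging.

```scala
// Hypothetical sketch: none of these names are the plugin's actual API.
sealed trait RegexNode
case object StartAnchor extends RegexNode                 // "^" at the start
case object AnyStar extends RegexNode                     // ".*"
case class Literal(text: String) extends RegexNode
case class Sequence(parts: List[RegexNode]) extends RegexNode

sealed trait RewriteKind
case object NoRewrite extends RewriteKind
case class StartsWith(prefix: String) extends RewriteKind
case class Contains(text: String) extends RewriteKind

// What tagging would keep (for example on the Meta) for the convert step.
case class TaggedRegex(transpiled: Option[String], rewrite: RewriteKind)

object RegexPipeline {
  private val metaChars: Set[Char] = "\\[](){}|+?^$*.".toSet
  private def isPlain(s: String): Boolean = !s.exists(metaChars.contains)

  // Step 1: regex string -> AST (toy grammar only).
  def parse(pattern: String): RegexNode = {
    val parts = scala.collection.mutable.ListBuffer.empty[RegexNode]
    val literal = new StringBuilder
    def flush(): Unit =
      if (literal.nonEmpty) { parts += Literal(literal.toString); literal.clear() }
    var i = 0
    while (i < pattern.length) {
      if (i == 0 && pattern.charAt(0) == '^') { parts += StartAnchor; i += 1 }
      else if (pattern.startsWith(".*", i)) { flush(); parts += AnyStar; i += 2 }
      else { literal.append(pattern.charAt(i)); i += 1 }
    }
    flush()
    Sequence(parts.toList)
  }

  // Step 2: AST -> new regex string; None means "fall back to the CPU".
  def transpile(ast: RegexNode): Option[String] = ast match {
    case StartAnchor => Some("^")
    case AnyStar => Some(".*")
    case Literal(t) => if (isPlain(t)) Some(t) else None // toy: reject metachars
    case Sequence(ps) =>
      val out = ps.map(transpile)
      if (out.forall(_.isDefined)) Some(out.flatten.mkString) else None
  }

  // Rewrite decision on the same AST. rlike has "find" semantics, so an
  // unanchored plain literal behaves like a substring search.
  def chooseRewrite(ast: RegexNode): RewriteKind = ast match {
    case Sequence(StartAnchor :: Literal(p) :: rest)
        if isPlain(p) && rest.forall(_ == AnyStar) => StartsWith(p)
    case Sequence(ps) =>
      ps.filterNot(_ == AnyStar) match {
        case List(Literal(t)) if isPlain(t) => Contains(t)
        case _ => NoRewrite
      }
    case _ => NoRewrite
  }

  // Tagging: the pattern is parsed exactly once and both results are stored.
  def tag(pattern: String): TaggedRegex = {
    val ast = parse(pattern)
    TaggedRegex(transpile(ast), chooseRewrite(ast))
  }
}
```

With something like this, the convert step would only need to read the stored `TaggedRegex` (the rewrite kind if one was chosen, otherwise the transpiled string) rather than re-parsing the pattern. For instance, `RegexPipeline.tag("^abc.*")` yields a `StartsWith("abc")` rewrite alongside the transpiled string.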