NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Simplify Transpilation of $ with Extended Line Separator Support in cuDF Regex #11663

Closed SurajAralihalli closed 3 weeks ago

SurajAralihalli commented 3 weeks ago

Resolves #11554, #7585

In cuDF, support for newline characters was expanded from NEW_LINE (\n) alone to also include \r, \u0085, \u2028, and \u2029, the separators handled explicitly in the old pattern below.

PR #17139 introduced this change to cuDF JNI with RegexFlag::EXT_LINE. This PR simplifies the transpilation of $ by changing the pattern from (?:\r|\u0085|\u2028|\u2029|\r\n)?$ to the simpler (?:\r\n)?$, and it updates every function where this transpilation occurs to use RegexFlag::EXT_LINE.
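For context, here is a minimal sketch of the CPU-side behavior that the transpiled pattern has to reproduce. It uses only java.util.regex, which is what Spark evaluates on the CPU; it does not call cuDF or the plugin's transpiler. The class and method names (DollarAnchorDemo, found) are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

// Illustration only: shows how `$` behaves in java.util.regex with default flags.
// DollarAnchorDemo and found() are hypothetical names, not part of spark-rapids.
final class DollarAnchorDemo {
    // Returns true if `regex` matches somewhere in `input` (no MULTILINE, no UNIX_LINES).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Every line terminator that Java recognizes by default.
        String[] terminators = {"\n", "\r", "\r\n", "\u0085", "\u2028", "\u2029"};
        for (String t : terminators) {
            // `$` matches before a single trailing terminator, so each line prints true.
            System.out.println(found("a$", "a" + t));
        }
        // Two trailing terminators: `$` no longer matches right after 'a', prints false.
        System.out.println(found("a$", "a\n\n"));
    }
}
```

Running main prints true six times and then false. Note that \r\n counts as one terminator here, which is the two-character case that the retained (?:\r\n)? prefix still handles explicitly.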

This PR also drops support for $\z because \z is not supported by cuDF. Alternatively, we could transpile $\z to $(?![\r\n\u0085\u2028\u2029]). However, cuDF doesn't support negative lookahead.
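To make the $\z case concrete, the following sketch compares $\z with the lookahead rewrite mentioned above, again using only java.util.regex on the CPU. It says nothing about cuDF behavior. The class and method names (EndOfInputDemo, found) are hypothetical, and the sample inputs are chosen for illustration rather than taken from the PR's tests.

```java
import java.util.regex.Pattern;

// Illustration only: CPU-side (java.util.regex) comparison of `$\z` and a
// negative-lookahead rewrite. EndOfInputDemo and found() are hypothetical names.
final class EndOfInputDemo {
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String withZ = "a$\\z";
        String withLookahead = "a$(?![\\r\\n\\u0085\\u2028\\u2029])";
        String[] inputs = {"a", "a\n", "a\r\n", "a\u2028"};
        for (String in : inputs) {
            // Both patterns only accept 'a' at the very end of the input,
            // so the two columns agree: true for "a", false for the rest.
            System.out.println(found(withZ, in) + " " + found(withLookahead, in));
        }
        // By contrast, a plain `$` still matches before one trailing terminator.
        System.out.println(found("a$", "a\n"));
    }
}
```

On these inputs the two patterns agree (true for "a", false otherwise), while plain $ accepts "a\n". This is why $\z cannot simply be treated as $ on the GPU.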

This PR also drops support for regex patterns in which the end-of-line anchors $ and \Z are followed by an escape sequence such as \W, \B, or \b, because they produce different results on the CPU and the GPU.

NVnavkumar commented 3 weeks ago

Can we confirm some of the behavior described in compatibility.md and update accordingly?

SurajAralihalli commented 3 weeks ago

> Can we confirm some of the behavior described in compatibility.md and update accordingly?

Thank you for pointing that out. I found another issue that is resolved by this PR, and I've updated the guide and tests to reflect this. As part of the process, we also reviewed the feasibility of solving https://github.com/NVIDIA/spark-rapids/issues/10641 and https://github.com/NVIDIA/spark-rapids/issues/10764 in this PR, and updated those issues with the status.

SurajAralihalli commented 3 weeks ago

Build

SurajAralihalli commented 3 weeks ago

Build