Closed SurajAralihalli closed 3 weeks ago
Can we confirm some of the behavior described in compatibility.md and update accordingly?
Thank you for pointing it out. While looking into it, I found another issue that is resolved by this PR, and I've updated the guide and tests to reflect this. As part of the process, we also reviewed the feasibility of solving https://github.com/NVIDIA/spark-rapids/issues/10641 and https://github.com/NVIDIA/spark-rapids/issues/10764 in this PR, and updated those issues with the status.
Resolves #11554, #7585
In cuDF, support for multiple newline characters was expanded from NEW_LINE (`\n`) to include the following:

- `\u0085`
- `\u2028`
- `\u2029`
- `\r`
- `\n`

PR #17139 introduced this change to cuDF JNI with `RegexFlag::EXT_LINE`. This PR simplifies the transpilation of `$` by changing the pattern from `(?:\r|\u0085|\u2028|\u2029|\r\n)?$` to the simpler `(?:\r\n)?$` and updates all functions to use `RegexFlag::EXT_LINE` wherever this transpilation occurs.

This PR also drops support for `$\z` because `\z` is not supported by cuDF. Alternatively, we could transpile `$\z` to `$(?![\r\n\u0085\u2028\u2029])`. However, cuDF doesn't support negative lookahead.

This PR also drops support for regex patterns with end-of-line anchors `$` and `\Z` when followed by any escape sequences like `\W`, `\B`, `\b`, etc., as they produce different results on CPU and GPU.
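
For context on the CPU behavior the transpiled patterns need to match, here is a minimal Java sketch of how `java.util.regex` treats `$` and `\z` in its default mode (no `MULTILINE`, no `UNIX_LINES`). The class and method names (`DollarAnchorDemo`, `dollarMatches`, `absoluteEndMatches`, `visible`) are hypothetical and exist only for illustration; the sketch does not call cuDF or the plugin's transpiler, so it shows only the CPU side of the comparison.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of spark-rapids or cuDF.
class DollarAnchorDemo {
    // Default mode: `$` matches at end of input, or just before one final line terminator.
    private static final Pattern A_DOLLAR = Pattern.compile("a$");
    // `\z` matches only at the absolute end of input.
    private static final Pattern A_ABSOLUTE_END = Pattern.compile("a\\z");

    static boolean dollarMatches(String input) {
        return A_DOLLAR.matcher(input).find();
    }

    static boolean absoluteEndMatches(String input) {
        return A_ABSOLUTE_END.matcher(input).find();
    }

    // Render control and non-ASCII characters as escapes so the output is readable.
    static String visible(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7e) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "a", "a\n", "a\r", "a\r\n", "a\u0085", "a\u2028", "a\u2029", "a\n\n"
        };
        for (String s : inputs) {
            System.out.println(visible(s)
                + "  $: " + dollarMatches(s)
                + "  \\z: " + absoluteEndMatches(s));
        }
    }
}
```

On the CPU, `a$` matches when the input ends with any single one of the listed terminators (including the two-character `\r\n`), but not when two terminators follow, as in `a\n\n`. `a\z` matches only the bare `a`, which is why `$\z` cannot be expressed with the simplified `$` transpilation alone.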