Open viadea opened 1 month ago
@viadea the error message is coming from
Which is essentially saying that an expression cannot start with a `*`, `+`, or `?` character. That check appears to be totally valid, except when the character is at the start of a group. Our group parsing code appears to only support non-capture groups.
That also appears to be all that cuDF supports: https://docs.rapids.ai/api/cudf/stable/libcudf_docs/md_regex/#groups
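To illustrate why a `(?` prefix can trip a "cannot start with quantifier" check, here is a minimal sketch. It is a hypothetical, simplified model and not the plugin's actual parser: the class name `GroupPrefixSketch` and the method `checkGroupStart` are made up for this example, and the only assumption is that `(?:` is the one `(?` form the group parser recognizes.

```java
/** Hypothetical, simplified model of a group-prefix check; NOT the real parser. */
final class GroupPrefixSketch {
    // Characters treated as quantifiers by the check described above.
    private static final String QUANTIFIERS = "*+?";

    /**
     * Inspects the group opening at index `open` of `pattern`.
     * Returns null when the group start is accepted, otherwise an error message.
     */
    static String checkGroupStart(String pattern, int open) {
        if (open < 0 || open >= pattern.length() || pattern.charAt(open) != '(') {
            throw new IllegalArgumentException("no '(' at index " + open);
        }
        // Assumption: "(?:" (non-capture group) is the only "(?" form recognized.
        if (pattern.startsWith("(?:", open)) {
            return null;
        }
        int next = open + 1;
        // Any other '?' right after '(' is seen as a quantifier with nothing to quantify.
        if (next < pattern.length() && QUANTIFIERS.indexOf(pattern.charAt(next)) >= 0) {
            return "Base expression cannot start with quantifier near index " + next;
        }
        return null;
    }

    public static void main(String[] args) {
        for (String p : new String[] {"(1)", "(?:1)", "(?<foo>1)", "(?=1)"}) {
            System.out.println(p + " -> " + checkGroupStart(p, 0));
        }
    }
}
```

In this sketch the `?` of a construct such as `(?<foo>` is never read as part of a group prefix, so it falls through to the quantifier check. Supporting more constructs would mean recognizing each prefix before that check runs.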
But Java patterns (https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) appear to support many other types of group constructs, each of which results in this error:
spark.range(10).selectExpr("split(id, '(?<foo>1)')").show()
spark.range(10).selectExpr("split(id, '(?i:1)')").show()
spark.range(10).selectExpr("split(id, '(?=1)')").show()
spark.range(10).selectExpr("split(id, '(?!1)')").show()
spark.range(10).selectExpr("split(id, '(?<=1)')").show()
spark.range(10).selectExpr("split(id, '(?<!1)')").show()
spark.range(10).selectExpr("split(id, '(?>1)')").show()
As each of these is rather complex to test and implement, could you clarify which of them are needed?
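For reference, a quick way to confirm on the CPU side that `java.util.regex.Pattern` accepts every construct listed above, without needing a Spark session. This is only a sketch: the class name `JavaGroupConstructs` and the helper `compilesOnCpu` are made up for this example, and it only checks that the pattern compiles, not how `split` behaves with it.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Checks which group constructs java.util.regex.Pattern accepts. */
final class JavaGroupConstructs {
    // The same patterns used in the split() examples above.
    static final String[] REGEXES = {
        "(?<foo>1)", // named capturing group
        "(?i:1)",    // non-capturing group with inline flag
        "(?=1)",     // positive lookahead
        "(?!1)",     // negative lookahead
        "(?<=1)",    // positive lookbehind
        "(?<!1)",    // negative lookbehind
        "(?>1)"      // independent (atomic) group
    };

    /** Returns true if the regex compiles with java.util.regex.Pattern. */
    static boolean compilesOnCpu(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String regex : REGEXES) {
            System.out.println(regex + " -> compiles: " + compilesOnCpu(regex));
        }
    }
}
```

Each entry is a distinct construct in the Java `Pattern` documentation, so each would likely need its own parsing and test coverage on the GPU side.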
I wish the split function supported patterns that currently fail with "Base expression cannot start with quantifier near index 1".
To reproduce:
The fallback error is:
Note: here `xxx` is the pattern, which I can only share internally since it comes from user code.