NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
728 stars 216 forks source link

[FEA] Rewrite some regular expressions `RLIKE` cases to faster expressions #10741

Open thirtiseven opened 3 weeks ago

thirtiseven commented 3 weeks ago

Is your feature request related to a problem? Please describe.

RLIKE, REGEXP or REGEXP_LIKE are widely used but very expensive, and many use cases are quite simple and similar, like ^abc(.*), (.*)abc$, pattern, and ^pattern$.

These pattern cases can be replaced with faster expressions like GpuStartsWith, GpuEndsWith or GpuContains when overriding.

Some commonly used patterns are even worth to writing a custom kernel to match, such as pattern[0-9]{3,4} (some digits followed by a string) and [\u4e00-\u9fa5]+ (any Chinese character).

Describe the solution you'd like We have a regex parser in plugin code here to translate a regex pattern to cudf supported style and check fallback. We can reuse it if possible to match if it is a simple pattern that can be replaced, and replace that case to the faster expressions.

Here is a list of planned/possible tasks:

revans2 commented 3 weeks ago

Actually pattern[0-9]{3,4} is a string followed by 3 or 4 digits. I don't know if we want to write a custom kernel for that just yet. I don't know how common something like that is, and I don't want to write a custom kernel for a single operation in a single customer query.

I would rather see us write a custom kernel for [A-B]{X,Y} where A and B are any character value and X and Y are any integer value, including null which would indicate unbounded. I see this as much more common/generic. That would let us match any Chinese character [\u4e00-\u9fa5]{1,} ({1,} is the same as +, one or more). It would also let us do things like 0 or more digits [0-9]{0,} ({0,} is the same as *, zero or more). Or any printable ASCII character. Or really any possible range of values.

thirtiseven commented 3 weeks ago

Actually pattern[0-9]{3,4} is a string followed by 3 or 4 digits. I don't know if we want to write a custom kernel for that just yet. I don't know how common something like that is, and I don't want to write a custom kernel for a single operation in a single customer query.

pattern[0-9]{3,4} is a misleading example, I am working on a kernel that will match pattern[0-9]{X, Y} (so it will also match [0-9]{X, Y} by setting pattern = "").

I would rather see us write a custom kernel for [A-B]{X,Y} where A and B are any character value and X and Y are any integer value, including null which would indicate unbounded. I see this as much more common/generic. That would let us match any Chinese character [\u4e00-\u9fa5]{1,} ({1,} is the same as +, one or more). It would also let us do things like 0 or more digits [0-9]{0,} ({0,} is the same as *, zero or more). Or any printable ASCII character. Or really any possible range of values.

Yes, this kernel can be more general so these two cases can be combined. Thanks for catching this!

revans2 commented 2 weeks ago

I still don't see a lot of generality for STATIC_STRING[RANGE]{X, Y}.