Open thirtiseven opened 3 weeks ago
Actually pattern[0-9]{3,4}
is a string followed by 3 or 4 digits. I don't know if we want to write a custom kernel for that just yet. I don't know how common something like that is, and I don't want to write a custom kernel for a single operation in a single customer query.
I would rather see us write a custom kernel for [A-B]{X,Y}
where A
and B
are any character value and X
and Y
are any integer value, including null
which would indicate unbounded. I see this as much more common/generic. That would let us match any Chinese character [\u4e00-\u9fa5]{1,}
({1,}
is the same as +
, one or more). It would also let us do things like 0 or more digits [0-9]{0,}
({0,}
is the same as *
, zero or more). Or any printable ASCII character. Or really any possible range of values.
Actually
pattern[0-9]{3,4}
is a string followed by 3 or 4 digits. I don't know if we want to write a custom kernel for that just yet. I don't know how common something like that is, and I don't want to write a custom kernel for a single operation in a single customer query.
pattern[0-9]{3,4}
is a misleading example, I am working on a kernel that will match pattern[0-9]{X, Y}
(so it will also match [0-9]{X, Y}
by setting pattern = "").
I would rather see us write a custom kernel for
[A-B]{X,Y}
whereA
andB
are any character value andX
andY
are any integer value, includingnull
which would indicate unbounded. I see this as much more common/generic. That would let us match any Chinese character[\u4e00-\u9fa5]{1,}
({1,}
is the same as+
, one or more). It would also let us do things like 0 or more digits[0-9]{0,}
({0,}
is the same as*
, zero or more). Or any printable ASCII character. Or really any possible range of values.
Yes, this kernel can be more general so these two cases can be combined. Thanks for catching this!
I still don't see a lot of generality for STATIC_STRING[RANGE]{X, Y}
.
Is your feature request related to a problem? Please describe.
RLIKE
,REGEXP
orREGEXP_LIKE
are widely used but very expensive, and many use cases are quite simple and similar, like^abc(.*)
,(.*)abc$
,pattern
, and^pattern$
.These pattern cases can be replaced with faster expressions like
GpuStartsWith
,GpuEndsWith
orGpuContains
when overriding.Some commonly used patterns are even worth to writing a custom kernel to match, such as
pattern[0-9]{3,4}
(some digits followed by a string) and[\u4e00-\u9fa5]+
(any Chinese character).Describe the solution you'd like We have a regex parser in plugin code here to translate a regex pattern to cudf supported style and check fallback. We can reuse it if possible to match if it is a simple pattern that can be replaced, and replace that case to the faster expressions.
Here is a list of planned/possible tasks: