apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.01k forks source link

Is pre-compile pattern string in regexp_match operation #11146

Open zhuliquan opened 4 days ago

zhuliquan commented 4 days ago

Is your feature request related to a problem or challenge?

I noticed below code: https://github.com/apache/datafusion/blob/6a4a280e3cf70fe5f1a1cfe7c2de13e4c39f89bb/datafusion/physical-expr/src/expressions/binary.rs#L537-L564 This looks like every time record_batch is evaluated, it will execute the compiled pattern string and use the compiled results to match arrow-array

Describe the solution you'd like

when building binary physical expr , we can pre-compile pattern string if op is regex_match

Describe alternatives you've considered

No response

Additional context

No response

alamb commented 3 days ago

Hi @zhuliquan -- I believe you are correct that the regexp is re-compiled for each batch. This is mentioned as one of the use cases for https://github.com/apache/datafusion/issues/8051

I think it would be a nice improvement to look into pre-compiling the regexp somehow, though last time I checked the only benchmark we have (one of the clickbench queries) compiling the regexp was not a significant consumer of time

zhuliquan commented 3 days ago

Hi @zhuliquan -- I believe you are correct that the regexp is re-compiled for each batch. This is mentioned as one of the use cases for #8051

I think it would be a nice improvement to look into pre-compiling the regexp somehow, though last time I checked the only benchmark we have (one of the clickbench queries) compiling the regexp was not a significant consumer of time

I think should pre-compiled when create binary-physical-expr, instead of evaluating record_batch. https://github.com/apache/datafusion/blob/57280e42dc2391ab65c24c0fb52032942d3d85a8/datafusion/physical-expr/src/expressions/binary.rs#L60-L66

alamb commented 3 days ago

I think should pre-compiled when create binary-physical-expr, instead of evaluating record_batch.

I think this makes sense to me. Thank you

One thing to look into might be to replace the regexp match operator with a function (and have the function's implementation store the precompiled regexp)

One tricky part might be serialization (in datafusion-proto). We can probably handle it via an extension codec or something.