apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.5k stars 1.02k forks source link

Do we need to escape search string as it's used in regexp? Wondering what's the result of `contains("abcdefg", ".*")` #10929

Open alamb opened 2 weeks ago

alamb commented 2 weeks ago
          Do we need to escape search string as it's used in regexp? Wondering what's the result of `contains("abcdefg", ".*")`

_Originally posted by @waynexia in https://github.com/apache/datafusion/pull/10879#discussion_r1635767599_

alamb commented 2 weeks ago

cc @Lordworms

Lordworms commented 2 weeks ago

Sorry for the late review since I was busy this week. In the beginning, I was just trying to keep the same format as other ScalarUDF which utilize arrow-rs methods to implement functionality so I just chose arrow::reglike. I can fix it to use str.contains

alamb commented 2 weeks ago

I was just trying to keep the same format as other ScalarUDF which utilize arrow-rs methods to implement functionality so I just chose arrow::reglike.

Makes sense. I think in this case, however, the function shouldn't actually have regexp support, so it would be better to use str.contains

I can fix it to use str.contains

Thank you!

waynexia commented 2 weeks ago

Thanks @Lordworms. I'm thinking about @alamb's idea https://github.com/apache/datafusion/pull/10879#discussion_r1636789004, which only implements contains as a placeholder for translating/planning. And it would finally become some other thing like LIKE after the optimization phase. A similar thing is Expr::Wildcard, it's to reflect * from SQL but doesn't have a corresponding physical expr.

alamb commented 2 weeks ago

One benefit of using LIKE is that it already has a highly optimized arrow implementation (as in will actually use substring if the patttern looks like substr% etc).

Lordworms commented 1 week ago

I'll refactor it to use LIKE