apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Add additional regexp functions #11946

Open timsaucer opened 1 month ago

timsaucer commented 1 month ago

Is your feature request related to a problem or challenge?

I would like to see the following regexp functions implemented. These exist in some, but not all, versions of PostgreSQL.

Describe the solution you'd like

Implement these functions.

Describe alternatives you've considered

These operations can be performed using the existing functions, so I am unblocked for my immediate use case, but having these functions built in would be convenient.

Additional context

We currently have the following regexp functions implemented. The source is in datafusion/functions/src/regex/mod.rs.

  - regexp_like()
  - regexp_match()
  - regexp_replace()

xinlifoobar commented 3 weeks ago

I could work on this. The only concern is whether we implement the regexp function in this project or in arrow-rs.

Hey @alamb, would you prefer that we implement the functions in arrow-rs directly, or should we put them in datafusion and port them later?

alamb commented 3 weeks ago

Thanks @xinlifoobar

I would personally recommend we start implementing them in datafusion, as that will avoid the need to wait for coordinated releases of arrow-rs, and then port them back upstream to arrow-rs as a follow-on step.

alamb commented 3 weeks ago

@xinlifoobar I suspect there will be several other contributors interested in helping out and learning during the process. If we have a good example to follow, I think the work would be straightforward to scale.

One way to do this might be:

  1. You implement one of these functions in a PR, along with good docs, tests, etc.
  2. Then we can file additional tickets for the other functions, linking to your first implementation

nrc commented 3 weeks ago

Related to this, substring in Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html). Would it be reasonable for DataFusion to support it as well?

The argument types that DataFusion's substring currently accepts are:

    Exact(vec![Utf8, Int64]),
    Exact(vec![LargeUtf8, Int64]),
    Exact(vec![Utf8, Int64, Int64]),
    Exact(vec![LargeUtf8, Int64, Int64]),
    Exact(vec![Utf8View, Int64]),
    Exact(vec![Utf8View, Int64, Int64]),

Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.
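
To make the "no conflict" reasoning concrete, here is a minimal sketch of the overlap check. It is not DataFusion code: `DataType`, `existing_signatures`, `proposed_regex_signatures`, and `has_conflict` are hypothetical stand-ins, the proposed string-only signatures are an assumption about what a regex variant might accept, and implicit type coercion is ignored.

```rust
/// Hypothetical, simplified stand-in for the Arrow data types named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
    Int64,
}

/// The six exact signatures listed above for the existing substring function.
pub fn existing_signatures() -> Vec<Vec<DataType>> {
    vec![
        vec![DataType::Utf8, DataType::Int64],
        vec![DataType::LargeUtf8, DataType::Int64],
        vec![DataType::Utf8, DataType::Int64, DataType::Int64],
        vec![DataType::LargeUtf8, DataType::Int64, DataType::Int64],
        vec![DataType::Utf8View, DataType::Int64],
        vec![DataType::Utf8View, DataType::Int64, DataType::Int64],
    ]
}

/// Assumed signatures for a regex variant: (string, pattern) and
/// (string, pattern, escape). These are illustrative, not an agreed design.
pub fn proposed_regex_signatures() -> Vec<Vec<DataType>> {
    vec![
        vec![DataType::Utf8, DataType::Utf8],
        vec![DataType::Utf8, DataType::Utf8, DataType::Utf8],
    ]
}

/// Returns true if some exact signature appears in both lists, which would
/// make a call with those argument types ambiguous.
pub fn has_conflict(a: &[Vec<DataType>], b: &[Vec<DataType>]) -> bool {
    a.iter().any(|sig| b.contains(sig))
}
```

Every existing signature has Int64 in the second position, while the assumed regex signatures have a string type there, so `has_conflict(&existing_signatures(), &proposed_regex_signatures())` is false under these assumptions.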

Omega359 commented 3 weeks ago

Spark's version of this is https://spark.apache.org/docs/latest/api/sql/#regexp_substr

timsaucer commented 3 weeks ago

Based on prior conversations, it sounds like the group is most interested in making sure we support PostgreSQL, so I think adding this is a very good idea. We can also have regexp_substr as an alias.
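
As a rough illustration of the alias idea, the sketch below resolves several names to one primary function. `Registry`, `register`, and `resolve` are hypothetical names invented for this example and are not DataFusion's function registry; treating `substring` as the primary name is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical minimal registry used only to illustrate aliasing.
pub struct Registry {
    // Maps every registered name (primary or alias) to the primary name.
    names: HashMap<String, String>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { names: HashMap::new() }
    }

    /// Registers a function under its primary name and any aliases.
    pub fn register(&mut self, primary: &str, aliases: &[&str]) {
        self.names.insert(primary.to_string(), primary.to_string());
        for alias in aliases {
            self.names.insert(alias.to_string(), primary.to_string());
        }
    }

    /// Looks up a name and returns the primary name it resolves to, if any.
    pub fn resolve(&self, name: &str) -> Option<&str> {
        self.names.get(name).map(|s| s.as_str())
    }
}
```

With `register("substring", &["regexp_substr"])`, both `resolve("substring")` and `resolve("regexp_substr")` return the same primary name, so a Spark-style call and a Postgres-style call would reach one implementation.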