apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Add possibility to extract spans/byte offsets directly for `compute.extract_regex` #44615

Open rtbs-dev opened 5 days ago

rtbs-dev commented 5 days ago

Describe the enhancement requested

I'm coming from AwkwardArray and Polars, and I'm trying to vectorize the equivalent of finding the byte offsets (or character spans) of all regex matches in an array of strings.

See this discussion for a solution to this request in re2 directly. Per that solution, the information seems to be contained in the re2::StringPiece data, which this thread indicated is preferable anyway because it avoids duplicating memory. I see something vaguely related brought up here, where string_view was vendored instead, though I don't see a way to access the view objects right now via the results of extract_regex.

I do see that the value being returned is a struct, not a plain string, but adding span locations to it might break downstream users' type definitions or API contracts. Maybe the new behavior could be added as an additional option? Alternatively, a new function, extract_regex_spans, would already make my life much easier, even if downstream libraries like Polars and AwkwardArray have to add new wrapper APIs to support the behavior.

Am I missing something obvious? Most importantly, I want to avoid looping twice over every string (first to find the matched string and then to find the location of that match), because that feels wasteful when the matches are discovered via their offset locations in the first place, right?
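
To make the request concrete, here is a rough sketch of the behavior I have in mind, written with Python's standard-library `re` rather than re2 or Arrow. The function name `extract_regex_spans` and its return shape are hypothetical, not an existing Arrow API. It matches against the UTF-8 bytes so that the spans are byte offsets, and it assumes an ASCII-only pattern, since encoding a non-ASCII pattern to bytes would change its meaning.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) byte offsets, end exclusive


def extract_regex_spans(
    strings: List[Optional[str]], pattern: str
) -> List[Optional[List[Span]]]:
    """Hypothetical helper: byte spans of every match, one list per input."""
    # Compile against bytes so reported offsets are UTF-8 byte offsets.
    # Assumption: the pattern itself is ASCII-only.
    compiled = re.compile(pattern.encode("utf-8"))
    out: List[Optional[List[Span]]] = []
    for s in strings:
        if s is None:
            out.append(None)  # pass nulls through unchanged
            continue
        data = s.encode("utf-8")
        # Each match already carries its span, so one pass is enough;
        # no second search is needed to locate the matched text.
        out.append([m.span() for m in compiled.finditer(data)])
    return out


spans = extract_regex_spans(["a1 b22", None, "héllo 333"], r"\d+")
print(spans)  # [[(1, 2), (4, 6)], None, [(7, 10)]]
```

In the last string the match starts at byte 7 rather than character 6, because "é" takes two bytes in UTF-8. That difference is why I'd want the kernel to be explicit about whether it reports byte offsets or character spans.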

Thanks!

Component(s)

C++

rtbs-dev commented 5 days ago

Linking to an issue I opened with arrow-rs, which uses a Rust regex implementation iirc, back when I was only considering Polars use cases. But I'm now not sure how I would implement this in the re2 case, either.

https://github.com/apache/arrow-rs/issues/5966#issuecomment-2196749863