Open rtbs-dev opened 3 weeks ago
Linking to an issue I opened with arrow-rs (which uses a Rust regex implementation, iirc) back when I was only considering Polars use cases. I'm now not sure how I would implement this in the RE2 case either.
https://github.com/apache/arrow-rs/issues/5966#issuecomment-2196749863
Describe the enhancement requested
I'm coming from AwkwardArray and Polars, trying to vectorize the equivalent of finding the byte offsets (or character spans) of all regex matches in an array of strings.
See this discussion for the request's solution in RE2 directly. Per the solution there, it seems the information would be contained in the `re2::StringPiece` data, which this thread indicated is preferable anyway because it avoids memory duplication. I see something vaguely related brought up here, where `string_view` was vendored instead, though I don't see a way to access the view objects right now via the results of `extract_regex`.
I do see that the struct getting returned is not a plain string, but adding span locations might mess with downstream users' type definitions or API contracts. Maybe the new behavior could be added as an additional option? Alternatively, a new function `extract_regex_spans` would already make my life much easier, even if downstream libraries like Polars and AwkwardArray have to add new wrapper APIs for their code to support the behavior.
Am I missing something obvious? Most importantly, I want to avoid having to loop twice over every string (first to find the match, then to find the location of that match), because that feels wasteful when the matches are discovered via their offset locations in the first place, right?
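To make the request concrete, here is a rough sketch of the output shape I have in mind. It is only an illustration under some assumptions: it uses `std::regex` from the standard library rather than RE2, it works on a plain `std::vector<std::string>` rather than an Arrow array, and the names `extract_regex_spans` and `Span` are hypothetical (nothing like this exists in Arrow today, as far as I can tell).

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type: a half-open [start, end) byte range within one string.
using Span = std::pair<std::size_t, std::size_t>;

// Hypothetical helper named after the function proposed in this issue.
// For each input string, collects the byte offsets of every regex match.
// Uses std::regex for illustration only; this is not RE2 or Arrow code.
std::vector<std::vector<Span>> extract_regex_spans(
    const std::vector<std::string>& strings, const std::regex& pattern) {
  std::vector<std::vector<Span>> out;
  out.reserve(strings.size());
  for (const std::string& s : strings) {
    std::vector<Span> spans;
    // Single pass per string: each match already carries its own position,
    // so no second search is needed to locate it.
    for (std::sregex_iterator it(s.begin(), s.end(), pattern), end; it != end;
         ++it) {
      const auto start = static_cast<std::size_t>(it->position(0));
      const auto length = static_cast<std::size_t>(it->length(0));
      spans.emplace_back(start, start + length);
    }
    out.push_back(std::move(spans));
  }
  return out;
}
```

With the pattern `[0-9]+` and the input `{"ab12cd345", "none", "7"}`, this yields the spans `(2, 4)` and `(6, 9)` for the first string, no spans for the second, and `(0, 1)` for the third. In the RE2 case I would guess the same offsets could be derived from the matched `re2::StringPiece` by subtracting the input's data pointer from the match's data pointer, but I have not verified that against the Arrow kernel.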
Thanks!
Component(s)
C++