apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
access `regex::Match` or `Captures` struct impl results (e.g. range, start, end, etc) #5966

Open rtbs-dev opened 3 months ago

rtbs-dev commented 3 months ago

What are you trying to do?

I would like to extract `regex::Captures` structs directly, rather than the already-unwrapped string values, because I need the byte offsets themselves (e.g. to implement ISO 24612, which requires primacy of string span locations in a document, not the contents themselves).

Describe the solution you'd like

Either a new function to extract the Captures structs directly, or a mode for compute::regexp_match that provides the offset anchors for each match.

Describe alternatives you've considered

Retrieving the strings and then trying to find their locations one-by-one is wasteful of resources, and I can't find a flag to enable the desired behavior :)

Additional context

Coming here from polars#16341, but if I'm understanding their codebase correctly, they use this backend as an intermediary to the Rust `regex` library.

tustvold commented 3 months ago

I don't believe polars makes use of arrow-rs; it has its own implementation.

That being said, if you want to do advanced regex work, have you considered just iterating the arrays manually and applying the regex?

Edit: I've filed https://github.com/apache/arrow-rs/issues/5991 to track this further