apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.42k stars, 3.51k forks

[R] perl operators in regular expressions #40220

Open dhicks opened 7 months ago

dhicks commented 7 months ago

Describe the enhancement requested

R 4.3, arrow 14.0.0.2 (most recent macOS binary; apologies in advance if this is already supported in source)

arrow can't handle Perl-style regular expression operators, such as negative lookaheads, at least via dplyr and stringr:

library(arrow)
library(dplyr)
library(stringr)

ar = data.frame(text = c('Lorem ipsum dolor sit amet', 
                         'Lorem dolor ipsum sit amet')) |> 
    as_arrow_table()

## Works, returns both rows
ar |> 
    filter(str_detect(text, 'Lorem [^(ipsum)]')) |> 
    collect()

## Should only return the second row
## Error in `compute.arrow_dplyr_query()`:
## ! Invalid: Invalid regular expression: invalid perl operator: (?!
ar |> 
    filter(str_detect(text, regex('Lorem(?! ipsum)'))) |> 
    collect()

Component(s)

R

assignUser commented 7 months ago

I haven't looked at the code, so this isn't a definitive answer, but I'm pretty sure that RE2, the C++ regex library used in Acero, doesn't support lookahead, so this is probably not something that can be added.
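
If lookaheads are indeed unavailable in Arrow, two workarounds may be worth trying. The sketch below reuses the example table from the report; the names `in_arrow` and `in_r` are illustrative only. It assumes that arrow's dplyr bindings translate `str_detect()` combined with `&` and `!`, and that stringr's default ICU engine handles `(?!...)` once the data is in R.

```r
library(arrow)
library(dplyr)
library(stringr)

ar <- data.frame(text = c('Lorem ipsum dolor sit amet',
                          'Lorem dolor ipsum sit amet')) |>
    as_arrow_table()

## Option 1: stay in Arrow and replace the lookahead with two
## lookahead-free conditions. Not equivalent in general: a string that
## contains 'Lorem ipsum' as well as another 'Lorem' is dropped here,
## but would match 'Lorem(?! ipsum)'.
in_arrow <- ar |>
    filter(str_detect(text, 'Lorem') & !str_detect(text, 'Lorem ipsum')) |>
    collect()

## Option 2: pull the data into R first, then filter with stringr,
## whose regex engine supports lookaheads. This materializes the whole
## table in memory before filtering.
in_r <- ar |>
    collect() |>
    filter(str_detect(text, regex('Lorem(?! ipsum)')))
```

For this two-row example both options keep only the second row. Option 1 keeps the filter inside Arrow but only approximates the lookahead; option 2 is exact but gives up Arrow's lazy evaluation for that step.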