apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] stringr binding for `str_starts` fails based on syntax #43336

Open TPDeramus opened 3 months ago

TPDeramus commented 3 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Hi Arrow Devs.

I noticed some odd behavior with the str_starts() binding in the arrow R package.

Suppose you make an Arrow table:

library(arrow)    # as_arrow_table()
library(dplyr)    # filter()
library(stringr)  # str_starts()

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21)) |> as_arrow_table()

Participant Category Rating
Greg        F0       21
Greg        C0.1     NA
Donna       1        17
Donna       01       21

and you want to keep only the rows whose Category starts with any of several specific strings, like so:

filterlist <- c("F", "C", "1")

Participant Category Rating
Greg        F0       21
Greg        C0.1     NA
Donna       1        17

If I run a call like this one, it either fails or pulls the data into R:

df |>
  filter(str_starts(Category, paste(filterlist, collapse = "|")))
Warning: Expression str_starts(Category, paste(filterlist, collapse = "|")) not supported in Arrow; pulling data into R

But these two will run just fine and produce the desired output:

df |>
  filter(str_starts(Category, "F|C|1"))

df |>
  filter(str_starts(Category, filtervar))
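
For reference, here is a minimal base-R sketch of building the pattern before the filter() call, so that only a plain string is passed to str_starts(). It assumes filtervar was created by collapsing filterlist with paste(); the report does not show its definition, so treat that line as a guess. The grepl() check is only an illustration of which Category values the pattern should keep, with the start-of-string anchor written out by hand. It is not the Arrow binding itself.

```r
# Prefixes to match, as in the report above.
filterlist <- c("F", "C", "1")

# Collapse the prefixes into one regex alternation ahead of time.
# Assumption: this is how `filtervar` was defined; the report does not show it.
filtervar <- paste(filterlist, collapse = "|")

# Base-R check of which Category values start with one of the prefixes.
# The "^( ... )" anchoring is added by hand for this illustration.
category <- c("F0", "C0.1", "1", "01")
keep <- grepl(paste0("^(", filtervar, ")"), category)

category[keep]
```

The prefixes here are plain letters and digits. A prefix containing regex metacharacters such as "." or "+" would need escaping before being collapsed into the pattern.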

Is this a bug of some kind?

Component(s)

R

thisisnic commented 3 months ago

Thanks for reporting this @TPDeramus, and glad you've got a workaround there! This might be a gap in our implementation, where we may be failing to evaluate part of the filter condition.