apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.69k stars 3.56k forks source link

[R] Implement anonymous functions in calls to dplyr::across #43207

Open TPDeramus opened 4 months ago

TPDeramus commented 4 months ago

Describe the usage question you have. Please include as many useful details as possible.

Hi Arrow devs.

I wanted to ask about something I noticed about using the column-wise operators with dplyr in arrow tables.

If I had an arrow table, and I wanted to run a basic function such as mean, max, or min using summarize, it appears that arrow does not currently accept the na.rm = TRUE argument, or that if it does, I can't seem to find it in the documentation.

Say I took the original dataset:

Participant Rating
Donna 17
Donna NA
Greg 21
Greg NA

If these were generic R dataframes, either of these two calls would work (though one is deprecated):

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()
Producing: Participant Rating
Donna 17
Greg 21

However, when I run the same commands as an arrow table, both throw errors:

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), \(x) max(x, na.rm = TRUE))) |>
  as.data.frame()

Error in `across_setup()`:
! Anonymous functions are not yet supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max, na.rm = TRUE)) |>
  as.data.frame()

Error in `expand_across()`:
! `...` argument to `across()` is deprecated in dplyr and not supported in Arrow
Run `rlang::last_trace()` to see where the error occurred.

And the one that does work:

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), max)) |>
  as.data.frame()
Returns NA values that are not what I want: Participant Rating
Donna NA
Greg NA

Is there a way to pass the na.rm = TRUE argument to this call without having to manually drop the NA values for each column or row of interest I have in my data?

Component(s)

R

TPDeramus commented 4 months ago

I'm sure there's a way it could be done with arrow_max or call_function, but that is not readily apparent to me either and also keeps throwing function does not exist errors (probably due to it being nested in group_by, summarize and across).

TPDeramus commented 4 months ago

Hi Arrow Devs.

Some individuals in the Posit forums found a solution and it prompted some discussion we thought might be worth sending your way: https://forum.posit.co/t/arrow-with-tidyverse-calling-min-max-mean-with-summarize-on-arrow-tables/188985

"dplyr::across() also supports a purrr-style lambda definition, which strangely seems to work in arrow where the other methods failed."

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~max(.x, na.rm = TRUE))) |>
  as.data.frame()
##   Participant Rating
## 1        Greg     21
## 2       Donna     17

"_I'm not sure at what points the operations become outsourced to arrow methods, but I don't know whether the ~min(.x, ...) lambda notation somehow tricks dplyr into not outsourcing this operation to arrow.

With dbplyr, everything is converted to SQL queries instead and you can view the SQL query to check it. Is there an equivalent arrow command that lets you see what commands are sent to arrow?_"

Would any of you be willing to explain how this works on the backend?

Happy to pass it on.

thisisnic commented 4 months ago

Thanks for reporting this @TPDeramus!

In short, in the backend, arrow code converts the dplyr code into Arrow Expressions. In the case of the across() implementation, from what I recall, we just work out the individual calls and then our mutate() implementation later converts that into Arrow Expressions.

Great you've got a workaround here, I'll take a look at implementing anonymous functions at some point in future, as it'll be useful to have and now we have better support for that kind of thing in arrow than we used to.