Open TPDeramus opened 4 months ago
I'm sure there's a way it could be done with `arrow_max` or `call_function`, but that is not readily apparent to me either, and it also keeps throwing `function does not exist` errors (probably due to it being nested in `group_by`, `summarize`, and `across`).
Hi Arrow Devs.
Some individuals in the Posit forums found a solution and it prompted some discussion we thought might be worth sending your way: https://forum.posit.co/t/arrow-with-tidyverse-calling-min-max-mean-with-summarize-on-arrow-tables/188985
"dplyr::across() also supports a purrr-style lambda definition, which strangely seems to work in arrow where the other methods failed."
```r
library(arrow)
library(dplyr)

data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Rating = c(21, NA, 17, NA)
) |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~max(.x, na.rm = TRUE))) |>
  as.data.frame()
##   Participant Rating
## 1        Greg     21
## 2       Donna     17
```
"_I'm not sure at what points the operations become outsourced to arrow methods, but I don't know whether the ~min(.x, ...) lambda notation somehow tricks dplyr into not outsourcing this operation to arrow.
With dbplyr, everything is converted to SQL queries instead and you can view the SQL query to check it. Is there an equivalent arrow command that lets you see what commands are sent to arrow?_"
Would any of you be willing to explain how this works on the backend?
Happy to pass it on.
Thanks for reporting this @TPDeramus!
In short, in the backend, arrow code converts the dplyr code into Arrow Expressions. In the case of the `across()` implementation, from what I recall, we just work out the individual calls and then our `mutate()` implementation later converts that into Arrow Expressions.
Great that you've got a workaround here. I'll take a look at implementing anonymous functions at some point in the future, as it'll be useful to have, and we now have better support for that kind of thing in arrow than we used to.
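On the forwarded question about seeing what is sent to Arrow, here is a minimal sketch. It assumes a recent arrow release in which `show_query()` has a method for Arrow dplyr queries; the variable names `query` and `result` are illustrative only. The pipeline is built lazily, so the query can be inspected before anything is pulled into R.

```r
library(arrow)
library(dplyr)

df <- data.frame(
  Participant = c("Greg", "Greg", "Donna", "Donna"),
  Rating = c(21, NA, 17, NA)
)

# Build the query without collecting it; nothing is evaluated yet
query <- df |>
  as_arrow_table() |>
  group_by(Participant) |>
  summarize(across(matches("Rating"), ~ max(.x, na.rm = TRUE)))

# Assumption: show_query() is available for Arrow queries in the installed
# arrow version; it prints the execution plan that Arrow would run
show_query(query)

# Evaluate the query and bring the result back into R
result <- collect(query)
```

If the aggregation appears in the printed plan, the work is being done by Arrow rather than by dplyr in R.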
Describe the usage question you have. Please include as many useful details as possible.
Hi Arrow devs.
I wanted to ask about something I noticed about using the column-wise operators with `dplyr` in `arrow` tables.
If I had an arrow table, and I wanted to run a basic function such as `mean`, `max`, or `min` using `summarize`, it appears that `arrow` does not currently accept the `na.rm = TRUE` argument, or that if it does, I can't seem to find it in the documentation.
Say I took the original dataset:
If these were generic `R` dataframes, either of these two calls would work (though one is deprecated):
However, when I run the same commands as an arrow table, both throw errors:
And the one that does work:
`NA` values that are not what I want:
Is there a way to pass the `na.rm = TRUE` argument to this call without having to manually drop the `NA` values for each column or row of interest I have in my data?
Component(s)
R