apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Implementing tidyr interface #24956

Open asfimport opened 4 years ago

asfimport commented 4 years ago

I think it would be reasonable to implement an interface to the tidyr package. Such an interface would allow Arrow Tables to be processed lazily before they are pulled into memory. Currently, however, you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine:


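library(magrittr)
# Read the file as an Arrow Table instead of a data frame
arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE)
nested_df <-
   arrow_table %>%
   dplyr::select(ID, 4:7, Value) %>%
   dplyr::filter(Value >= 5) %>%
   dplyr::group_by(ID) %>%
   # collect() pulls the data into memory; only then can tidyr be applied
   dplyr::collect() %>%
   tidyr::nest()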

The main focus might be the following three methods:

Reporter: Dominic Dennenmoser

Note: This issue was originally created as ARROW-8813. Please see the migration documentation for further details.

asfimport commented 4 years ago

Neal Richardson / @nealrichardson: If you wanted to explore this, one challenge I see is that pivot_longer and pivot_wider aren't generics, so you can't just make arrow methods for them.
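To illustrate the point about generics, here is a minimal base R sketch of S3 dispatch. It does not use tidyr or arrow; `reshape_plain`, `reshape_generic`, and the `FakeArrowTable` class are hypothetical names made up for this example. It only shows why a package can add its own method to a generic but cannot do so for a plain function.

```r
# A non-generic function: it only handles data frames, so another package
# has no hook to provide its own behaviour for a different class.
reshape_plain <- function(data) {
  stopifnot(is.data.frame(data))
  "data.frame implementation"
}

# The same idea written as an S3 generic: dispatch happens on class(data).
reshape_generic <- function(data, ...) {
  UseMethod("reshape_generic")
}

# Method for ordinary data frames.
reshape_generic.data.frame <- function(data, ...) {
  "data.frame implementation"
}

# Hypothetical stand-in for an Arrow table (not the real arrow class).
fake_table <- structure(list(), class = "FakeArrowTable")

# Because reshape_generic() is a generic, a method for the new class
# can be added without touching the original function.
reshape_generic.FakeArrowTable <- function(data, ...) {
  "FakeArrowTable implementation"
}

result_df <- reshape_generic(data.frame(x = 1))
result_tbl <- reshape_generic(fake_table)

# The non-generic version simply fails for the new class.
plain_fails <- inherits(try(reshape_plain(fake_table), silent = TRUE), "try-error")
```

In this sketch, `result_tbl` comes from the `FakeArrowTable` method, while `reshape_plain()` errors on the same object. That is the situation described above for functions that are not generics.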

asfimport commented 4 years ago

Dominic Dennenmoser: Thanks for pointing that out. I've just looked for issues or pull requests mentioning anything in that direction. Fortunately, a generic version of pivot_[longer|wider]() will be available in the upcoming version of tidyr, and is already implemented in the development version (#800).

asfimport commented 2 years ago

Nigel McKernan: The change [~domiden] references landed in tidyr 1.1.0 back in May 2020, as you can see [here](https://github.com/tidyverse/tidyr/releases#:~:text=pivot_longer()%20and%20pivot_wider()%20are%20now%20generic%20so%20implementations%0Acan%20be%20provided%20for%20objects%20other%20than%20data%20frames), more than 2 years ago.

Would it now be possible to incorporate into `arrow` some of the tidyr methods that have been converted to generics?

EDIT: Also, as of the tidyr 1.2.0 release earlier this year, the nest() generic [avoids computing on .data](https://github.com/tidyverse/tidyr/releases#:~:text=The%20nest()%20generic%20now%20avoids%20computing%20on%20.data%2C%20making%20it%20more%0Acompatible%20with%20lazy%20tibbles), which makes it more compatible with lazy tibbles and should make remote operations easier.

eitsupi commented 1 year ago

Related to #34265