TysonStanley / tidyfast

Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats
https://tysonbarrett.com/tidyfast/
187 stars 4 forks source link

WIP: dtplyr support via .dtplyr_step methods #36

Closed vincentarelbundock closed 4 years ago

vincentarelbundock commented 4 years ago

I marked this as WIP because it needs review and because I need to look at it myself with fresh eyes (not tonight). It would be great if @TysonStanley could look at this and let me know what you think of the strategy, and if you see any red flags before I sink more time into this. [edit: obviously errors will have to be fixed, but let's just think about implementation for now]

Right now, pretty much everything seems to work as expected. You can try it out by installing the branch and running these examples. You can also try running the code in the new tests.

remotes::install_github("vincentarelbundock/tidyfast@dtplyr")

library(tidyfast)
library(dtplyr)
library(magrittr)
library(dplyr)

dat <- lazy_dt(dplyr::starwars)

dat %>% mutate(black_hair = hair_color == "black") %>%
        select(name, hair_color, eye_color) %>%
        dt_pivot_longer(-name) %>%
        tail
#> Duplicate column names found in molten data.table. Setting unique names using 'make.names'
#> Source: local data table [?? x 3]
#> Call:   tail(`_DT2`, n = 6L)
#> 
#>   name           name.1    value  
#>   <chr>          <chr>     <chr>  
#> 1 Finn           eye_color dark   
#> 2 Rey            eye_color hazel  
#> 3 Poe Dameron    eye_color brown  
#> 4 BB8            eye_color black  
#> 5 Captain Phasma eye_color unknown
#> 6 Padmé Amidala  eye_color brown  
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

dat %>% 
    dt_nest(gender) %>%
    mutate(ncharacters = sapply(data, nrow))
#> Source: local data table [?? x 3]
#> Call:   copy(`_DT3`)[, `:=`(ncharacters = sapply(data, ..nrow))]
#> 
#>   gender    data                        ncharacters
#>   <chr>     <list>                            <int>
#> 1 <NA>      <data.table[,13] [4 × 13]>            4
#> 2 feminine  <data.table[,13] [17 × 13]>          17
#> 3 masculine <data.table[,13] [66 × 13]>          66
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

Created on 2020-08-18 by the reprex package (v0.3.0)

vincentarelbundock commented 4 years ago

Alright, I'm stumped. I get no error when I run the test suite locally, but I get an error on Travis. That error is related to a perceived inconsistency:

* checking S3 generic/method consistency ... WARNING
dt_unnest:
  function(dt_, col)
dt_unnest.dtplyr_step:
  function(dt_, col, fill, ...)

which triggers an error:

  Running ‘testthat.R’
 ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  > library(data.table)
  > 
  > test_check("tidyfast")
  ── 1. Error: dt_unnest.dtplyr_step (@test-dtplyr.R#53)  ────────────────────────
  unused argument (fill = FALSE)
  Backtrace:
   1. tidyfast::dt_unnest(dat, col = data)
   2. tidyfast:::dt_unnest.dtplyr_step(dat, col = data)

But again, I don't get the error locally, and dt_unnest, dt_unnest.default and dt_unnest.dtplyr_step all have the same arguments:

https://github.com/vincentarelbundock/tidyfast/blob/dtplyr/R/unnest.R#L25

Not sure what is going on.

vincentarelbundock commented 4 years ago

Another (radical?) thing to consider would also be to export pivot_wider.dtplyr_step, unnest.dtplyr_step, etc. methods so people's code would only have to load tidyfast, and their code would look like "normal" tidyr code (with the exception of the initial lazy_dt()) call.

Not sure if there are drawbacks or this would be well-received upstream, but it would be super cool and convenient for users.

TysonStanley commented 4 years ago

That is a super interesting idea. I'll need to think about that but it would be seamless and if they are loading tidyfast and using lazy_dt() it shouldn't be a surprise.