DiskFrame / disk.frame

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
https://diskframe.com

warnings when using new github dtplyr #181

Open kendonB opened 5 years ago

kendonB commented 5 years ago
library(disk.frame)
library(dtplyr)
library(tidyverse)
iris_df = as.disk.frame(iris)
iris_df %>% 
  filter(Sepal.Length > 7) %>% 
  collect()
#> Warning: You are using a dplyr method on a raw data.table, which will call
#> the data frame implementation, and is likely to be inefficient.
#> 
#> To suppress this message, either generate a data.table translation
#> with `lazy_dt()` or convert to a data frame or tibble with
#> `as.data.frame()`/`as_tibble()`.

#> Warning: You are using a dplyr method on a raw data.table, which will call
#> the data frame implementation, and is likely to be inefficient.
#> 
#> To suppress this message, either generate a data.table translation
#> with `lazy_dt()` or convert to a data frame or tibble with
#> `as.data.frame()`/`as_tibble()`.

#> Warning: You are using a dplyr method on a raw data.table, which will call
#> the data frame implementation, and is likely to be inefficient.
#> 
#> To suppress this message, either generate a data.table translation
#> with `lazy_dt()` or convert to a data frame or tibble with
#> `as.data.frame()`/`as_tibble()`.

#> Warning: You are using a dplyr method on a raw data.table, which will call
#> the data frame implementation, and is likely to be inefficient.
#> 
#> To suppress this message, either generate a data.table translation
#> with `lazy_dt()` or convert to a data frame or tibble with
#> `as.data.frame()`/`as_tibble()`.

#> Warning: You are using a dplyr method on a raw data.table, which will call
#> the data frame implementation, and is likely to be inefficient.
#> 
#> To suppress this message, either generate a data.table translation
#> with `lazy_dt()` or convert to a data frame or tibble with
#> `as.data.frame()`/`as_tibble()`.

#> Warning: You are using a dplyr method on a raw data.table, which will call
#> the data frame implementation, and is likely to be inefficient.
#> 
#> To suppress this message, either generate a data.table translation
#> with `lazy_dt()` or convert to a data frame or tibble with
#> `as.data.frame()`/`as_tibble()`.
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 1           7.1         3.0          5.9         2.1 virginica
#> 2           7.6         3.0          6.6         2.1 virginica
#> 3           7.3         2.9          6.3         1.8 virginica
#> 4           7.2         3.6          6.1         2.5 virginica
#> 5           7.7         3.8          6.7         2.2 virginica
#> 6           7.7         2.6          6.9         2.3 virginica
#> 7           7.7         2.8          6.7         2.0 virginica
#> 8           7.2         3.2          6.0         1.8 virginica
#> 9           7.2         3.0          5.8         1.6 virginica
#> 10          7.4         2.8          6.1         1.9 virginica
#> 11          7.9         3.8          6.4         2.0 virginica
#> 12          7.7         3.0          6.1         2.3 virginica

Created on 2019-09-24 by the reprex package (v0.3.0)

Session info

``` r
devtools::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       Pacific/Auckland
#>  date     2019-09-24
#>
#> - Packages --------------------------------------------------------------
#>  package         * version    date       lib source
#>  assertthat        0.2.1      2019-03-21 [1] CRAN (R 3.6.0)
#>  backports         1.1.4      2019-04-10 [1] CRAN (R 3.6.0)
#>  benchmarkme       1.0.2      2019-08-19 [1] CRAN (R 3.6.1)
#>  benchmarkmeData   1.0.2      2019-08-19 [1] CRAN (R 3.6.1)
#>  bigreadr          0.1.10     2019-09-17 [1] CRAN (R 3.6.1)
#>  bit               1.1-14     2018-05-29 [1] CRAN (R 3.6.0)
#>  bit64             0.9-7      2017-05-08 [1] CRAN (R 3.6.0)
#>  broom             0.5.2      2019-04-07 [1] CRAN (R 3.6.0)
#>  callr             3.2.0      2019-03-15 [1] CRAN (R 3.6.0)
#>  cellranger        1.1.0      2016-07-27 [1] CRAN (R 3.6.0)
#>  cli               1.1.0      2019-03-19 [1] CRAN (R 3.6.0)
#>  codetools         0.2-16     2018-12-24 [2] CRAN (R 3.6.1)
#>  colorspace        1.4-1      2019-03-18 [1] CRAN (R 3.6.0)
#>  crayon            1.3.4      2017-09-16 [1] CRAN (R 3.6.0)
#>  data.table        1.12.2     2019-04-07 [1] CRAN (R 3.6.0)
#>  desc              1.2.0      2018-05-01 [1] CRAN (R 3.6.0)
#>  devtools          2.0.2      2019-04-08 [1] CRAN (R 3.6.0)
#>  digest            0.6.21     2019-09-20 [1] CRAN (R 3.6.1)
#>  disk.frame      * 0.1.1.999  2019-09-24 [1] Github (xiaodaigh/disk.frame@0883715)
#>  doParallel        1.0.15     2019-08-02 [1] CRAN (R 3.6.1)
#>  dplyr           * 0.8.3      2019-07-04 [1] CRAN (R 3.6.1)
#>  dtplyr          * 0.0.3.9000 2019-09-24 [1] Github (tidyverse/dtplyr@4d8d6da)
#>  evaluate          0.13       2019-02-12 [1] CRAN (R 3.6.0)
#>  forcats         * 0.4.0      2019-02-17 [1] CRAN (R 3.6.0)
#>  foreach           1.4.7      2019-07-27 [1] CRAN (R 3.6.1)
#>  fs                1.3.1      2019-05-06 [1] CRAN (R 3.6.0)
#>  fst               0.9.0      2019-04-09 [1] CRAN (R 3.6.0)
#>  furrr             0.1.0      2018-05-16 [1] CRAN (R 3.6.0)
#>  future            1.14.0     2019-07-02 [1] CRAN (R 3.6.1)
#>  future.apply      1.3.0      2019-06-18 [1] CRAN (R 3.6.1)
#>  generics          0.0.2      2018-11-29 [1] CRAN (R 3.6.0)
#>  ggplot2         * 3.1.1      2019-04-07 [1] CRAN (R 3.6.0)
#>  globals           0.12.4     2018-10-11 [1] CRAN (R 3.6.0)
#>  glue              1.3.1      2019-03-12 [1] CRAN (R 3.6.0)
#>  gtable            0.3.0      2019-03-25 [1] CRAN (R 3.6.0)
#>  haven             2.1.0      2019-02-19 [1] CRAN (R 3.6.0)
#>  highr             0.8        2019-03-20 [1] CRAN (R 3.6.0)
#>  hms               0.4.2      2018-03-10 [1] CRAN (R 3.6.0)
#>  htmltools         0.3.6      2017-04-28 [1] CRAN (R 3.6.0)
#>  httr              1.4.1      2019-08-05 [1] CRAN (R 3.6.1)
#>  iterators         1.0.12     2019-07-26 [1] CRAN (R 3.6.1)
#>  jsonlite          1.6        2018-12-07 [1] CRAN (R 3.6.0)
#>  knitr             1.23       2019-05-18 [1] CRAN (R 3.6.0)
#>  lattice           0.20-38    2018-11-04 [2] CRAN (R 3.6.1)
#>  lazyeval          0.2.2      2019-03-15 [1] CRAN (R 3.6.0)
#>  listenv           0.7.0      2018-01-21 [1] CRAN (R 3.6.0)
#>  lubridate         1.7.4      2018-04-11 [1] CRAN (R 3.6.0)
#>  magrittr          1.5        2014-11-22 [1] CRAN (R 3.6.0)
#>  Matrix            1.2-17     2019-03-22 [2] CRAN (R 3.6.1)
#>  memoise           1.1.0      2017-04-21 [1] CRAN (R 3.6.0)
#>  modelr            0.1.4      2019-02-18 [1] CRAN (R 3.6.0)
#>  munsell           0.5.0      2018-06-12 [1] CRAN (R 3.6.0)
#>  nlme              3.1-140    2019-05-12 [2] CRAN (R 3.6.1)
#>  pillar            1.4.2      2019-06-29 [1] CRAN (R 3.6.1)
#>  pkgbuild          1.0.3      2019-03-20 [1] CRAN (R 3.6.0)
#>  pkgconfig         2.0.3      2019-09-22 [1] CRAN (R 3.6.1)
#>  pkgload           1.0.2      2018-10-29 [1] CRAN (R 3.6.0)
#>  plyr              1.8.4      2016-06-08 [1] CRAN (R 3.6.0)
#>  prettyunits       1.0.2      2015-07-13 [1] CRAN (R 3.6.0)
#>  processx          3.3.1      2019-05-08 [1] CRAN (R 3.6.0)
#>  pryr              0.1.4      2018-02-18 [1] CRAN (R 3.6.0)
#>  ps                1.3.0      2018-12-21 [1] CRAN (R 3.6.0)
#>  purrr           * 0.3.2      2019-03-15 [1] CRAN (R 3.6.0)
#>  R6                2.4.0      2019-02-14 [1] CRAN (R 3.6.0)
#>  Rcpp              1.0.2      2019-07-25 [1] CRAN (R 3.6.1)
#>  readr           * 1.3.1      2018-12-21 [1] CRAN (R 3.6.0)
#>  readxl            1.3.1      2019-03-13 [1] CRAN (R 3.6.0)
#>  remotes           2.0.4      2019-04-10 [1] CRAN (R 3.6.0)
#>  rlang             0.4.0      2019-06-25 [1] CRAN (R 3.6.1)
#>  rmarkdown         1.12       2019-03-14 [1] CRAN (R 3.6.0)
#>  rprojroot         1.3-2      2018-01-03 [1] CRAN (R 3.6.0)
#>  rvest             0.3.4      2019-05-15 [1] CRAN (R 3.6.0)
#>  scales            1.0.0      2018-08-09 [1] CRAN (R 3.6.0)
#>  sessioninfo       1.1.1      2018-11-05 [1] CRAN (R 3.6.0)
#>  stringi           1.4.3      2019-03-12 [1] CRAN (R 3.6.0)
#>  stringr         * 1.4.0      2019-02-10 [1] CRAN (R 3.6.0)
#>  testthat          2.1.1      2019-04-23 [1] CRAN (R 3.6.0)
#>  tibble          * 2.1.3      2019-06-06 [1] CRAN (R 3.6.1)
#>  tidyr           * 0.8.3      2019-03-01 [1] CRAN (R 3.6.0)
#>  tidyselect        0.2.5      2018-10-11 [1] CRAN (R 3.6.0)
#>  tidyverse       * 1.2.1      2017-11-14 [1] CRAN (R 3.6.0)
#>  usethis           1.5.0      2019-04-07 [1] CRAN (R 3.6.0)
#>  withr             2.1.2      2018-03-15 [1] CRAN (R 3.6.0)
#>  xfun              0.7        2019-05-14 [1] CRAN (R 3.6.0)
#>  xml2              1.2.0      2018-01-24 [1] CRAN (R 3.6.0)
#>  yaml              2.2.0      2018-07-25 [1] CRAN (R 3.6.0)
#>
#> [1] C:/Users/kmbel/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.1/library
```

xiaodaigh commented 5 years ago

Thanks for this! I turned off dtplyr support before the release of v0.1.0 as there were many cases where dtplyr didn't work. I will look to find a solution for this as I think lazy_dt is a new API.

Here is a workaround, although it is painful:

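# Workaround: wrap each chunk in lazy_dt() so the dplyr verbs go through dtplyr
aa = iris_df %>% 
  map(~{
    dtplyr::lazy_dt(.x) %>% 
      filter(Sepal.Length > 7) %>% 
      collect()  # evaluate the lazy pipeline for this chunk
  }) %>% 
  collect()  # gather the filtered chunks into memory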

I will need to think about a good way to incorporate dtplyr, which was part of the original design. I am also keen to start work on this once the new dtplyr is on CRAN.

kendonB commented 5 years ago

I don't think you necessarily need to support dtplyr to address the current issue. The warnings seem to come about because you use data.table in the background then call dplyr verbs. You can probably fix this issue by converting to data.frame before running the dplyr functions.
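
A minimal sketch of this suggestion, assuming each in-memory chunk is a data.table (the name `chunk` is made up for illustration and this is not disk.frame's actual internals). Converting to a plain data.frame first means dplyr dispatches on data.frame rather than on a raw data.table:

```r
library(data.table)
library(dplyr)

# Assumption: a chunk arrives in memory as a data.table.
chunk <- as.data.table(iris)

# Convert to a plain data.frame before applying dplyr verbs, so the
# data.frame method of filter() is used instead of a data.table one.
result <- chunk %>%
  as.data.frame() %>%
  filter(Sepal.Length > 7)
```

The cost is one conversion per chunk; whether that is acceptable depends on chunk size.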

xiaodaigh commented 5 years ago

I see. Good point. But the warning only appears if you load dtplyr, so I take that to mean that if you load dtplyr then that's what you want to use, rather than converting to data.frame first. I think I should support dtplyr once it's on CRAN anyway, as a solidarity measure between tidyverse and data.table. :)

kendonB commented 5 years ago

I don't think you should assume that the user wants to use dtplyr just because it's loaded. The interface would ideally be as close to the in-memory interface as possible.

That is, if the user calls lazy_dt on the disk.frame object first, then I'd go ahead and call lazy_dt on the data.frame objects once they're in memory (once dtplyr is on CRAN). Otherwise, I wouldn't use data.table at all unless you have a really good reason. Not everything is faster in data.table; for example, I find left_join much better than the equivalent data.table merge.
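
The opt-in behaviour described here could look roughly like the following. This is only a sketch: `filter_chunk` and its `use_dtplyr` flag are hypothetical names, not part of disk.frame or dtplyr, and the filter condition is hard-coded to match the example in this issue.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Hypothetical helper: apply the filter to one in-memory chunk.
# use_dtplyr stands in for "the user called lazy_dt on the disk.frame".
filter_chunk <- function(chunk, use_dtplyr = FALSE) {
  if (use_dtplyr) {
    # The user opted in: translate the dplyr verbs to data.table via dtplyr.
    chunk %>%
      lazy_dt() %>%
      filter(Sepal.Length > 7) %>%
      as.data.frame()
  } else {
    # Default: behave like in-memory dplyr on a plain data.frame.
    chunk %>%
      as.data.frame() %>%
      filter(Sepal.Length > 7)
  }
}

chunk <- as.data.table(iris)
via_dplyr  <- filter_chunk(chunk)
via_dtplyr <- filter_chunk(chunk, use_dtplyr = TRUE)
```

Both branches return a data.frame, so the caller does not need to know which path was taken.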

xiaodaigh commented 5 years ago

Alright. Implementing lazy_dt for disk.frame sounds reasonable because it stays close to the dtplyr syntax.

kendonB commented 5 years ago

You might also be able to get them to change lazy_dt to a generic if you are quick!
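
To illustrate why a generic would help, here is a rough sketch using a made-up S3 generic, `lazy_dt_generic`. It is not dtplyr's API; the thread implies that `lazy_dt` itself is a plain function. The `use_lazy_dt` attribute is likewise an assumption for illustration only.

```r
library(dtplyr)

# Hypothetical generic standing in for a generic version of lazy_dt().
lazy_dt_generic <- function(x, ...) UseMethod("lazy_dt_generic")

# Default method: defer to dtplyr for ordinary in-memory data.
lazy_dt_generic.default <- function(x, ...) dtplyr::lazy_dt(x, ...)

# A package such as disk.frame could then register its own method,
# e.g. flagging the object so each chunk is wrapped with lazy_dt() later.
lazy_dt_generic.disk.frame <- function(x, ...) {
  attr(x, "use_lazy_dt") <- TRUE
  x
}

# Stand-in object with the right class, only to show the dispatch.
fake_df  <- structure(list(), class = "disk.frame")
flagged  <- lazy_dt_generic(fake_df)
in_mem   <- lazy_dt_generic(iris)
```

With a generic, the user-facing call stays the same for in-memory data and for disk.frame objects; without one, disk.frame has to provide its own entry point.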

xiaodaigh commented 5 years ago

Good thinking! See https://github.com/tidyverse/dtplyr/issues/105

xiaodaigh commented 5 years ago

I still need to implement lazy_dt, but I will wait for the new dtplyr to go on CRAN first.

xiaodaigh commented 5 years ago

The latest GitHub version of disk.frame gets rid of the warnings, but I still need to do the lazy_dt implementation at some point.

xiaodaigh commented 5 years ago

> You might also be able to get them to change lazy_dt to a generic if you are quick!

Mr Hadley has closed the issue as won't-fix. I don't follow the logic exactly, but I don't feel like arguing. I think they are busy enough. I will figure out a way to accommodate this.