Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Using `Map` instead of `lapply` turns GForce off #5336

Open grantmcdermott opened 2 years ago

grantmcdermott commented 2 years ago

After recently extolling the virtues of base::Map over lapply (quicker to type, direct equivalents in other languages, easy to add multiple arguments, etc.), I was a little surprised by the following:

library(data.table) 
library(microbenchmark)

flights = fread('https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv')

vars = c('dep_delay', 'arr_delay', 'air_time', 'distance')

microbenchmark(
  lapply = flights[, lapply(.SD, sum), by=.(month, day, origin, dest), .SDcols=vars],
  Map = flights[, Map(sum, .SD), by=.(month, day, origin, dest), .SDcols=vars],
  times = 2
)
#> Unit: milliseconds
#>    expr        min         lq       mean     median         uq        max neval
#>  lapply   12.31694   12.31694   14.19584   14.19584   16.07475   16.07475     2
#>     Map 1535.70696 1535.70696 1549.11904 1549.11904 1562.53112 1562.53112     2
#>  cld
#>   a 
#>    b

Created on 2022-02-16 by the reprex package (v2.0.1)

Re-running the Map version with verbose=TRUE, the expected culprit appears:

#> ...
#> lapply optimization is on, j unchanged as 'Map(sum, .SD)'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... The result of j is a named list. It's 
#> very inefficient to create the same names over and over again for each group. When 
#> j=list(...), any names are detected, removed and put back after grouping has 
#> completed, for efficiency. Using j=transform(), for example, prevents that speedup 
#> (consider changing to :=). This message may be upgraded to warning in future.
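For anyone who wants to reproduce the comparison without downloading flights14, here is a minimal sketch on a small synthetic table. The table `dt` and its columns `g`, `x`, `y` are made up for illustration. With `verbose=TRUE`, each call prints the optimization steps data.table applies to `j`, so the two code paths can be compared directly.

```r
library(data.table)

# Small synthetic table (hypothetical data standing in for flights14)
dt = data.table(g = c("a", "a", "b", "b"), x = 1:4, y = c(10, 20, 30, 40))
vars = c("x", "y")

# verbose=TRUE prints how data.table rewrites (or does not rewrite) j
res_lapply = dt[, lapply(.SD, sum), by = g, .SDcols = vars, verbose = TRUE]
res_map    = dt[, Map(sum, .SD),    by = g, .SDcols = vars, verbose = TRUE]

# Both forms should give the same values; only the execution path differs
all.equal(res_lapply, res_map)
```

On a table this small the timing difference is negligible. The point is only to compare the verbose messages for the two forms of `j`.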

Is there a technical reason why Map shouldn't/wouldn't work with GForce, but lapply does?

Related (possibly even a duplicate): https://github.com/Rdatatable/data.table/issues/4225

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4.9 data.table_1.14.3   

loaded via a namespace (and not attached):
[1] compiler_4.1.2 tools_4.1.2    parallel_4.1.2

ben-schwen commented 2 years ago

Internally, the slowdown is not caused by GForce itself. It happens because the lapply optimization is never applied to the Map call in the first place, so GForce has nothing to work with.

For the lapply version in your example, that optimization rewrites j as follows:

# lapply optimization changed j from 'lapply(.SD, sum)' to 'list(sum(dep_delay), sum(arr_delay), sum(air_time), sum(distance))'

It's possible to optimize Map and mapply in the same way, by expanding the expressions into their list equivalents. If they are expanded correctly, GForce should take over smoothly from there.

In the easiest case, unary functions, this should be just a small change.
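To illustrate what such an expansion could look like for the unary case, here is a rough sketch. `expand_unary_map` is a hypothetical helper written for this comment, not data.table's internal code, and it only handles the exact shape `Map(f, .SD)`. It keeps the column names on the list elements so the result columns are named the same as in the lapply version.

```r
library(data.table)

# Hypothetical helper, NOT data.table internals: rewrite a unary
# Map(f, .SD) call into list(col1 = f(col1), col2 = f(col2), ...).
expand_unary_map = function(j, cols) {
  # Only accept the simplest shape: Map(<function>, .SD)
  stopifnot(
    is.call(j),
    length(j) == 3L,
    identical(j[[1L]], as.name("Map")),
    identical(j[[3L]], as.name(".SD"))
  )
  fun = j[[2L]]
  # One call f(col) per column, named after the column
  calls = lapply(cols, function(col) as.call(list(fun, as.name(col))))
  names(calls) = cols
  as.call(c(as.name("list"), calls))
}

# Synthetic data for illustration only
dt = data.table(g = c("a", "a", "b", "b"), x = 1:4, y = c(10, 20, 30, 40))
vars = c("x", "y")

j_expanded = expand_unary_map(quote(Map(sum, .SD)), vars)
print(j_expanded)

# Evaluate the expanded expression as j and compare with the lapply form
res_expanded = dt[, eval(j_expanded), by = g]
res_lapply   = dt[, lapply(.SD, sum), by = g, .SDcols = vars]
all.equal(res_expanded, res_lapply)
```

Whether GForce actually takes over for the expanded expression can be checked by adding `verbose=TRUE` to the call. Multi-argument Map or mapply calls would need more care than this sketch provides.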

Related to #5032 and #3815

MichaelChirico commented 2 years ago

I think if we had squeaky-clean internals on `[`, this would be pretty straightforward... at the moment I think it might be messy.

Leaving this as a "PRs accepted" request for now...