DavisVaughan / furrr

Apply Mapping Functions in Parallel using Futures
https://furrr.futureverse.org/

future_map() much slower than purrr::map() #110

Closed edgBR closed 4 years ago

edgBR commented 4 years ago

Dear colleagues,

I am trying to create a time series object for each element of a grouped, nested tibble, as follows:

library(lubridate)
library(odbc)
library(DBI)
library(ConfigParser)
library(aws.s3)
library(tidyverse)
library(tidyquant)
library(timetk)
library(sweep)
library(forecast)
library(furrr)
library(tictoc)
test <- lifeCounterFulldata %>% filter(n() >= 52*2) %>% 
  select(col1, col2, col3, col4, col5) %>%
  nest()
tic()
data_ts <- test %>%
  mutate(data.ts = map(.x       = data, 
                       .f       = tk_ts,
                       freq     = 52))
toc()

The elapsed time of this procedure is 29 s. However, when I tried to do the same thing with furrr:

tic()
library(lubridate)
library(odbc)
library(DBI)
library(ConfigParser)
library(aws.s3)
library(tidyverse)
library(tidyquant)
library(timetk)
library(sweep)
library(forecast)
library(furrr)
library(tictoc)
no_cores <- availableCores() - 4 #my machine has 32 cores
plan(multicore, workers = no_cores)
data_ts <- test %>%
  mutate(data.ts = future_map(.x        = data,
                              .f        = tk_ts,
                              freq      = 52,
                              .progress = TRUE))
toc()

I am getting an elapsed time of 130 s. Any idea what I am doing wrong?

BR /Edgar

sessionInfo()


R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] tictoc_1.0                 feather_0.3.5              DescTools_0.99.35          data.table_1.12.8
 [5] furrr_0.1.0                future_1.17.0              forecast_8.12              sweep_0.2.2
 [9] timetk_1.0.0               tidyquant_1.0.0            quantmod_0.4.17            TTR_0.23-6
[13] PerformanceAnalytics_2.0.4 xts_0.12-0                 zoo_1.8-8                  forcats_0.5.0
[17] stringr_1.4.0              dplyr_0.8.5                purrr_0.3.4                readr_1.3.1
[21] tidyr_1.0.3                tibble_3.0.1               ggplot2_3.3.0              tidyverse_1.3.0
[25] aws.s3_0.3.21              ConfigParser_1.0.0         R6_2.4.1                   ini_0.3.1
[29] DBI_1.1.0                  odbc_1.2.2                 lubridate_1.7.8

loaded via a namespace (and not attached):
 [1] nlme_3.1-147          fs_1.4.1              bit64_0.9-7           httr_1.4.1            tools_4.0.0
 [6] backports_1.1.6       rpart_4.1-15          lazyeval_0.2.2        colorspace_1.4-1      nnet_7.3-14
[11] withr_2.2.0           tidyselect_1.1.0      bit_1.1-15.2          curl_4.3              compiler_4.0.0
[16] cli_2.0.2             rvest_0.3.5           expm_0.999-4          xml2_1.3.2            tseries_0.10-47
[21] scales_1.1.0          mvtnorm_1.1-0         lmtest_0.9-37         fracdiff_1.5-1        quadprog_1.5-8
[26] digest_0.6.25         base64enc_0.1-3       pkgconfig_2.0.3       dbplyr_1.4.3          rlang_0.4.6
[31] readxl_1.3.1          rstudioapi_0.11       generics_0.0.2        jsonlite_1.6.1        magrittr_1.5
[36] Matrix_1.2-18         aws.ec2metadata_0.2.0 Rcpp_1.0.4.6          Quandl_2.10.0         munsell_0.5.0
[41] fansi_0.4.1           lifecycle_0.2.0       stringi_1.4.6         MASS_7.3-51.6         recipes_0.1.12
[46] grid_4.0.0            blob_1.2.1            listenv_0.8.0         parallel_4.0.0        crayon_1.3.4
[51] lattice_0.20-41       haven_2.2.0           splines_4.0.0         hms_0.5.3             pillar_1.4.4
[56] boot_1.3-25           codetools_0.2-16      urca_1.3-0            reprex_0.3.0          glue_1.4.1
[61] packrat_0.5.0         modelr_0.1.7          vctrs_0.3.0           cellranger_1.1.0      gtable_0.3.0
[66] assertthat_0.2.1      gower_0.2.1           prodlim_2019.11.13    broom_0.5.6           class_7.3-17
[71] survival_3.1-12       timeDate_3043.102     aws.signature_0.5.2   lava_1.6.7            globals_0.12.5
[76] ellipsis_0.3.0        ipred_0.9-9
DavisVaughan commented 4 years ago

The cost of moving the data tibbles to your 28 workers and back almost certainly outweighs any benefit of running timetk::tk_ts() in parallel. It is best to use future_map() only with functions that take a significant amount of time per element and that can also be executed in an embarrassingly parallel way.
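
To see this trade-off on your own machine, a rough sketch like the following may help. It is not taken from the original report: the data (`dfs`), the two functions (`cheap`, `slow`), the 20 ms sleep, and the choice of `multisession` with 2 workers are all assumptions made for illustration. It times a very cheap function and an artificially slow one, each with purrr::map_dbl() and furrr::future_map_dbl().

```r
library(purrr)
library(future)
library(furrr)

# Assumption: 2 background R sessions; adjust to your machine.
plan(multisession, workers = 2)

# Hypothetical stand-in data: 200 small data frames.
dfs <- map(1:200, function(i) data.frame(x = rnorm(1000)))

# A cheap function: far less work than shipping the data to a worker.
cheap <- function(df) mean(df$x)

# An artificially slow function: roughly 20 ms of "work" per element.
slow <- function(df) {
  Sys.sleep(0.02)
  mean(df$x)
}

# Elapsed seconds for each combination; results are kept for comparison.
t_cheap_seq <- system.time(r1 <- map_dbl(dfs, cheap))[["elapsed"]]
t_cheap_par <- system.time(r2 <- future_map_dbl(dfs, cheap))[["elapsed"]]
t_slow_seq  <- system.time(r3 <- map_dbl(dfs, slow))[["elapsed"]]
t_slow_par  <- system.time(r4 <- future_map_dbl(dfs, slow))[["elapsed"]]

timings <- c(cheap_seq = t_cheap_seq, cheap_par = t_cheap_par,
             slow_seq  = t_slow_seq,  slow_par  = t_slow_par)
print(timings)

# Return to sequential processing and shut the workers down.
plan(sequential)
```

The exact numbers depend on the machine, but the cheap function would typically be no faster (and often slower) in parallel, while the slow one should benefit. If tk_ts() behaves like `cheap` here, sticking with purrr::map() is the simpler choice.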