DavisVaughan / furrr

Apply Mapping Functions in Parallel using Futures
https://furrr.futureverse.org/

furrr::future_map_dbl slower than purrr::map_dbl #195

Closed · franrodalg closed this issue 3 years ago

franrodalg commented 3 years ago

Hi everyone,

I'm very sorry if I'm missing something extremely obvious, but I was playing around with furrr::future_map_dbl to see if it could help me speed up some code that was taking ages, and I was quite surprised to find that not only did it not get faster, it actually got slower. I have checked the documentation, which suggests that grouped data frames can cause a similar issue, but I don't think that's the case here.

I hope the formatting of the example code is correct, but my apologies if it isn't. I have tried using reprex, but it doesn't seem to work for me outside RStudio, and plan(multisession) crashes every time I try to run it from RStudio (any advice in that regard would also be appreciated!).

library(tidyverse)
library(future) # for plan()

x <- c(
  "218H49M27H", "132H52M228H", "5H54M269H", "16H57M185H",
  "1407H15M2I20M1D21M30H", "92H31M1D10M3D20M34H", "116H14M1D3M1D5M1I38M24H",
  "226H18M2D1M1D43M32H", "147H20M3D42M276H", "119H62M26H"
)

get_size <- function(x) {
  tibble(
    op = unlist(stringr::str_extract_all(x, "[aA-zZ]+")),
    size = as.numeric(unlist(stringr::str_extract_all(x, "[0-9]+")))
  ) %>%
    filter(op %in% c('M', 'D')) %>%
    group_by(op) %>%
    summarise(size = sum(size)) %>%
    pull(size) %>%
    sum()
}

library(tictoc)

plan(sequential)
tic()
purrr::map_dbl(x, get_size)
# [1] 49 52 54 57 57 65 62 65 65 62
toc()
# 0.333 sec elapsed

plan(multisession, workers=4)
tic()
furrr::future_map_dbl(x, get_size)
# [1] 49 52 54 57 57 65 62 65 65 62
toc()
# 5.494 sec elapsed

Is there anything in my code that could explain why furrr's version is substantially slower?

Cheers, Fran

DavisVaughan commented 3 years ago

What OS are you on?


A few things here. First off, the fact that plan(multisession) doesn't work in RStudio was an R bug that has been fixed in R-devel (4.2.0); you can read a little about it here: https://github.com/HenrikBengtsson/parallelly/issues/54

The quickest fix for you is to install the development version of parallelly, which is what future uses to set up the PSOCK nodes. It has a patch for R < 4.2.0 that will allow you to use it from RStudio again.

remotes::install_github("HenrikBengtsson/parallelly", ref="develop")

Second, remember that when your code runs on a parallel worker, that worker also has to load any packages required to run the code. So at the very least this seems to need tidyverse and stringr on each worker, which adds to the total time. Here is an approximate cost of just loading these packages:

library(tictoc)
tic(); library(tidyverse); library(stringr); toc()
#> 0.985 sec elapsed

Here is what the total time looks like for me:

library(dplyr)
library(stringr)
library(tictoc)
library(future)

x <- c(
  "218H49M27H", "132H52M228H", "5H54M269H", "16H57M185H", 
  "1407H15M2I20M1D21M30H", "92H31M1D10M3D20M34H", "116H14M1D3M1D5M1I38M24H", 
  "226H18M2D1M1D43M32H","147H20M3D42M276H", "119H62M26H"
) 

get_size <- function(x) {
  tibble(
    op=unlist(str_extract_all(x, "[aA-zZ]+")),
    size=as.numeric(unlist(str_extract_all(x, "[0-9]+")))
  ) %>%
    filter(op %in% c('M', 'D')) %>%
    group_by(op) %>%
    summarise(size=sum(size)) %>%
    pull(size) %>% 
    sum()
}

plan(sequential)
tic()
purrr::map_dbl(x, get_size)
#>  [1] 49 52 54 57 57 65 62 65 65 62
toc()
#> 0.089 sec elapsed

plan(multisession, workers=4)
tic()
furrr::future_map_dbl(x, get_size)
#>  [1] 49 52 54 57 57 65 62 65 65 62
toc()
#> 1.112 sec elapsed

plan(sequential)

This seems to make sense to me. It's a little over 1 second, mostly due to the cost of loading the required packages on each worker. Maybe loading those packages is more expensive on your computer.

Nevertheless, I wouldn't recommend using furrr for things that typically take a few seconds or less anyway. Shoot for parallelizing things that take minutes or hours instead, so the benefits outweigh the static costs you'll have to pay to set up the workers.
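
To make that concrete, here is a rough sketch (not from the original thread) where each element takes about half a second of work, simulated with Sys.sleep(); with that much real computation per element, four workers comfortably beat the sequential run despite the setup cost:

library(furrr)
library(tictoc)

slow_task <- function(x) {
  Sys.sleep(0.5) # stand-in for an expensive per-element computation
  sqrt(x)
}

plan(sequential)
tic()
res <- purrr::map_dbl(1:40, slow_task)
toc()
#> roughly 20 sec elapsed

plan(multisession, workers = 4)
tic()
res <- furrr::future_map_dbl(1:40, slow_task)
toc()
#> roughly 5-6 sec elapsed: the worker setup is amortized over ~20 sec of work

plan(sequential)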

franrodalg commented 3 years ago

Hi Davis!

Thanks so much for your answer.

What OS are you on?

I'm working on Mac OS 10.13, at the moment.

The quickest fix for you is to install the development version of parallelly, which is what future uses to set up the PSOCK nodes. It has a patch for R < 4.2.0 that will allow you to use it from RStudio again.

Thanks! It seemed to work well.

remember that when your code runs on a parallel worker, that worker also has to load any packages required to run the code. So at the very least this seems to need tidyverse and stringr on each worker, which adds to the total time.

Oh! I didn't realise that was the case. It makes sense!

Nevertheless, I wouldn't recommend using furrr for things that typically take a few seconds or less anyways. Shoot for parallelizing things that take minutes or hours instead, so the benefits outweigh the static costs you'll have to pay to set up the workers.

I used that short example just to make the explanation easier here, but my data is substantially larger (which is why I was looking for a way to run it in parallel). I had been doing some tests with 100 to 1000 input values and didn't see any improvement whatsoever when using furrr instead of purrr. I've now run the entire dataset (>145k entries), and there is indeed some speed-up (~430 seconds instead of ~730), but possibly less than I imagined when requesting 4 "processes".

Is that improvement within the expected range?

Thanks again, Fran

franrodalg commented 3 years ago

Quick question: how does furrr know which packages to load in order to run the code? For instance, would it load dplyr instead of the entire tidyverse?

DavisVaughan commented 3 years ago

Is that improvement within the expected range?

It seems reasonable. The other thing to keep in mind is that you also have to actually send the data to the workers. So that 145k-entry data set has to be broken up and sent off to the workers, which takes some time too.
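
As a rough way to gauge that transfer cost (a sketch, not from the thread): object.size() on the input gives a lower bound on what has to be serialized, and, if I'm reading ?furrr_options correctly, its scheduling/chunk_size arguments control how the input is split across workers. Here x_big is a hypothetical stand-in for the full >145k-entry vector:

# `x_big` is a hypothetical stand-in for the full >145k-entry input
x_big <- rep(x, length.out = 145000)
format(object.size(x_big), units = "Mb")
#> rough lower bound on the data serialized out to the workers

plan(multisession, workers = 4)
res <- furrr::future_map_dbl(
  x_big,
  get_size,
  # one chunk per worker (the default); higher scheduling values create more,
  # smaller chunks, trading extra communication for better load balancing
  .options = furrr::furrr_options(scheduling = 1)
)
plan(sequential)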

DavisVaughan commented 3 years ago

"Globals detection" relies completely on https://github.com/HenrikBengtsson/globals

In this case it probably does just detect that dplyr is the only thing that is required.
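
If you'd rather not rely on automatic detection at all, furrr also lets you name the packages to attach on each worker explicitly (a sketch; see ?furrr_options):

plan(multisession, workers = 4)
res <- furrr::future_map_dbl(
  x,
  get_size,
  # attach exactly these packages on each worker instead of relying on
  # automatic globals/package detection
  .options = furrr::furrr_options(packages = c("dplyr", "stringr"))
)
plan(sequential)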

franrodalg commented 3 years ago

Thanks so much, Davis. Really helpful!

ghost commented 2 years ago

Hi @DavisVaughan

I have the same problem!! And in my case installing the parallelly package doesn't help.

(I am working on a Windows system.)

Could you please help me?

thanks

rm(list = ls())

setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
#> Error: RStudio not running
getwd()
#> [1] "C:/Users/Angela/AppData/Local/Temp/RtmpOqCRC2/reprex-44604912759f-full-husky"

#load required packages 
library(mc2d)
#> Loading required package: mvtnorm
#> 
#> Attaching package: 'mc2d'
#> The following objects are masked from 'package:base':
#> 
#>     pmax, pmin
library(gplots)
#> 
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#> 
#>     lowess
library(RColorBrewer)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyverse)
library(furrr)
#> Loading required package: future
library(future)   #for parallel computation
#remotes::install_github("HenrikBengtsson/parallelly", ref="develop") #to use multisession
library(parallelly)
library(tictoc)

set.seed(99)
iters<-1000

df<-data.frame(id=c(1:30),cat=c(rep("a",12),rep("b",18)),month=c(1:6,1,6,4,1,5,2,3,2,5,4,6,3:6,4:6,1:5,5),n=rpois(30,5))

df$n[df$n == 0] <- 3
se<-rbeta(iters,96,6)
epi.a<-rpert(iters,min=1.5, mode=2, max=3)
p=0.2
p2=epi.a*p

df<-as_tibble(df)
# this function iterates each `n` from `df` over the 1000 `se` and `p2` values, generating 1000 results per row
plan(multisession)
tic()
iter_n <- function(n) future_map2_dbl(.x = se, .y = p2, ~ 1 - (1 - .x * .y) ^ n)
list_1 <- df %>%
  mutate(Result = future_map(n, ~ iter_n(.x))) %>%
  unnest(Result) %>%
  group_split(month)
toc()
#> 2.22 sec elapsed
plan(sequential)

#the same without parallelization 

tic()
iter_n <- function(n) map2_dbl(.x = se, .y = p2, ~ 1 - (1 - .x * .y) ^ n)
list_1 <- df %>%
  mutate(Result = map(n, ~ iter_n(.x))) %>%
  unnest(Result) %>%
  group_split(month)
toc()
#> 0.08 sec elapsed

Created on 2022-05-08 by the reprex package (v2.0.1)