Closed franrodalg closed 3 years ago
What OS are you on?
A few things here. First off, the fact that plan(multisession) doesn't work in RStudio was an R bug that has been fixed in R-devel (4.2.0), and you can read a little about that here: https://github.com/HenrikBengtsson/parallelly/issues/54
The quickest fix for you is to install the development version of parallelly, which is what future uses to set up the PSOCK nodes. It has a patch for R < 4.2.0 that will allow you to use it from RStudio again.
remotes::install_github("HenrikBengtsson/parallelly", ref="develop")
Second, remember that when your code runs on a parallel worker, that worker also has to load any packages required to run the code. So at the very least this seems to need tidyverse and stringr on each worker, which adds to the total time. Here is an approximate cost of just loading these packages:
library(tictoc)
tic(); library(tidyverse); library(stringr); toc()
#> 0.985 sec elapsed
Here is what the total time looks like for me:
library(dplyr)
library(stringr)
library(tictoc)
library(future)
x <- c(
"218H49M27H", "132H52M228H", "5H54M269H", "16H57M185H",
"1407H15M2I20M1D21M30H", "92H31M1D10M3D20M34H", "116H14M1D3M1D5M1I38M24H",
"226H18M2D1M1D43M32H","147H20M3D42M276H", "119H62M26H"
)
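get_size <- function(x) {
  # Split a string such as "218H49M27H" into operation letters and their sizes
  tibble(
    op = unlist(str_extract_all(x, "[A-Za-z]+")),
    size = as.numeric(unlist(str_extract_all(x, "[0-9]+")))
  ) %>%
    # Keep only the M and D operations and add up their sizes
    filter(op %in% c("M", "D")) %>%
    group_by(op) %>%
    summarise(size = sum(size)) %>%
    pull(size) %>%
    sum()
}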
plan(sequential)
tic()
purrr::map_dbl(x, get_size)
#> [1] 49 52 54 57 57 65 62 65 65 62
toc()
#> 0.089 sec elapsed
plan(multisession, workers=4)
tic()
furrr::future_map_dbl(x, get_size)
#> [1] 49 52 54 57 57 65 62 65 65 62
toc()
#> 1.112 sec elapsed
plan(sequential)
This seems to make sense to me. A little over 1 second due to the cost of loading the required packages. Maybe it is more expensive to load the packages on your computer.
Nevertheless, I wouldn't recommend using furrr for things that typically take a few seconds or less anyways. Shoot for parallelizing things that take minutes or hours instead, so the benefits outweigh the static costs you'll have to pay to set up the workers.
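To make the "static costs" point concrete, here is a rough back-of-the-envelope sketch in base R. The model is an assumption on my part, not a measurement: it treats the worker setup as a fixed cost of about 1 second (roughly what the timings above suggest), takes the per-element cost from the sequential run (0.089 s for 10 elements), and ignores data transfer entirely. The names setup, per_item, seq_time, and par_time are made up for illustration.

```r
# Simplified cost model (an assumption, not a benchmark):
#   sequential time = n * per_item
#   parallel time   = setup + n * per_item / workers
setup    <- 1.0         # seconds; approximate fixed overhead seen above
per_item <- 0.089 / 10  # seconds per element, from the sequential run above
workers  <- 4

seq_time <- function(n) n * per_item
par_time <- function(n) setup + n * per_item / workers

# Parallel only wins once n * per_item * (1 - 1 / workers) > setup
break_even <- setup / (per_item * (1 - 1 / workers))
ceiling(break_even)
#> [1] 150
```

Under these assumptions you would need on the order of 150 elements just to break even, and far more before the speedup becomes noticeable. Real overheads vary by machine, so treat the number as an order of magnitude at best.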
Hi Davis!
Thanks so much for your answer.
> What OS are you on?
I'm working on macOS 10.13 at the moment.
> The quickest fix for you is to install the development version of parallelly, which is what future uses to set up the PSOCK nodes. It has a patch for R < 4.2.0 that will allow you to use it from RStudio again.
Thanks! It seemed to work well.
> remember that when your code runs on a parallel worker, that worker also has to load any packages required to run the code. So at the very least this seems to need tidyverse and stringr on each worker, which adds to the total time.
Oh! I didn't realise that was the case. It makes sense!
> Nevertheless, I wouldn't recommend using furrr for things that typically take a few seconds or less anyways. Shoot for parallelizing things that take minutes or hours instead, so the benefits outweigh the static costs you'll have to pay to set up the workers.
I used that short example just to facilitate the explanation here, but my data is substantially larger (and that's why I looked for a way to run in parallel). I had been doing some tests with 100 to 1000 input values, and I didn't see any improvement whatsoever when using furrr instead of purrr, but I've now run the entire dataset (>145k entries), and there is indeed some speedup (~430 instead of ~730 seconds), though possibly less than I imagined when requesting 4 "processes".
Is that improvement within the expected range?
Thanks again, Fran
Quick question. How does furrr know which package to load in order to run the code? For instance, would it load dplyr instead of the entire tidyverse?
> Is that improvement within the expected range?
It seems reasonable. The other thing to keep in mind is that you have to actually send the data to the workers as well. So that 145k row data set has to be broken up and sent off to the workers, which takes some time too.
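For a sense of scale, here is a small base R sketch of what the reported timings imply. It uses only the two numbers quoted above (~730 s sequential, ~430 s with 4 workers). The Amdahl's law step is a simplification I am assuming for illustration: it lumps worker setup, data transfer, and any other non-parallel work into a single "serial" share.

```r
t_seq   <- 730  # seconds with purrr (reported above)
t_par   <- 430  # seconds with furrr (reported above)
workers <- 4

speedup    <- t_seq / t_par      # how many times faster
efficiency <- speedup / workers  # fraction of the ideal 4x actually achieved

# Amdahl's law: speedup = 1 / ((1 - p) + p / workers); solve for p,
# the share of the run time that effectively ran in parallel
p <- (1 - 1 / speedup) / (1 - 1 / workers)

round(c(speedup = speedup, efficiency = efficiency, parallel_fraction = p), 2)
# speedup is about 1.7, efficiency about 0.42, parallel_fraction about 0.55
```

Read this way, roughly half of the wall-clock time behaved as if it were serial, which fits the point about setup and data transfer costs.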
"Globals detection" relies completely on https://github.com/HenrikBengtsson/globals
In this case it probably just detects that dplyr is the only package required.
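If you want to peek at what gets detected, the globals package can be called directly. This is only a sketch: the expression and the variable x are made up for illustration, it assumes the globals package is installed, and the exact result can differ between versions, so I have not included expected output.

```r
library(globals)

x <- c(1, 2, 3)                    # hypothetical data, for illustration only
expr <- quote(median(x) + sd(x))   # hypothetical expression to inspect

# Find the global objects the expression refers to
g <- globalsOf(expr)
names(g)

# Which packages do those globals come from?
pkgs <- packagesOf(g)
pkgs
```

The idea is that only the packages owning the detected functions need to be attached on the workers, rather than everything attached in your main session.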
Thanks so much, Davis. Really helpful!
Hi @DavisVaughan
I have the same problem, and in my case installing the parallelly package doesn't help.
(I am working on a Windows system.)
Could you please help me?
Thanks
rm(list = ls())
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
#> Error: RStudio not running
getwd()
#> [1] "C:/Users/Angela/AppData/Local/Temp/RtmpOqCRC2/reprex-44604912759f-full-husky"
#load required packages
library(mc2d)
#> Loading required package: mvtnorm
#>
#> Attaching package: 'mc2d'
#> The following objects are masked from 'package:base':
#>
#> pmax, pmin
library(gplots)
#>
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#>
#> lowess
library(RColorBrewer)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyverse)
library(furrr)
#> Loading required package: future
library(future) #for parallel computation
#remotes::install_github("HenrikBengtsson/parallelly", ref="develop") #to use multisession
library(parallelly)
library(tictoc)
set.seed(99)
iters<-1000
df<-data.frame(id=c(1:30),cat=c(rep("a",12),rep("b",18)),month=c(1:6,1,6,4,1,5,2,3,2,5,4,6,3:6,4:6,1:5,5),n=rpois(30,5))
df$n[df$n == 0] <- 3
se<-rbeta(iters,96,6)
epi.a<-rpert(iters,min=1.5, mode=2, max=3)
p=0.2
p2=epi.a*p
df<-as_tibble(df)
# iter_n() combines a single `n` from `df` with every (se, p2) pair, producing `iters` (1000) results per row
plan(multisession)
tic()
iter_n <- function(n) future_map2_dbl(.x = se, .y = p2, ~ 1 - (1 - .x * .y) ^ n)
list_1 <- df %>% mutate(Result = future_map(n, ~iter_n(.x))) %>% unnest(Result)%>% group_split(month)
toc()
#> 2.22 sec elapsed
plan(sequential)
#the same without parallelization
tic()
iter_n <- function(n) map2_dbl(.x = se, .y = p2, ~ 1 - (1 - .x * .y) ^ n)
list_1 <- df %>% mutate(Result = map(n, ~iter_n(.x))) %>% unnest(Result)%>% group_split(month)
toc()
#> 0.08 sec elapsed
Created on 2022-05-08 by the reprex package (v2.0.1)
Hi everyone,
I'm very sorry if I'm missing something extremely obvious, but I was playing around with furrr::future_map_dbl in case it could help me speed up some code that was taking ages, and I was quite surprised to see that not only did it not become faster, it actually became slower. I have checked the documentation, and it suggests that grouped data frames could cause a similar issue, but I don't think that's the case here.
I hope the formatting for the example code is correct, but my apologies if it isn't. I have tried using reprex, but it doesn't seem to work for me outside RStudio, and plan(multisession) crashes every time I try to run it from RStudio (any advice in that regard would also be appreciated!).
Is there anything in my code that could explain why furrr's version is substantially slower?
Cheers, Fran