Closed: elgabbas closed this issue 2 years ago
For the first one, it is the progress bar that is taking so long. The progress bar should mainly be used for things where each individual iteration takes a relatively large amount of time, otherwise the overhead of the progress bar outweighs its usefulness.
Also note that the progress bar is deprecated, and should not really be used anymore. I will eventually remove it in favor of the progressr package.
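As a rough sketch of what the progressr-based approach looks like (the worker count, element count, and `slow_fn` here are just illustrative, assuming a multisession plan):

```r
library(furrr)
library(progressr)

future::plan(future::multisession, workers = 3)

slow_fn <- function(x) {
  Sys.sleep(0.1)  # simulate per-element work expensive enough to be worth parallelizing
  x * 2
}

xs <- 1:30

# progressr relays progress signals from the workers back to the main session
with_progress({
  p <- progressor(steps = length(xs))
  res <- future_map_dbl(xs, function(x) {
    p()  # signal one completed step
    slow_fn(x)
  })
})
```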
I'm not surprised that furrr is slower here. When the total time is under 5 seconds or so, I expect map() to beat future_map() basically every time.
require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession, workers = 3)
# Create some large dataset
Data <- as_tibble(mtcars)
Data <- vctrs::vec_rep(Data, 50000)
Data$ID <- vctrs::vec_rep_each(1:50000, nrow(mtcars))
Data
#> # A tibble: 1,600,000 × 12
#> mpg cyl disp hp drat wt qsec vs am gear carb ID
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 1
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 1
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 1
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 1
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 1
#> # … with 1,599,990 more rows
tictoc::tic()
xx <- Data %>% mutate(disp2 = map_dbl(disp, identity))
tictoc::toc()
#> 0.943 sec elapsed
tictoc::tic()
xx <- Data %>% mutate(disp2 = future_map_dbl(disp, identity))
tictoc::toc()
#> 1.538 sec elapsed
Created on 2022-05-12 by the reprex package (v2.0.1)
I'll address the second question in a moment...
For the second question, you just forgot to ungroup() after the nest(). If you give nest() a grouped data frame, it remains grouped after the nesting (for better or worse). This prevents future_map() from doing what it is good at: partitioning the data over the workers. Because there are 50,000 groups, mutate() is calling future_map() 50,000 times. This makes map() run slower too.
It is exactly the problem outlined in the Common Gotchas vignette.
require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession, workers = 3)
# Create some large dataset
Data <- as_tibble(mtcars)
Data <- vctrs::vec_rep(Data, 50000)
Data$ID <- vctrs::vec_rep_each(1:50000, nrow(mtcars))
NestedData <- Data %>%
group_by(ID) %>%
nest() %>%
ungroup()
tictoc::tic()
xx <- mutate(NestedData, data2 = map(data, identity))
tictoc::toc()
#> 0.105 sec elapsed
tictoc::tic()
xx <- mutate(NestedData, data2 = future_map(data, identity))
tictoc::toc()
#> 9.069 sec elapsed
This overhead is acceptable to me, given that furrr has to shuffle the nested data frames to and from the workers.
On my computer, furrr is slower than purrr with the same code. Do you know why?
Take a closer look at https://github.com/DavisVaughan/furrr/issues/234#issuecomment-1125239827
I'm already showing an example where furrr is slower. That's perfectly normal when you are sending over large datasets to each worker and then running an extremely cheap function on each one of them.
When doing parallel work, there can be large costs to sending "big" datasets over to the workers, which is not something that sequential evaluation has to do.
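To make that cost concrete, here is a small sketch (the element count and sizes are illustrative) mimicking the nested case: a list of moderately large data frames with an extremely cheap per-element function, where serializing the payload to the workers dominates the runtime:

```r
library(furrr)
future::plan(future::multisession, workers = 3)

# Each element is a moderately large data frame, mimicking the nested case
chunks <- replicate(200, data.frame(x = runif(1e4)), simplify = FALSE)

# The total payload that must be serialized and shipped to the workers
print(object.size(chunks), units = "MB")

# identity() is extremely cheap per element, so the transfer cost dominates
# and sequential purrr::map() will typically win here
res <- future_map(chunks, identity)
```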
Also, in the future, we'd prefer if you open new issues rather than commenting on old ones. It is easier for us to keep track of!
Hello,
I would like to use the furrr package to run some row-wise analysis of my data. I find that using furrr is slower than purrr, which @hadley also reported here: #41. Here is a reprex.
Here, I apply a simple function to each row of the data. There is a time difference, but not a huge one.
However, when applying another simple function to a nested dataset, it works fine using purrr but takes ages to run (if it finishes at all) using furrr, which is weird. I tested this on two different R installations (Windows, R 4.2.0, furrr 0.3.0; and RStudio Server, R 3.6, furrr 0.2.3).
Is there a reason for this? Any advice on making a parallel analysis of nested datasets faster?
Cheers, Ahmed