CorrelAidSwitzerland / a4d

The project repository for the Data4Good project in collaboration with A4D

Using future_map instead of loops in the main scripts to improve performance with multisession #101

Open lboel opened 11 months ago

lboel commented 11 months ago

I was experimenting with redesigning the current loops in the main scripts as purrr::map calls and tried a simple furrr::future_map to make them multisession. With the right number of workers I ended up at roughly half of the current runtime on the demo data. Logging seems to be an issue, but at least for the raw file read-in and clean-up (everything that ends in a temp file), multisession should improve run speed by around 30-50%, and with furrr I had no errors on Windows. A rough sketch of what I mean is below.

https://furrr.futureverse.org/
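A minimal sketch of the idea (not the actual pipeline code): a loop over tracker files replaced by furrr::future_map under a multisession plan. `process_tracker_file()` is a hypothetical stand-in for the per-file read-in/clean-up step that ends in a temp file.

```r
library(furrr)
library(future)

# Hypothetical path/pattern; the real tracker file discovery may differ.
tracker_files <- list.files("data/raw", pattern = "\\.xlsx$", full.names = TRUE)

plan(multisession, workers = 4)  # pick workers to match available cores

results <- future_map(
  tracker_files,
  process_tracker_file,                    # hypothetical per-file function
  .options = furrr_options(seed = TRUE),   # reproducible RNG across workers
  .progress = TRUE
)

plan(sequential)  # restore the default plan afterwards
```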

So the idea would be:

This would require thinking about some future::plan issues if we want to nest future_map calls: https://stackoverflow.com/questions/61506909/nested-furrrfuture-map (see the sketch below).
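A sketch of how a nested topology could be declared, assuming an outer future_map over tracker files and an inner one over sheets within a file: future::plan() accepts a list of strategies, one per nesting level, so each level spawns only a bounded number of workers.

```r
library(future)

plan(list(
  tweak(multisession, workers = 2),  # outer level: tracker files
  tweak(multisession, workers = 3)   # inner level: sheets within one file
))
```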

So I think we will win in terms of readability by using purrr::map instead of loops, and any furrr multisession improvement will then be easy to implement.

pmayd commented 10 months ago

@lboel I think we should revisit this idea and give it some priority, because it will again change quite a bit, but I really do think we should parallelize again. To let Oleg still run the code locally, we could simply split the main functions into a code block that executes the pipeline in parallel and one that doesn't, based on a parameter set to TRUE or FALSE (see the sketch below). I guess the easiest way to get everything working is to make a proper package out of a4d, so that we can call functions with a4d:: instead of having to source everything, correct? Let's try this.
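A sketch of that TRUE/FALSE switch: the same pipeline body runs either sequentially with purrr::map (e.g. for local debugging) or in parallel with furrr::future_map. `run_pipeline_for_file()` is a hypothetical per-file function, not an existing one.

```r
library(purrr)
library(furrr)
library(future)

run_pipeline <- function(tracker_files, parallel = FALSE, workers = 4) {
  if (parallel) {
    plan(multisession, workers = workers)
    on.exit(plan(sequential), add = TRUE)  # always restore the default plan
    future_map(tracker_files, run_pipeline_for_file,
               .options = furrr_options(seed = TRUE))
  } else {
    map(tracker_files, run_pipeline_for_file)
  }
}
```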

lboel commented 10 months ago

Indeed, a proper package will help a lot. Once we have a real package, multisession becomes easy. Especially because the tracker files are independent, we could easily chunk the task by tracker file.

I could imagine that having a real package would allow us to start the process for chunks of tracker files and then run some kind of merge at the end, without even implementing furrr or other parallel code in the package itself, but rather in a runner script for the package (sketched below).
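A sketch of such a runner script, assuming hypothetical exported functions a4d::process_tracker() and a4d::merge_results(): the script chunks the independent tracker files, processes each chunk in its own session, and merges at the end, so the package itself stays free of parallel code.

```r
library(furrr)
library(future)

# Hypothetical path/pattern for the tracker files.
tracker_files <- list.files("data/raw", pattern = "\\.xlsx$", full.names = TRUE)
chunks <- split(tracker_files,
                cut(seq_along(tracker_files), breaks = 4, labels = FALSE))

plan(multisession, workers = 4)
chunk_results <- future_map(chunks, function(files) {
  lapply(files, a4d::process_tracker)      # hypothetical exported function
})
plan(sequential)

final <- a4d::merge_results(               # hypothetical merge step
  unlist(chunk_results, recursive = FALSE)
)
```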

pmayd commented 10 months ago

I don't really get the last point: how will you process the chunks, if not in parallel, to speed up the code? So yes, we don't need it IN the package, of course; as now, this would happen at the outer level, in the main pipeline scripts that iterate over the tracker list. And this outer for loop is the most natural point for adding parallelisation, I guess. The only problem in the past was that the code in the newly spawned processes did not have the packages that were loaded with devtools::load_all(), so we had to source() everything so that those processes had access to the functions. I hope this is not necessary with a real package, because we can simply call all functions with a4d::. The only thing left is to load a4d once inside the workers, or something like this, depending on the package we use for parallelisation.
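If furrr is the parallelisation package, a sketch of that last point: furrr can attach packages inside each worker via furrr_options(packages = ...), so no source() calls are needed for the spawned sessions to see the pipeline functions. a4d::process_tracker() is again a hypothetical exported function.

```r
library(furrr)
library(future)

plan(multisession, workers = 4)

results <- future_map(
  tracker_files,
  ~ a4d::process_tracker(.x),  # hypothetical exported function
  .options = furrr_options(packages = "a4d", seed = TRUE)  # load a4d on each worker
)

plan(sequential)
```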

So could you please open a new issue about making a4d a proper package and what we need for that?

pmayd commented 10 months ago

I guess Konrad already mentioned the important points in the Slack discussion.