HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

The efficiency of future, multisession with RStudio #488

Closed stemangiola closed 3 years ago

stemangiola commented 3 years ago

As far as I know, multisession does not work with RStudio; this means that for parallelizing in RStudio, the whole session must be copied instead of only the data needed for parallelization.

This makes parallelization impractical when working with big datasets and heavy RStudio sessions (e.g. using Seurat). In short, R effectively becomes single-core for me, which is a pity in 2021.

Is there something I am missing? How can I use R parallelization when working from RStudio?

Thanks a lot.

HenrikBengtsson commented 3 years ago

Hi.

As far as I know, multisession does not work with RStudio, ...

The 'multisession' plan does work in the RStudio Console. It's the 'multicore' (= forked processing) one that we protect against. See the warning you get when you try to use plan(multicore), and see ?parallelly::supportsMulticore for why - there's also a reference to the RStudio issue tracker discussing the problem. It also explains how to override this protection. I would not do that unless I am 100% sure the environment and the full software stack is fork safe.
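For illustration, a minimal sketch of what this looks like in the RStudio Console (worker count is arbitrary):

```r
library(future)

# 'multisession' launches background R sessions (PSOCK workers) and is
# safe to use from the RStudio Console.
plan(multisession, workers = 2)

f <- future(Sys.getpid())
value(f)   # PID of a background worker, not of the RStudio session

# 'multicore' (forked processing) is disabled by default in RStudio;
# this reports whether forked processing is considered supported here.
parallelly::supportsMulticore()
```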

... this means that for parallelizing in RStudio, the whole session must be copied instead of only the data needed for parallelization.

The future framework tries its best to be conservative and identify what needs to be exported. In the case of 'multisession' workers (= parallel PSOCK workers), it'll export the objects needed to evaluate the expression on the worker. It won't export the whole R session.
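To make that concrete, here is a small sketch (object names are made up); only the object actually referenced in the future expression is exported:

```r
library(future)
plan(multisession, workers = 2)

big   <- rnorm(1e7)          # large object living only in the main session
small <- c("a", "b", "c")    # small object actually used by the future

# Only 'small' is referenced in the expression below, so only 'small'
# is identified as a global and exported to the worker; 'big' is not.
f <- future(toupper(small))
value(f)
```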

In summary: forked processing in R seems like magic, but it can cause lots of problems - problems that are sometimes silent or random. For it to work safely, all R packages in the software dependency stack need to be fork safe, but unfortunately, not all package developers are aware of the problem or test for it.

This problem is not specific to the future framework - it's a problem for all parallelization solutions in R. The future framework is just a "thin" API layer on top of those solutions.

Hope this helps

stemangiola commented 3 years ago

The 'multisession' plan does work in the RStudio Console. It's the 'multicore' (= forked processing) one that we protect against. See the warning you get when you try to use plan(multicore), and see ?parallelly::supportsMulticore for why - there's also a reference to the RStudio issue tracker discussing the problem.

Yes sorry, my bad.

It also explains how to override this protection. I would not do that unless I am 100% sure the environment and the full software stack is fork safe.

Thanks, good to know

The future framework tries its best to be conservative and identify what needs to be exported. In the case of 'multisession' workers (= parallel PSOCK workers), it'll export the objects needed to evaluate the expression on the worker. It won't export the whole R session.

I see. For example, it would be amazing if, when I use future_map, only the element of the list (e.g. a character array of 10 Kb) were exported rather than (I assume) the whole dataset (e.g. a 10 Gb object).
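For what it's worth, this is largely how furrr::future_map() already behaves when the code is structured so that the large object is not referenced inside the mapped function. A hedged sketch, with made-up object names and a small stand-in for the 10 Gb dataset:

```r
library(future)
library(furrr)   # provides future_map(); builds on the future framework
plan(multisession, workers = 2)

# Stand-in for the large dataset (imagine 10 Gb).
big <- list(a = rnorm(1e6), b = rnorm(1e6), c = rnorm(1e6))

# Problematic: 'big' is referenced inside the function, so the whole
# object is detected as a global and shipped to every worker.
res1 <- future_map(names(big), function(nm) summary(big[[nm]]))

# Preferable: map over the list itself; its elements are chunked across
# the workers, so each worker only receives the elements it processes.
res2 <- future_map(big, summary)
```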

This problem is not specific to the future framework - it's a problem for all parallelization solutions in R. The future framework is just a "thin" API layer on top of those solutions.

Yes, I had the feeling that this was a very fundamental issue of the HTML interface to R and its communication protocol.

Do you think RStudio will ever have the possibility and/or plans (I don't know with what technology) to solve these parallelization limitations?

HenrikBengtsson commented 3 years ago

Yes, I had the feeling that this was a very fundamental issue of the HTML interface to R and its communication protocol.

Hmm... I don't understand this comment.

Do you think RStudio will ever have the possibility and/or plans (I don't know with what technology) to solve these parallelization limitations?

I think the RStudio Console is just one example of what's mentioned in the 'Warning' section of ?parallel::mclapply, i.e. forked parallel processing and GUIs are not the best of friends. I don't think it's easy to "fix" - but only RStudio can say what their plans are.

It's good that RStudio made an official comment about it, so people can make an active decision about it. In the future framework, we chose to protect unaware users by disabling forked processing by default when running in the RStudio Console. We might do the same for other environments where we know it's also shaky.

I would not write code that relies on forked processing alone. Here is what the author of mclapply() wrote in the R-devel thread 'mclapply returns NULLs on MacOS when running GAM' (https://stat.ethz.ch/pipermail/r-devel/2020-April/079384.html) on 2020-04-28:

Do NOT use mcparallel() in packages except as a non-default option that user can set for the reasons Henrik explained. Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you don't know the resource available so only the user can tell you when it's safe to use. Multi-core machines are often shared so using all detected cores is a very bad idea. The user should be able to explicitly enable it, but it should not be enabled by default.
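That advice maps naturally onto the future API: code can be written once without hard-coding a backend, and the end user decides whether and how it runs in parallel. A minimal sketch (the function names here are made up for illustration):

```r
library(future)
library(future.apply)   # provides future_lapply()

# Analysis/package code written once against the future API, agnostic
# of the backend that will eventually be used.
slow_step    <- function(x) { Sys.sleep(0.1); x^2 }
run_analysis <- function(xs) future_lapply(xs, slow_step)

# The end user opts in (or not) to parallelization:
plan(sequential)                  # default: everything runs sequentially
res_seq <- run_analysis(1:8)

plan(multisession, workers = 4)   # opt in: four background R sessions
res_par <- run_analysis(1:8)
```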

stemangiola commented 3 years ago

Thanks for the explanation and your time.