HenrikBengtsson / doFuture

R package: doFuture - Use Foreach to Parallelize via Future Framework
https://doFuture.futureverse.org

How can I monitor what is being passed out to the workers? #40

Closed statquant closed 3 years ago

statquant commented 4 years ago

Hello, I am working with a list of big objects (around 100 MB per element). When using foreach I do:

library(doFuture)
registerDoFuture()
plan(multicore)
foreach(data = data_list, .noexport = c('data_list')) %dopar% {
  some_fun(data)
}

I can see that the workers take a long time before they start doing any work, and I fear this is because too much data is passed to them (I can tell because, using doMC, I see log statements as they happen on the workers). How can I monitor exactly what is passed to the workers? Ideally I'd like to pass all functions but only a handful of objects.

HenrikBengtsson commented 4 years ago

If you set options(future.debug=TRUE), you will find, among all the output, what's exported.

See also ?future.options - there's a built-in mechanism that protects against exporting objects that are too large.
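
For example, a minimal sketch (future.globals.maxSize and its 500 MiB default are documented in ?future.options; the limit chosen here is illustrative):

options(future.debug = TRUE)                     ## log what is identified and exported
options(future.globals.maxSize = 1000 * 1024^2)  ## raise the per-future export limit to ~1000 MiB

If the total size of the globals for a single future exceeds future.globals.maxSize, the future errors out rather than silently shipping them.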

statquant commented 4 years ago

Thank you, I've run the following example

library(doFuture)
registerDoFuture()
plan(multicore)
options(future.debug=TRUE)
foreach(i = 1:10) %dopar% matrix(rnorm(9e6), 1000) -> data_list
foreach(data = data_list, .noexport = 'data_list') %dopar% nrow(data)

The data_list list is > 500 MB, and it seems that it needs to be "exported". I am not sure what that means - does it mean that the entire data_list will be exported to each worker? What I would have perhaps naively assumed is that only one element would be shipped to each worker (because of data = data_list).

I realise that my understanding of what happens is weak, and that might be the issue. I thought that in this case processes would be forked, hence memory used by the master need not be copied, and that only objects modified on a worker would trigger a copy. If that were true, I do not know why each worker would need to receive the whole list.

HenrikBengtsson commented 4 years ago

To make sure I understand your understanding: what is the intent of

foreach(i = 1:10) %dopar% matrix(rnorm(9e6), 1000) -> data_list

Why do you create data_list in parallel (via %dopar%)? Is that essential to your problem?

statquant commented 4 years ago

Hello, lol sorry, no, this is not essential at all - I just find it convenient and clear. From this code I expect to create a large list from which only one element would be passed to each worker (as opposed to the entire list itself). What is not clear to me is:

  • in the case of multicore, what exactly happens (for instance, is the entire list passed to each worker)?
  • in the case of slurm (or an equivalent scheduler), what exactly happens?

HenrikBengtsson commented 4 years ago

No worries, I just wanted to rule out the slight possibility that you were expecting that line to somehow make data_list appear on the workers.

What is not clear to me is:

  • in the case of multicore, what exactly happens (for instance, is the entire list passed to each worker)?
  • in the case of slurm (or an equivalent scheduler), what exactly happens?

First, it all depends on how each future/parallel backend is implemented. But a good rule of thumb is that they are all implemented to export as little as possible. As far as I recall, all backends report what they export in the future.debug=TRUE output.

For multicore nothing will be exported, because, as you say, forked processing is used.
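
For instance, a minimal sketch (on a Unix-like system, where multicore is supported; the object big is made up):

library(doFuture)
registerDoFuture()
plan(multicore, workers = 2)

big <- matrix(rnorm(9e6), nrow = 1000)

## Forked (multicore) workers inherit `big` from the parent process via
## copy-on-write, so nothing needs to be serialized over to them.
res <- foreach(i = 1:2) %dopar% nrow(big)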

For future.batchtools::batchtools_slurm you will see that all the identified globals will be exported. In your original example, data_list should not be picked up as a global variable (identification of globals works the same for all future backends), so it should not be exported either.

I often use multisession to validate that my future code works on all backends, because if there is a false negative among the globals (i.e. one global is not automatically picked up), then it is very likely that you'll get a run-time error from the worker. Using multisession with options(future.debug=TRUE) will show all the globals identified and exported. You can expect future.batchtools::batchtools_slurm to work very similarly.

In addition, you can set the still unofficial, non-documented options(future.delete = FALSE). This will prevent future.batchtools from cleaning up the batchtools folders created under .future/. That will allow you to browse those job folders to see exactly which globals are exported. There will be a few files with random names, but at least you can peek at their sizes to see if something stands out, e.g.

$ ls -l .future/20191111_100803-Xyfozb/batchtools_2090193201/exports/
total 20
-rw-r--r-- 1 hb hb 14286 Nov 11 10:08 IZKU4.rds
-rw-r--r-- 1 hb hb    81 Nov 11 10:08 MFZGO4Y.rds
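
As a concrete sketch of that validation workflow (the sizes and worker count here are illustrative):

library(doFuture)
registerDoFuture()
plan(multisession, workers = 2)  ## non-forked workers, so exports are real
options(future.debug = TRUE)

data_list <- lapply(1:10, function(i) matrix(rnorm(9e6), nrow = 1000))

## The debug output lists each global that is identified and exported;
## with `data = data_list`, each future receives only its own chunk of
## elements, not the whole list.
res <- foreach(data = data_list) %dopar% nrow(data)
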
HenrikBengtsson commented 3 years ago

FWIW, since argument .noexport was part of this report/question: I've just fixed a bug (Issue #56) where doFuture completely ignored the .noexport argument.
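
For example (a sketch reusing the objects from earlier in this thread):

## With the fix, a global listed in .noexport is dropped before export:
foreach(data = data_list, .noexport = "data_list") %dopar% nrow(data)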

statquant commented 3 years ago

Thank you for your comment. I now regularly use options(future.delete = FALSE) when I have a problem, and it is extremely helpful. I should have closed this question a long time ago. Many thanks for your detailed answer and your packages.

HenrikBengtsson commented 3 years ago

No worries. Hopefully, if we can get a working framework for hook functions implemented (https://github.com/HenrikBengtsson/future/issues/172), it will open up doors for interacting with futures in many ways, including tidy reporting of globals being exported, e.g. in the terminal or via a fancy HTML dashboard.