Closed statquant closed 3 years ago
If you set options(future.debug=TRUE), you will find, among all the output, what's exported.
See also ?future.options - there's also a built-in mechanism protecting against exporting objects that are too large.
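For reference, the size guard mentioned above is controlled by an R option; a minimal sketch (the option name future.globals.maxSize and its default of roughly 500 MiB are as documented in ?future.options, to the best of my recollection):

```r
library(future)

## Cap on the total size of globals exported per future; an error is
## thrown if the limit is exceeded.  Raised to ~1 GiB here as an example.
options(future.globals.maxSize = 1024^3)
```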
Thank you, I've run the following example:

```r
library(doFuture)
registerDoFuture()
plan(multicore)
options(future.debug = TRUE)

data_list <- foreach(i = 1:10) %dopar% matrix(rnorm(9e6), 1000)
foreach(data = data_list, .noexport = "data_list") %dopar% nrow(data)
```
The data_list list is > 500 MB, and it seems that it needs to be "exported". I am not sure what that means: does it mean that the entire data_list will be exported to each worker? What I would perhaps naively have assumed is that only one element would be shipped to each worker (because of data = data_list).

I realise that my understanding of what happens is weak, and that might be the issue. I thought that in this case processes would be forked, so memory used by the master need not be copied, and that objects modified on a worker would trigger a copy. If that were true, I do not know why each worker would need to receive the whole list.
To check my understanding of your intent: what is the purpose of

```r
data_list <- foreach(i = 1:10) %dopar% matrix(rnorm(9e6), 1000)
```

Why do you create data_list in parallel (via %dopar%)? Is that essential to your problem?
Hello, lol sorry, no, this is not essential at all; I just find it convenient and clear. From this code I expect to create a large list from which only one element would be passed to each worker (as opposed to the entire list itself).
No worries, I just wanted to rule out the slight possibility that you were expecting data_list to somehow, from that line, appear on the workers.
What is not clear to me is:
- in the case of multicore, what exactly happens (for instance, is the entire list passed to each worker)?
- in the case of Slurm (or equivalent), what exactly happens?
First, it all depends on how each future/parallel backend is implemented, but a good rule of thumb is that they are all implemented to export as little as needed. As far as I recall, all backends report what they export in the future.debug=TRUE output.

For multicore, nothing will be exported because, as you say, forked processing is used.
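As a side note (my own illustration, not part of the original exchange), the future package exports a helper, getGlobalsAndPackages(), that lets you inspect which globals would be identified, and hence exported by a non-forking backend, for a given expression:

```r
library(future)

data_list <- replicate(3, matrix(rnorm(9), 3), simplify = FALSE)

## Which globals would the future framework identify in this expression?
## Here 'data_list' is detected as a global.
gp <- getGlobalsAndPackages(quote(nrow(data_list[[1]])))
names(gp$globals)
```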
For future.batchtools::batchtools_slurm, you will see that all the identified globals are exported. In your original example, data_list should not be picked up as a global variable (identification of globals works the same for all future backends), so it should not be exported either. I often use multisession to validate that my future code works on all backends, because if there is a false negative among the globals (i.e. one global is not automatically picked up), then it is very likely that you'll get a run-time error from the worker. Using multisession with options(future.debug=TRUE) will show all the globals identified and exported. You can expect future.batchtools::batchtools_slurm to work very similarly. In addition, you can set the still unofficial, undocumented options(future.delete = FALSE). This will prevent future.batchtools from cleaning up the batchtools folders created under .future/. That will allow you to browse those job folders to see exactly which globals are exported. There will be a few files with random names, but at least you can peek at their sizes to see if something stands out, e.g.
```
$ ls -l .future/20191111_100803-Xyfozb/batchtools_2090193201/exports/
total 20
-rw-r--r-- 1 hb hb 14286 Nov 11 10:08 IZKU4.rds
-rw-r--r-- 1 hb hb    81 Nov 11 10:08 MFZGO4Y.rds
```
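To make the validation idea concrete, here is a small sketch (toy object sizes; my own example, not from the thread) of running the same kind of foreach loop under multisession, where exports actually happen and get logged by future.debug:

```r
library(doFuture)
registerDoFuture()
plan(multisession, workers = 2)
options(future.debug = TRUE)

data_list <- replicate(4, matrix(rnorm(9), 3), simplify = FALSE)

## Iterating over the list via the 'data' argument means each worker
## receives only its own elements, not all of 'data_list'.
res <- foreach(data = data_list) %dopar% nrow(data)
unlist(res)  # nrow() of each 3x3 element
```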
FWIW, since argument .noexport was part of this report/question: I've just fixed a bug (Issue #56) where doFuture completely ignored the .noexport argument.
Thank you for your comment. I now regularly use options(future.delete = FALSE) when I have a problem, and it is extremely helpful. I should have closed this question a long time ago.
Many thanks for your detailed answer and your packages.
No worries. Hopefully, if we can get a working framework for hook functions implemented (https://github.com/HenrikBengtsson/future/issues/172), it will open up doors for interacting with futures in many ways, including tidy reporting of globals being exported, e.g. in the terminal or via a fancy HTML dashboard.
Hello, I am working with a list of big objects (around 100 MB per element). When using foreach, I can see that the workers take a long time to start doing work, and I fear this is because too much data is passed to them (I know because with doMC I get log statements as they happen on the workers). How can I monitor exactly what is passed to the workers? Ideally I'd like to pass all functions but only a handful of objects.
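One pre-flight way to judge how much data would be shipped, before running anything in parallel, is to measure the candidate globals with base R's utils::object.size (this is my suggestion, not something from the thread):

```r
## Measure each candidate global up front, to spot the objects that
## would dominate the export payload if they were shipped to workers.
data_list <- replicate(3, matrix(rnorm(1e4), 100), simplify = FALSE)
sizes <- vapply(data_list, object.size, numeric(1))
print(sum(sizes) / 1024^2)  # total size in MiB
```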