future.cache.path option broken

rimorob commented 4 years ago

There are two issues with the future.cache.path option. First, specifying it as an option in .options.future doesn't work. Setting R_FUTURE_CACHE_PATH, however, does work. Second, this option is well-explained in the code but not yet documented. It is tremendously useful for memory-intensive jobs in the cloud or on a cluster, since the file system might be on a network drive, and pointing batchtools to a marshaling folder on a locally mounted drive can dramatically accelerate job scheduling. In my case, job scheduling speeds up at least 100x, dropping from 30% of job time to a negligible fraction, since the objects I'm sending to remote workers are largish. I wonder if it makes sense to explicitly suggest the use of a locally mounted folder for marshaling data once this feature is documented.

HenrikBengtsson commented 4 years ago

> There are two issues with the future.cache.path option. First, specifying it as an option in .options.future doesn't work.

It is an option and does indeed work, e.g. options(future.cache.path = "/path/to"). You mention .options.future, but that is an argument to foreach(), which is completely unrelated. I've added an example to ?future.batchtools::future.batchtools.options and a sentence to ?doFuture::doFuture to clarify that .options.future is an argument and why it's named the way it is, cf. https://github.com/HenrikBengtsson/future.batchtools/commit/9ff5f87172f45a97c2d67a585bca2c531accc8b3 and https://github.com/HenrikBengtsson/doFuture/commit/8fb03bcbe34b045c274a0424adb590b926e9ee73.
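As a minimal sketch of the distinction above, the option can be set and checked with base R alone. The directory below is hypothetical and uses tempdir() only so that the snippet runs anywhere; on a cluster the path would need to be visible to every worker.

```r
# Hypothetical cache location, for illustration only; a tempdir() path is
# local to this machine and would not suit a cluster scheduler.
cache_dir <- file.path(tempdir(), ".future")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

# future.cache.path is a regular R option, so it is set via options(),
# not passed through foreach()'s .options.future argument.
old_opts <- options(future.cache.path = cache_dir)

# Confirm the option is in effect for this session.
print(getOption("future.cache.path"))
```

Keeping the return value of options() in old_opts makes it possible to restore the previous setting later with options(old_opts).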

> ... I wonder if it makes sense to explicitly suggest the use of a locally mounted folder for marshaling data once this feature is documented.

In general, I try to avoid duplicating documentation that belongs elsewhere. In this case, I think the argument for where and why a registry folder should live in a specific place is better suited to the documentation of the batchtools package.

I've added a little bit more info to ?future.batchtools::future.batchtools.options on this option.

It's not clear to me what you mean by a "locally mounted folder", but if you mean it should be a local disk, then that won't work in an HPC environment.

rimorob commented 4 years ago

Thanks for your reply. Do I understand that the data is communicated to the workers by means of shared filesystem? It does make sense that that wouldn’t work, not sure why I thought it would. And it does explain why the remote workers crashed… They did launch faster, though… There’s still that problem of exporting repeat data across the workers, but that’s probably at the batchtools level, right?

Boris

On Oct 2, 2020, at 8:09 PM, Henrik Bengtsson notifications@github.com wrote:

Closed #64 https://github.com/HenrikBengtsson/future.batchtools/issues/64.


HenrikBengtsson commented 4 years ago

> Do I understand that the data is communicated to the workers by means of shared filesystem?

Yes, per design of batchtools. In contrast, the clustermq package communicates via ZeroMQ ... but a future.clustermq backend is yet to be written.

> There’s still that problem of exporting repeat data across the workers, but that’s probably at the batchtools level, right?

There are efficient and inefficient ways of working with globals/exported data when it comes to map-reduce parallel processing using future.apply, furrr, foreach+doFuture, and the like. For example, doing:

xs <- large_data_object()
y <- foreach(x = xs) %dopar% {
  fcn(x)
}

is efficient because xs is chunked up upfront and only the chunks (=x) are exported. In contrast,

xs <- large_data_object()
y <- foreach(i = seq_along(xs)) %dopar% {
  fcn(xs[i])
}

is very inefficient, because all of xs has to be exported to each worker.
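To get a rough feel for the difference, one can compare the serialized size of a single element with that of the whole object, using base R only. This is a sketch under assumptions: the list below is a made-up stand-in for large_data_object(), and serialized size is only a proxy for what a given backend actually writes or sends per task.

```r
# Stand-in for large_data_object(): a list of 10 numeric vectors.
xs <- lapply(1:10, function(i) rnorm(1e4))

# Approximate bytes needed to ship one element versus the whole list.
bytes_one_chunk <- length(serialize(xs[[1]], connection = NULL))
bytes_whole     <- length(serialize(xs, connection = NULL))

cat("one element:", bytes_one_chunk, "bytes\n")
cat("whole list: ", bytes_whole, "bytes\n")
```

In the first pattern each task carries roughly the smaller amount; in the second, every worker receives roughly the larger one.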

rimorob commented 4 years ago

The object is an R6 class object with a complex architecture that is not a list; it has some data that doesn’t even need to be exported to the workers. There’s nothing to loop over except random seed. So I just ended up having to extract the unnecessary stuff manually, and now everything works, but in some cases one does have to export large data and there’s nothing to be done. This issue doesn’t impact my specific use case, and I had a cumbersome but workable work-around, but there is almost certainly a way to keep a single copy of the exported object when it’s shared by all workers.

Boris
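The manual work-around described above might look roughly like the following sketch. It assumes a plain environment as a stand-in for the R6 object (R6 objects are environments), and every field and function name is hypothetical; lapply() stands in for the parallel map over seeds.

```r
# Stand-in for the R6 object; all field names here are hypothetical.
model <- new.env()
model$params  <- list(alpha = 0.5)
model$history <- numeric(1e6)   # large field the workers never use

# Copy only what the workers need into a small plain list.
needed <- list(params = model$params)

# Compare the approximate export cost of the full object and the subset.
bytes_full   <- length(serialize(model, connection = NULL))
bytes_needed <- length(serialize(needed, connection = NULL))

# Per-seed task that depends only on the slimmed-down list.
run_one <- function(seed, needed) {
  set.seed(seed)
  needed$params$alpha * runif(1)
}

# lapply() is a sequential stand-in for the parallel map over seeds.
results <- lapply(1:3, run_one, needed = needed)
```

This reduces what each task has to carry, but it does not address the separate wish for a single shared copy across workers.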

