kendonB closed this issue 1 year ago
The `workers` argument to `plan` doesn't work here, as I'd like to keep the jobs small so I can get load balancing working.
Thanks for the feedback, though this is a bit too sparse to fully understand what you've observed/concluded. Can you please clarify/expand on this? (*)

(*) It could be related to something I've recently realized myself (while implementing future.callr; https://github.com/HenrikBengtsson/future.callr/commit/32a8e3c2412e5cc47bc1ee99b0483670aabf4083): the `workers` argument is only respected for `future_lapply()` calls but ignored when using `future()` directly. That is something I had overlooked, and that part is a bug.
Apologies for the lack of clarity! I am using `future_lapply` at the moment. With `workers = 100`, the behavior I currently see is that the N >> 100 jobs are distributed among 100 workers, so the total runtime is determined by the slowest set of jobs sent to any one of the 100 workers.

The feature requested is an argument that would limit the total number of workers to 100 at any one time, but would create a new worker as soon as any of the first 100 completes (see the sketch below).

Of course, the `workers` argument is also very useful for keeping the overhead of creating workers from dominating the runtime.
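For concreteness, a minimal sketch of the pattern being described; `batchtools_slurm`, `x`, and `slow_fun` are illustrative placeholders, not taken from the thread:

```r
library(future)
library(future.batchtools)
library(future.apply)  # future_lapply() lives here in current releases

plan(batchtools_slurm, workers = 100)  # hypothetical HPC backend

# With default scheduling, the N >> 100 elements are split into
# 100 chunks, one per worker, so total runtime is governed by the
# slowest chunk rather than balanced dynamically.
res <- future_lapply(x, slow_fun)
```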
Still not 100% sure (because of naming-convention clashes around what is meant by a "job", a "future", and a "worker"), but does the argument `future.scheduling` (especially if set to `FALSE`) for `future_lapply()` provide what you need? The docs for `future.scheduling` say:

> Average number of futures ("chunks") per worker. If `0.0`, then a single future is used to process all elements of `x`. If `1.0` or `TRUE`, then one future per worker is used. If `2.0`, then each worker will process two futures (if there are enough elements in `x`). If `Inf` or `FALSE`, then one future per element of `x` is used.
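A sketch of how the different `future.scheduling` values play out, using the plain `multisession` backend so it runs anywhere (the backend choice is illustrative only):

```r
library(future)
library(future.apply)

plan(multisession, workers = 4)
x <- 1:12

# One future ("chunk") per worker: 4 futures of 3 elements each.
res1 <- future_lapply(x, sqrt, future.scheduling = 1)

# One future per element of x: 12 futures, handed out as workers free up.
res2 <- future_lapply(x, sqrt, future.scheduling = FALSE)
```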
For the `future` package, is it appropriate to refer to each element of `x` as a job, and each cluster process (when using distributed) as a worker? Then, as I understand it, a future is some chunk of `x` (or number of jobs) that can resolve at some future point in time from the point of view of the host process. If a worker is processing more than one future, do the first ones resolve before the worker is done with all of its work?

If I'm getting this at all right, `future.scheduling` controls when the results of the jobs resolve from the perspective of the host process?
With this issue, I'm really looking for a way for new workers to spring up when old ones finish their work, while limiting the total number that exist at any one time. This is kind of like the behavior of `parLapplyLB`. With `length(x) == 5000`, I would want `future_lapply` to ask `batchtools` to first create 100 workers for `x[1:100]`, then create the 101st worker once the first one finishes, and so on. Is this what `future.scheduling = FALSE` does when using `batchtools`?
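For comparison, a minimal `parLapplyLB()` sketch with base R's `parallel` package; `chunk.size = 1` makes the one-element-at-a-time dispatch explicit, and the sleep is just a stand-in for uneven job lengths:

```r
library(parallel)

cl <- makeCluster(4)
x <- 1:100

# Each element goes to the next idle worker, so short jobs free a
# worker up for the remaining elements (dynamic load balancing).
res <- parLapplyLB(cl, x, function(i) {
  Sys.sleep(runif(1) / 10)
  i^2
}, chunk.size = 1)

stopCluster(cl)
```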
I've had a chance to look a bit more into this. It turns out that, due to the bug I mention in https://github.com/HenrikBengtsson/future.batchtools/issues/18#issuecomment-346968432, `future_lapply(x, FUN, future.scheduling = FALSE)` will not acknowledge `n` in `plan(batchtools_nnn, workers = n)`. I'll fix that.
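Once fixed, the intended combination would presumably look something like the following sketch (not a confirmed API; `batchtools_local` is just the simplest backend to illustrate with):

```r
library(future)
library(future.batchtools)
library(future.apply)

plan(batchtools_local, workers = 2)

# One future (job) per element, but -- with the fix -- never more
# than 2 jobs submitted/running at any one time.
res <- future_lapply(1:10, sqrt, future.scheduling = FALSE)
```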
I haven't forgotten about this one; this is still a bug in future.batchtools, cf. issue #19.
This was actually resolved when issue #19 was fixed in the v0.11.0 [2022-12-13] release.
I have a case where I want to submit a large number of jobs but limit the number running at once. I/O ends up being the limiting factor, so there's no apparent benefit (and possibly some cost) to going from 100 to 200 jobs at a time.
In `batchtools`, it looks like this is achieved at the registry level, and `future.batchtools` seems to create a single registry per future, so `future.batchtools` might have to implement its own limiter?
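If I'm reading the batchtools docs right, the registry-level knob is `max.concurrent.jobs`, set in a `batchtools.conf.R` configuration file and honored by `submitJobs()`; a sketch, assuming a Slurm template:

```r
# batchtools.conf.R -- read when a registry is created
cluster.functions <- batchtools::makeClusterFunctionsSlurm(template = "slurm.tmpl")

# submitJobs() waits until fewer than this many jobs are queued/running
max.concurrent.jobs <- 100L
```

Since future.batchtools creates a fresh registry (with a single job) per future, a per-registry cap like this would not limit the total across futures, which is why a limiter on the future.batchtools side seems necessary.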