Closed: mlell closed this issue 6 years ago
Thanks for this detailed report. It looks like a bug in the future framework (where a specific type of error condition is not handled correctly). What do you get if you call traceback() immediately after you get that error message?
15: stop(condition)
14: resignalCondition(future)
13: value.Future(future)
12: value(future)
11: eval(quote({
        value <- value(future)
        rm(list = future_name, envir = assign.env)
        value
    }), new.env())
10: eval(quote({
        value <- value(future)
        rm(list = future_name, envir = assign.env)
        value
    }), new.env())
9: eval(expr, p)
8: eval(expr, p)
7: eval.parent(substitute(eval(quote(expr), envir)))
6: local({
       value <- value(future)
       rm(list = future_name, envir = assign.env)
       value
   })
5: mget(vars[ok], envir = x, inherits = FALSE)
4: as.list.listenv(x)
3: as.list(x)
2: Reduce(rbind, l) at #16
1: twolevel_implicit(3, 4)
Using debug(future:::signalCondition), the error is thrown at signalEarly.R:61.
Thank you; I can reproduce this. The "bad error message" error is due to a malformed FutureError whose message is empty (it must be a character vector of length one):
> trace(future:::resignalCondition, at = 6L, tracer = quote(str(condition)))
> dwarmup <- twolevel_implicit(3,4)
Tracing resignalCondition(future) step 6
List of 2
$ message: chr(0)
$ call : NULL
- attr(*, "class")= chr [1:5] "FutureError" "simpleError" "error" "FutureCondition" ...
- attr(*, "future")=Classes 'MulticoreFuture', 'MultiprocessFuture', 'Future', 'environment' <environment: 0x3d96928>
Error in stop(condition) : bad error message
Why this happens I still have to figure out, but I've got what I need to troubleshoot it. After fixing this one, there will probably be a more informative error explaining why the nested multicore processing fails here.
Short story of the non-informative "bad error message": there was a mistake in the construction of a FutureError condition where one of the arguments to sprintf() was NULL (also by mistake), resulting in a zero-length message, which is invalid in R.
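To illustrate the mechanism (a minimal sketch of base R behavior, not the actual future code): sprintf() recycles over its arguments, so a zero-length argument such as NULL produces a zero-length result, and a condition built from it carries an invalid message.

```r
# A zero-length sprintf() argument (e.g. NULL) yields a zero-length result:
msg <- sprintf("Future failed: %s", NULL)
length(msg)  # 0, i.e. character(0)

# A condition built from such a message is malformed: its message slot
# is chr(0) instead of a length-one character vector.
cond <- simpleError(msg)
str(conditionMessage(cond))  # chr(0)
# Signaling it with stop(cond) is what triggers "bad error message".
```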
Having fixed this in the develop branch, I can now reproduce the underlying problem while getting a more informative error message:
library(future)
plan(list(tweak(multicore, workers = 2), tweak(multicore, workers = 2)))
fs <- lapply(1:2, FUN = function(i) {
  future({
    f <- future({ Sys.sleep(1); i })
    value(f)
  })
})
v <- values(fs)
This gives:
Error: Invalid usage of futures: A future whose value has not yet been collected can only
be queried by the R process (34813549-aeca-e2a0-4085-84086fa23b13; pid 16532 on hb-x1)
that created it, not by any other R processes (44b29749-07c0-3392-9a19-b6961534cfeb;
pid 16570 on hb-x1): {; f <- future({; Sys.sleep(1); i; }); value(f); }
My best guess for now is that this is due to an oversight by me when it comes to forked processes (i.e. when using multicore). More precisely, the forked child process inherits an internal registry of active futures from the parent, which it does not own but still tries to clean up / resolve. This can only happen when forked processes (multicore) are in use. It should only occur when using two or more consecutive tweak(multicore, workers = two_or_more) when setting up the plan(). For instance, it will not be a problem when using any of the following:
plan(list(tweak(multicore, workers = 2), sequential, tweak(multicore, workers = 2)))
plan(list(tweak(multicore, workers = 2), tweak(multisession, workers = 2)))
As mentioned in the updated vignette (and in my SO answer), I don't recommend forcing nested parallel processing on the same machine this way: "This can be achieved by forcing a fixed number of workers at each layer (not recommended)."
Having said all this, this is still a bug in the future framework for MulticoreFuture:s that needs to be fixed. It's not a quick fix, but it's also not a major one. I hope it will be fixed in the next release, together with a few other things.
Thanks again for reporting.
This has been fixed in the develop branch (commit c9696a4) and, as usual for all bugs, I've added a package regression test for this case. You can try it with:
remotes::install_github("HenrikBengtsson/future@develop")
I'm closing, but please feel free to reopen if needed.
PS. I like your graphical presentation of the parallel load. FYI, there is a plan for automatically gathering (start, stop) times and possibly other stats too (Issue #59), and since future (>= 1.8.0) this is basically just a matter of deciding the public API for accessing such info.
PPS. The snow package (now basically deprecated) has some built-in "snow-timing" gathering/plotting, but unfortunately they dropped that when they moved its code into the parallel package.
@HenrikBengtsson Thank you very much for working this out!
I would, however, like to comment on your updated vignette on that topic:
This behavior is due to the built-in protection against nested parallelism. If both layers would run in parallel, each using the 8 cores available on the machine, we would be running 8 * 8 = 64 parallel processes - that would for sure overload our computer.
I want to show you my use case to illustrate that this can be a useful thing and should, IMO, not be treated as an exotic corner case. I'm currently working on a machine with 2 TB RAM and 60 physical CPUs. I suspect that among people who are interested in parallel computing, setups like these are not uncommon. Therefore I feel that the assertion "64 parallel processes will surely overload your computer" is a bit over-cautious. Also, CPUs nowadays have overheat protection; the system throttles them in response to temperature sensors. So I do not think that you can brick your machine without tweaking processor internals.
I have to perform 1000 cross-validation rounds for each of 10 subgroups of my data. Each round involves 1500 iterations of a Gibbs sampler. In "productive" operation, this means that I need only one level of parallelism, because the CPUs are busy enough performing the 1000 cross-validations of one subgroup. However, as this takes about a day, I want to make sure that there are no errors downstream which might make me lose my results halfway through. Therefore, I usually test the pipeline in advance with toy settings, e.g. 3 cross-validation rounds with 20 Gibbs samples. Your package is awesome for switching between these cases, provided it handles multi-level parallelism well, because I can use
plan(transparent),
plan(list(multicore, multicore)) to run the few toy samples of all data subgroups in parallel,
plan(list(sequential, multicore)),
or 2-level parallelism with a restricted number of workers. All this is possible without deep changes to the code logic because of future, if 2-level parallelism works! It apparently does now; I still have to test it, so thank you very much!
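The plan switching described in this use case could be sketched as follows (a hypothetical sketch; the flag name toy_run and the worker counts are assumptions, not the reporter's actual code):

```r
library(future)

toy_run <- TRUE  # assumed flag: TRUE for quick pipeline tests, FALSE for production

if (toy_run) {
  # toy settings: parallelize over both subgroups and the few CV rounds
  plan(list(tweak(multicore, workers = 2), tweak(multicore, workers = 2)))
} else {
  # production: subgroups sequential, all cores busy with the 1000 CV rounds
  plan(list(sequential, multicore))
}
```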
In this Stack Overflow answer, you explained how to achieve multi-layer parallelism:
Full 2-level parallelism is possible if using
The problem is that this apparently causes an error with a strange error message, which itself causes another error during handling:
MCVE:
"Workhorse" function (in real world)
Runs for a random amount of time and returns a 1-row data.frame with columns start and stop.
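A minimal sketch of such a workhorse function (a hypothetical reconstruction for illustration; the name workhorse and the sleep range are assumptions, not the reporter's code):

```r
# Hypothetical reconstruction: runs for a random amount of time and
# returns a 1-row data.frame with columns start and stop
workhorse <- function() {
  start <- Sys.time()
  Sys.sleep(runif(1, min = 0.05, max = 0.2))
  stop <- Sys.time()
  data.frame(start = start, stop = stop)
}

res <- workhorse()
```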
Plotting functions
All functions expect a data.frame with columns iL1, iL2, start, stop.
Functions which execute futures
Two variants: implicit and explicit futures.
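For reference, the two variants could look like this (assumed shapes, not the reporter's exact code):

```r
library(future)
plan(multicore)  # falls back to sequential where forking is unsupported

# explicit futures: create with future(), collect with value()
f <- future({ Sys.getpid() })
v_explicit <- value(f)

# implicit futures: the %<-% assignment operator creates a future
# whose value is collected on first access of the variable
v_implicit %<-% { Sys.getpid() }
```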
Tests
The suggested solution from the Stack Overflow answer works.
Also, using a single-level multicore plan works, even when tweaking the number of workers.
However, 2-level multicore plans throw an error:
Session Info