Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

Handle worker abort better #249

Open · mtmorgan opened this issue 1 year ago

mtmorgan commented 1 year ago

When a worker aborts (e.g., because it ran out of memory), the result is an error from BiocParallel internals rather than an error the user can understand.

> bplapply(1:2, \(...) q())
Error in reducer$value.cache[[as.character(idx)]] <- values :
  wrong args for environment subassignment
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result

HenrikBengtsson commented 1 year ago

To add some ideas: when this happens, I run a post-mortem analysis and include the findings in the error message, e.g.

> library(future)
> plan(multicore)
> f <- future({ tools::pskill(Sys.getpid()) })
> value(f)
Error: Failed to retrieve the result of MulticoreFuture (<none>) from the forked
worker (on localhost; PID 118742). Post-mortem diagnostic: No process exists
with this PID, i.e. the forked localhost worker is no longer alive
In addition: Warning message:
In mccollect(jobs = jobs, wait = TRUE) :
  1 parallel job did not deliver a result

and

> library(future)
> plan(multisession)
> f <- future(tools::pskill(Sys.getpid()))
> value(f)
Error in unserialize(node$con) : 
  MultisessionFuture (<none>) failed to receive results from cluster
RichSOCKnode #1 (PID 119302 on localhost 'localhost'). The reason
reported was 'error reading from connection'. Post-mortem diagnostic: No
process exists with this PID, i.e. the localhost worker is no longer alive
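
For BiocParallel, a minimal post-mortem check along these lines could be as simple as testing whether the worker PID still exists once a result fails to arrive. A rough sketch (assuming a POSIX system; both helper names are hypothetical and not part of any existing API):

.worker_alive <- function(pid) {
    ## On POSIX systems, signal 0 performs error checking only, so this
    ## returns TRUE exactly when a process with this PID still exists.
    isTRUE(tools::pskill(pid, 0L))
}

.check_worker <- function(pid) {
    ## Hypothetical use at result-collection time: if the worker has
    ## vanished, raise an error the user can act on instead of an
    ## internal subassignment error.
    if (!.worker_alive(pid))
        stop("worker (PID ", pid, ") is no longer alive; it may have been ",
             "killed, e.g., by the operating system's out-of-memory killer",
             call. = FALSE)
}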

In some cases, we can give more clues, for example when a non-exportable object may be in play:

> library(future)
> plan(multisession)
> library(XML)
> doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))
> a <- getNodeSet(doc, "/doc//a[@status]")[[1]]
> f <- future(xmlGetAttr(a, "status"))
> value(f)
Error in unserialize(node$con) :
  MultisessionFuture (<none>) failed to receive results from cluster
RichSOCKnode #1 (PID 31541 on localhost 'localhost'). The reason
reported was 'error reading from connection'. Post-mortem diagnostic:
No process exists with this PID, i.e. the localhost worker is no
longer alive. Detected a non-exportable reference ('externalptr' of
class 'XMLInternalElementNode') in one of the globals ('a' of class
'XMLInternalElementNode') used in the future expression. The total
size of the 1 globals exported is 520 bytes. There is one global: 'a'
(520 bytes of class 'externalptr')

That exported non-exportable XML object causes XML to segfault the parallel worker, cf. https://future.futureverse.org/articles/future-4-non-exportable-objects.html#package-xml.
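
As the vignette above describes, future can also be asked to scan exported globals for such references up front via the future.globals.onReference option, so the failure happens before the worker is ever touched. A sketch of how a user might opt in (the exact error text may differ between versions):

library(future)
plan(multisession)

## Make future() scan globals for external pointers and other references
## before exporting them; "error" fails fast instead of letting the
## worker segfault later.
options(future.globals.onReference = "error")

library(XML)
doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))
a <- getNodeSet(doc, "/doc//a[@status]")[[1]]

## Expected to stop here with a "Detected a non-exportable reference ..."
## error at creation time, rather than when value(f) is called.
f <- future(xmlGetAttr(a, "status"))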

I have found that these types of error messages help users help themselves, and they also save me a lot of time when someone reaches out for help.