DavisVaughan / furrr

Apply Mapping Functions in Parallel using Futures
https://furrr.futureverse.org/

resetting nms in multi_resolve when a subset of jobs fail #64

Closed yonicd closed 4 years ago

yonicd commented 5 years ago

When some jobs fail, the length of the results vector no longer matches the length of nms in multi_resolve, which causes the wrap-up to fail.

Would it be possible to reset nms when the lengths mismatch, so that the results of the successful jobs can still be returned?

https://github.com/DavisVaughan/furrr/blob/b4ad6addda5cb2fab4baf51434bdacf7a75a94ad/R/resolve.R#L18
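
To make the request concrete, here is a minimal base R sketch of the kind of guard being asked for. It is not furrr's actual code: `set_names_if_compatible()` is a hypothetical helper, and the assumption is simply that names are applied only when the lengths agree and dropped otherwise.

```r
# Hypothetical helper, not part of furrr: apply names only when lengths agree.
set_names_if_compatible <- function(values, nms) {
  if (!is.null(nms) && length(nms) == length(values)) {
    names(values) <- nms
  } else {
    # Length mismatch (e.g. some jobs failed): drop names instead of erroring.
    names(values) <- NULL
  }
  values
}

nms <- c("a", "b", "c")
full <- set_names_if_compatible(list(1, 2, 3), nms)   # all jobs returned
partial <- set_names_if_compatible(list(1, 3), nms)   # one job missing
```

The trade-off is that the partial result comes back unnamed, so the caller can no longer tell which jobs succeeded.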

DavisVaughan commented 5 years ago

It might be easier to actually return something sensible on failure so values has the same length as nms. Do you have an example for me to test with?
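
One way to read "return something sensible on failure" is to wrap the mapped function so that an R error becomes a placeholder value. The sketch below uses base R only, with `lapply()` standing in for `future_map()`; `with_fallback()` and `risky()` are hypothetical names introduced for illustration. It handles ordinary R errors only and would not help when the worker process itself dies.

```r
# Hypothetical wrapper: convert an R error into a placeholder value.
with_fallback <- function(.f, otherwise = NULL) {
  function(...) {
    tryCatch(.f(...), error = function(e) otherwise)
  }
}

# Hypothetical job that fails for one input.
risky <- function(x) {
  if (x == 2L) stop("job failed")
  x * 10
}

nms <- c("a", "b", "c")
values <- lapply(1:3, with_fallback(risky, otherwise = NA_real_))
names(values) <- nms  # lengths match, so naming succeeds
```

purrr's `possibly()` offers a similar wrapper, so the same idea should carry over to a `future_map()` call.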

DavisVaughan commented 4 years ago

I'm not entirely sure how to reproduce an example of a remote worker crashing, but I can crash a local multisession worker with library(trump). It seems like future will error when a worker crashes completely.

I think in the future there will be a future API for restarts, but until then I'm not sure there is much I can do.

library(furrr)
#> Loading required package: future
#> Warning: package 'future' was built under R version 4.0.2

plan(multisession, workers = 2)

future_map(1:5, ~{
  if (.x == 1L) {
    library(trump) # <- will crash the worker
  } else {
    .x
  }
})
#> Error in unserialize(node$con): Failed to retrieve the value of MultisessionFuture (<none>) from cluster RichSOCKnode #1 (PID 73952 on localhost 'localhost'). The reason reported was 'error reading from connection'. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive.

plan(sequential)

Created on 2020-08-06 by the reprex package (v0.3.0)

yonicd commented 4 years ago

Yeah. I think the main difference between remote and multisession workers is that with remote workers the user could benefit from retrieving the results from the workers that did not crash (a crash is usually due to swamping the memory of one of the worker machines), whereas with multisession everything runs on the same machine (at least in your example). In the remote case it is annoying to lose all the results from the stable machines just because one machine is problematic, if that makes sense.
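
As a rough illustration of keeping the results from healthy jobs, the sketch below bypasses furrr and resolves each future individually with the future package, catching failures one at a time. It is not something furrr does today. Assumptions: `plan(sequential)` stands in for a remote cluster, an ordinary `stop()` stands in for a crash, and `collect_one()` is a hypothetical helper. A genuinely crashed worker is expected to surface as an error from `value()` as well, but that is untested here.

```r
library(future)

plan(sequential)  # stand-in; the thread concerns remote / multisession workers

# One future per job; job 2 signals an ordinary R error.
jobs <- lapply(1:3, function(i) {
  future({
    if (i == 2L) stop("job failed")
    i * 10
  })
})

# Hypothetical collector: keep whatever can be retrieved, NULL otherwise.
collect_one <- function(f) {
  tryCatch(value(f), error = function(e) NULL)
}

results <- lapply(jobs, collect_one)
ok <- !vapply(results, is.null, logical(1))
successful <- results[ok]  # results from the jobs that did not fail
```

Creating one future per element gives up furrr's chunking, so this is only reasonable when the individual jobs are expensive enough to justify the overhead.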