HenrikBengtsson / future.callr

R package future.callr: A Future API for Parallel Processing using 'callr'
https://future.callr.futureverse.org

Is it possible to recover when the worker process segfaults? #11

Open pfernique opened 4 years ago

pfernique commented 4 years ago

Hi,

I'm trying to launch some processes that can sometimes throw a segfault (and this can't be predicted or modified, since I don't have the source code). This MWE behaves as I want with the future::multiprocess plan (i.e., the last line returns 40).

future::plan(future::multiprocess,
             workers = 2)

segfault <- function() {
  system('kill -11 $PPID')
}

processes <- c()

for (i in seq(1, 40)) {
  if (i == 20) {
    processes <- c(processes,
                   withCallingHandlers(
                     future::future({ segfault() }, lazy = TRUE, earlySignal = FALSE),
                     error = function(...) {}))
  } else {
    processes <- c(processes, future::future({ Sys.sleep(2); i }, lazy = TRUE, earlySignal=FALSE))
  }
}

future::resolve(processes)
future::value(processes[[40]])

But this MWE does not behave as I want with the future.callr::callr plan (i.e., it throws Error in readRDS(res) : error reading from connection).

future::plan(future.callr::callr,
             workers = 2)

segfault <- function() {
  system('kill -11 $PPID')
}

processes <- c()

for (i in seq(1, 40)) {
  if (i == 20) {
    processes <- c(processes,
                   withCallingHandlers(
                     future::future({ segfault() }, lazy = TRUE, earlySignal = FALSE),
                     error = function(...) {}))
  } else {
    processes <- c(processes, future::future({ Sys.sleep(2); i }, lazy = TRUE, earlySignal=FALSE))
  }
}

future::resolve(processes)
future::value(processes[[40]])

Is this normal, or do you have any idea why? Note that I'm using future v1.15.1 and future.callr v0.5.0.

pfernique commented 4 years ago

I tried with future::multicore and it works fine, but with future::multisession I have a similar problem (Error in unserialize(node$con) : error reading from connection).

HenrikBengtsson commented 4 years ago

> Is it normal or do you have any idea why?

Yes. There's lots of exception handling done in the future framework, and some of it is even recoverable, but kicking workers this far off the track is not automatically taken care of.

Before anything else, use multicore or multisession explicitly. The multiprocess plan is just an alias for one of them depending on your operating system, and I'm going to phase it out because it is ambiguous (e.g. I don't know what OS you're running here, but reading between the lines in your error reports, it sounds like you're running on MS Windows).

In the multisession case, we run PSOCK background workers (as defined by the parallel package) that communicate over a socket connection. If you kill a background worker, the communication with the main R session is likely to become corrupted. With future.callr::callr, which is handled by the callr package, you get similar errors because callr communicates via the file system, and a half-written file is corrupt. In the multicore case, workers are forked processes. Knocking those offline will confuse the main R process because it can no longer find a way to communicate with its child process; the symptom will be something like the message "An irrecoverable exception occurred. R is aborting now ..." from the forked process. On MS Windows, multicore equals sequential, which means the above example will kill the main R session.

In summary, what you're asking for is not part of the current future backend design. Supporting it in general would require lots of work. Even if it's on the long-term roadmap, several things need to be in place before it can be attacked, and I doubt one can cover cases such as sequential. Before that happens, it is more likely that someone develops a future backend that can handle severe corruption like this. Indeed, it might be that the batchtools package supports it, e.g. try with plan(future.batchtools::batchtools_local).
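In the meantime, a defensive pattern on the caller side can at least keep one corrupted worker from aborting the whole collection step. This is only a sketch: read_result() below is a hypothetical stand-in for future::value(), since the actual segfault can't be reproduced safely here.

```r
# Hypothetical stand-in for future::value(): worker 20 "segfaults",
# i.e. reading its result throws an error.
read_result <- function(i) {
  if (i == 20) stop("error reading from connection")
  i
}

# Collect defensively: map the failing worker to NA instead of stopping.
values <- lapply(seq_len(40), function(i) {
  tryCatch(read_result(i), error = function(e) NA)
})

values[[40]]  # 40 -- only values[[20]] is NA
```

Whether this helps in practice depends on the backend: as noted above, with multicore or sequential the kill can take down more than just the one result.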

pfernique commented 4 years ago

Thanks for your reply! I'm on Windows Subsystem for Linux (which behaves as Linux). I was glad to find that one backend could recover from segfaults; I was just surprised that it wasn't the future.callr::callr backend. Since callr communicates via the file system, I thought handling segfaults would be easier: I was using callr before the code was parallelized, and a segfault in the launched session was turned into an error in my current session (Error in readRDS(res) : error reading from connection), affecting only the segfaulting process, not the following ones.

future.batchtools::batchtools_local seems quite interesting, I will give it a try !

HenrikBengtsson commented 4 years ago

saveRDS() is not atomic, so if the worker is killed in the middle of a write it leaves behind a half-written file, which is what produces that readRDS() error.
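A minimal base-R illustration of that failure mode (the tempfile here only simulates a half-written result file; it is not what future.callr actually writes):

```r
# Simulate a worker killed mid-saveRDS(): truncate an RDS file and
# observe that readRDS() fails instead of returning the data.
f <- tempfile(fileext = ".rds")
saveRDS(1:100, f)

bytes <- readBin(f, what = "raw", n = file.size(f))
writeBin(bytes[seq_len(length(bytes) %/% 2)], f)  # keep only the first half

res <- tryCatch(readRDS(f), error = function(e) conditionMessage(e))
res  # an error message, not the integer vector
```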

pfernique commented 4 years ago

Yes, I have no problem understanding that. It's just that this seems to indicate that all processes use the same RDS file (otherwise I really don't see why an RDS file corrupted by one process would lead to corrupted RDS files for all the remaining processes), and I naively believed that a different RDS file would be used for each process.
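For what it's worth, a quick base-R check shows that with one file per result, a single corrupt file need not poison the others. The file layout below is purely illustrative, not how future.callr lays out its files:

```r
# Three stand-in result files, one per "process"; the second one is
# deliberately left half-written (only a gzip magic number).
files <- replicate(3, tempfile(fileext = ".rds"))
saveRDS("ok-1", files[[1]])
writeBin(as.raw(c(0x1f, 0x8b)), files[[2]])
saveRDS("ok-3", files[[3]])

# Only the corrupt file fails; the others read back fine.
results <- lapply(files, function(f)
  tryCatch(readRDS(f), error = function(e) "corrupt"))
unlist(results)  # "ok-1" "corrupt" "ok-3"
```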