futureverse / future.mirai

R package future.mirai: A Future API for Parallel Processing using 'mirai'
https://future.mirai.futureverse.org/

TESTS: Test for terminating a future process abruptly is unstable #7

Closed HenrikBengtsson closed 4 months ago

HenrikBengtsson commented 4 months ago

The test for terminating future processes abruptly:

https://github.com/HenrikBengtsson/future.mirai/blob/b957a5fbe85b8d57695ba7819f979910901f2008/tests/mirai_cluster%2Cworker-termination.R#L18-L27

seems to be unstable. I've spotted twice that some CRAN checks fail with:

 [04:54:10.709] run() for 'MiraiFuture' ... done
  <FutureError: Failed to retrieve results from MiraiFuture (<none>). The mirai framework reports on error value 19>

  Future UUID: <NA>
  Error: nbrOfWorkers() == all - 1L is not TRUE
  Execution halted

and then later on comes back okay.

My best guess is that it's a timing/race-condition issue, but I'm not sure.

shikokuchuo commented 4 months ago

Error 19 just means that you terminated the connection before the value had been retrieved (this is documented).

E.g.:

library(mirai)
daemons(1)
#> [1] 1
m <- mirai(Sys.sleep(5))
daemons(0)
#> [1] 0
m$data
#> 'errorValue' int 19 | Connection reset

Created on 2024-05-10 with reprex v2.1.0

As a status() call is just a query (when using dispatcher), it can also return an errorValue.

It's not important whether you handle this in a test or not, but this possibility should be handled if status() is being used by the package.

HenrikBengtsson commented 4 months ago

As a status() call is just a query (when using dispatcher), it can also return an errorValue.

Ah, it took me several iterations to grasp what you're trying to say here. If I understand you correctly, you're saying that the current implementations of nbrOfWorkers() and nbrOfFreeWorkers():

https://github.com/HenrikBengtsson/future.mirai/blob/aeeb70c7e5711ac20834853bbe8c00c55df37418/R/nbrOfWorkers.R#L4-L12

will return 19 when mirai::status()[["daemons"]] is an errorValue, which is of type integer. And that will cause nbrOfWorkers() == all - 1L to fail.

Turns out I was already checking for this with is_error_value() elsewhere, but not here.

shikokuchuo commented 4 months ago

Yes, exactly. The first test on the workers value should be is_error_value(); if it is an error value, you should just re-throw it (or wrap it in your own error message).

Sorry if I wasn't very clear - I'll make a note to myself to be more explicit next time!

Please continue to use is_error_value() for all these checks rather than relying on the class or type. I guarantee this API, not the internal structure of what an error value actually is.

The actual error value is only important for diagnostic purposes - 19 here. The most common would be errorValue 5 'timed out', which you will get when something has caused the dispatcher to crash. You can see this for yourself by manually killing the dispatcher process and then calling status(). Equally, we might get more exotic types (not as likely here, but they do occur in other places).
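For illustration, a minimal sketch of that check pattern (a hypothetical wrapper, not the actual future.mirai implementation; the exact shape of the daemons value depends on whether a dispatcher is used):

library(mirai)

nbr_of_workers_sketch <- function() {
  daemons <- status()$daemons
  if (is_error_value(daemons)) {
    # Re-throw with the diagnostic value preserved; a package could wrap
    # this in its own error class (e.g. a FutureError) instead
    stop("mirai::status() returned an error value: ", as.character(daemons))
  }
  # No error: 'daemons' describes the registered daemons
  NROW(daemons)
}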

HenrikBengtsson commented 4 months ago

Thank you. I've now updated nbrOfWorkers() et al. to detect the error and re-throw as a FutureError.

What remains for me to understand is why status() sometimes returns the correct daemons value and only sometimes an errorValue. Thus far, I have not been able to reproduce the errorValue myself, but it appears to occur occasionally on the CRAN check servers. Since my unit test does not output more info, I only know that stopifnot(nbrOfWorkers() == all - 1L) fails.

I understand that m$data will return errorValue 19 when the mirai process was terminated (using tools::pskill(pid = Sys.getpid()) in my unit test). But why would status() return an errorValue, and why not all the time? Also, if status() returns an errorValue once, will it be stuck in that state and keep returning an errorValue if I call it later on, or is it just temporary? If the latter, I'll probably have nbrOfWorkers() retry a few times before giving up and throwing the FutureError.

shikokuchuo commented 4 months ago

What remains for me to understand is why status() sometimes returns the correct daemons value and only sometimes an errorValue. Thus far, I have not been able to reproduce the errorValue myself, but it appears to occur occasionally on the CRAN check servers.

I'll have a look at your test next week, but I have not had any reports of these status() errors since the very early days of testing for crew - and actually never for errorValue 19. I hence tend not to put too much weight on what happens within R CMD check - I am fairly sure that this doesn't happen in real-world usage. To make changes just because of R CMD check is kind of like the tail wagging the dog (if you get the gist of what I mean).

Also, if status() returns an errorValue once, will it be stuck in that state and keep returning an errorValue if I call it later on, or is it just temporary? If the latter, I'll probably have nbrOfWorkers() retry a few times before giving up and throwing the FutureError.

In general, there should never be a need to 'retry'. If an error is returned by status(), then it is in a wrong state and there is likely no way to recover. The best course of action is then to reset daemons and start afresh.
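As a rough sketch of that recovery path (assuming the desired number of workers n is known to the caller):

library(mirai)

st <- status()
if (is_error_value(st$daemons)) {
  daemons(0)  # tear down the current daemons/dispatcher
  daemons(n)  # start afresh with the desired number of workers
}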

shikokuchuo commented 4 months ago

If I look at your test, it's currently quite difficult to identify the failure. If you want to pin this down, I suggest:

  1. Try calling pskill in the main process rather than in a daemon that kills itself (a rough sketch follows below).
  2. If you need to test pskill in a daemon that kills itself, then test for the exact expected return value rather than a 'FutureError'.
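A rough sketch of suggestion 1, assuming a single local daemon (the one-second sleep is an arbitrary pause to let the broken connection register, and the final status() call is just there to inspect the outcome):

library(mirai)

daemons(1)

# Resolve the daemon's PID in the main process, then kill the daemon from here
m <- mirai(Sys.getpid())
pid <- call_mirai(m)$data
tools::pskill(pid)

Sys.sleep(1)

# Inspect what status() reports after the daemon has been killed
status()$daemons

daemons(0)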

HenrikBengtsson commented 4 months ago

Thanks for these comments and clarifications.

The purpose of this test is to understand what happens when a parallel worker process terminates abruptly, and for futureverse to be able to detect this and give an informative error message, rather than hard-to-troubleshoot sudden deaths. We see processes terminating suddenly on shared HPC environments, where a job consumes more memory than is available on the host, or more than allocated by the job scheduler/cgroups, ... and other cases where the Out-Of-Memory (OOM) killer kicks in. Parallel R processes might also core dump/segfault because of bugs, null pointers, etc. A process might also terminate because a compute host shuts down or reboots for reasons unrelated to the job running.

The fact that the outcome of this test is not reproducible everywhere suggests I don't fully understand the different ways things can go wrong. BTW, the error on CRAN just disappeared, which means it's not reproducible there either. This means there will be users out there who will run into non-reproducible problems. Since we're at the center of it, I think it's better if we try to handle these errors as early as possible.

shikokuchuo commented 4 months ago

As I understand it, error 19 here means that the status was requested, and the daemon was killed before it had the chance to send back the reply. As the round trip for a status() request is really very small, there is a very low probability of getting such an error - hence the difficulty in reproducing it.

HenrikBengtsson commented 4 months ago

Thanks. I'm actually not sure that it's a 19 error that happens on CRAN, but I guess it's likely.

I'm hesitant to submit an update that risks producing another type of error here and then risks the package being archived on CRAN (argh...). So, for now, I'm just going to handle that nbrOfWorkers()/nbrOfFreeWorkers() can produce a FutureError; otherwise the test will assert that they return the expected value. I might relax this in a future release to see if I can squeeze out more information about the problem from the CRAN servers. Or maybe some user running into the problem will report it with more clues.