BradGreig / Hybrid21CM

1 stars 3 forks source link

exit SIG from C causes hangs for multiprocessing #19

Closed steven-murray closed 5 years ago

steven-murray commented 5 years ago

This is an issue which was first encountered in https://github.com/BradGreig/Hybrid21CM/issues/18.

The basic issue is that when using multiprocessing.Pool, with nthreads > 1, if any subprocesses receive an exit signal (eg. SIGINT or SIGSEGV), then they don't return, and therefore cause the entire program to hang.

Our multiprocessing is confined to the MCMC walkers. The best idea I can think of would be such that if such a signal is raised in the underlying C code, then Python should catch it, and either raise an Exception itself (which gets properly propagated and causes the whole program to halt), or return a likelihood of -inf for that walker/iteration, to indicate that the probability of that combination of parameters is so low it should never be visited again.

I think the former is more technically correct, since the latter may hide errors that really are just bugs, and we should aim for that.

The problem is, there doesn't seem to be a reliable way to catch these signals in child processes in Python (at least, not SIGSEGV). There is some discussion of this in https://bugs.python.org/issue22393.

I think what this means, at least for the short term, is that we need to ensure there are no segfaults in C code in 21CMMC. If a given combination of parameters is known to cause a segfault or similar, we need to catch that, and have some agreed-on system of reporting that back to Python so that it can raise an exception or whatever.

In addition, because this will never be perfect, we should have some docs that suggest to users that if it seems to hang, they may need to run in single-thread mode and watch for any errors that may occur.

steven-murray commented 5 years ago

Found some useful info: https://stackoverflow.com/questions/24894682/python-multiprocessing-crash-in-subprocess

BradGreig commented 5 years ago

Yeah, at the end of last week I read a little bit about this, hence I speculated this might be the cause of the issue.

In its present state, provided sensible parameters are passed to the mcmc it shouldn't be segfaulting. I suspect issues are arising as larger ranges than should be acceptable are being used.

steven-murray commented 5 years ago

Yeah, I guess we need a way of warning users when a particular parameter combination is ridiculous. It's up to the user to set the ranges on parameters, and they could put in something stupid, or more likely, something subtly unfortunate.

We basically need to have a few ways of exiting from C and telling Python something: 1) I'm not even gonna try to do this because I know these parameters won't work (i.e. the python function should return -np.inf) 2) I'm exiting with an error which I'm unsure of, so Python should probably also exit and give some kind of stacktrace for the user/developer to go fix. 3) Return the correct answer.

As an interim solution to (2), I've just found that concurrent.futures.ProcessPoolExecutor implements the behaviour that allows the Python Pool to crash when the C does, rather than hanging. This is implemented as default in ff16ee20969fdaed83f116e94cdfe4db38223fd5. It's not the best solution, as it doesn't give any hint as to what went wrong, though with some appropriate docs we can at least notify the user that if it happens it is likely due to a segfault, and they should report it.

Thus we have to come up with a standard way of doing (1) and (2) -- i.e. exiting from C prematurely and ensuring that Python knows "why", and discerning at least between the kinds of errors that mean that nothing went wrong except for bad parameters, and the kinds of errors that indicate bugs.

BradGreig commented 5 years ago

I like the sounds of this solution, at least in the interim. Exiting rather than hanging is far more desirable behaviour. If a stacktrace or the exact parameter combination are able to be output with an error, that should allow the user to check that that parameter combination does in fact cause issues. And thus report the issue as an error to be fixed.

Knowing which parameters will cause which errors is somewhat problematic, as they can cause issues in a variety of different ways. Some are more trivial, and easy to implement, others less so. However, I should be able to take a stab at this.

In principle, exiting from bad parameter choices should be implementable from the Python side. So I don't think anything really needs to be added on the C side.

steven-murray commented 5 years ago

Commit bb6b3afd52ef2d708dd7f1173c368d7052e9da89 adds the ability to return exit codes from C.

Exit code 0 is for success, 1 is for "probably some unphysical selection of parameters" (i.e. the MCMC can probably keep going using -inf for that particular iteration), and 2 is for "fatal C error which we didn't really expect" (i.e. MCMC should probably crash and the user should report it).

Of course, very unexpected crashes will still cause Python to crash, according to the above solution, but this is to be avoided as much as possible because it will be very hard to tell why it crashed. Much better to print to stdout from C what the issue is and return 2, letting python raise the exception, and leave a useful stacktrace for the user.

Before closing this issue, I'd like to ensure that the C code actually does this -- i.e. in every known place that the C could exit/crash, that instead it catches the potential error and returns 2 (preferably after writing some kind of warning to stdout). @BradGreig could you give the code a comb-through for these instances, and then we can close?

BradGreig commented 5 years ago

Will do when I can.

steven-murray commented 5 years ago

Moved to referenced issue