JohannesBuchner / PyMultiNest

Pythonic Bayesian inference and visualization for the MultiNest Nested Sampling Algorithm and PyCuba's cubature algorithms.
http://johannesbuchner.github.io/PyMultiNest/

signal handler in run.py #178

Closed segasai closed 3 years ago

segasai commented 3 years ago

Hi,

While trying to run multiple instances of pymultinest with dask, I noticed that, because dask executes tasks in worker threads, it is unfortunately incompatible with the signal handling done by pymultinest (leading to the "ValueError: signal only works in main thread" error). I then looked into whether the signal.signal() handlers in pymultinest can be safely commented out.
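(For reference, a minimal sketch, independent of dask and pymultinest, that reproduces the restriction: CPython refuses to install a signal handler from any thread other than the main one.)

import signal
import threading

def register_handler():
    try:
        # essentially what pymultinest's run() does unconditionally
        signal.signal(signal.SIGINT, lambda sig, frame: None)
    except ValueError as e:
        # "ValueError: signal only works in main thread"
        print('registration failed:', e)

t = threading.Thread(target=register_handler)
t.start()
t.join()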

And I think at this point the handlers don't actually do anything. I.e. if I add time.sleep(5) to the pymultinest_demo.py likelihood to emulate a long-running likelihood, then a Ctrl-C interruption produces the following exception

Exception ignored on calling ctypes callback function: <function run.<locals>.loglike at 0x7ff3d1b68d30>
Traceback (most recent call last):
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 260, in loglike
    return LogLikelihood(cube, ndim, nparams, lnew)
  File "/home/skoposov/soft/PyMultiNest/pymultinest/solve.py", line 56, in SafeLoglikelihood
    l = float(LogLikelihood(a))
  File "pymultinest_demo.py", line 21, in myloglike
    time.sleep(5)
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 97, in interrupt_handler
    sys.exit(1)
SystemExit: 1

but it will NOT interrupt the processing. If I comment out the signal.signal() calls completely, I get basically the same behaviour:

KeyboardInterrupt: 
^CException ignored on calling ctypes callback function: <function run.<locals>.loglike at 0x7f2564dd2d30>
Traceback (most recent call last):
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 260, in loglike
    return LogLikelihood(cube, ndim, nparams, lnew)
  File "/home/skoposov/soft/PyMultiNest/pymultinest/solve.py", line 56, in SafeLoglikelihood
    l = float(LogLikelihood(a))
  File "pymultinest_demo.py", line 21, in myloglike
    time.sleep(5)

So it seems to me that the signal handler is not useful. I don't quite know whether it is even possible to make the whole thing interruptible; https://stackoverflow.com/questions/14707049/allowing-ctrl-c-to-interrupt-a-python-c-extension didn't enlighten me.
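(A minimal sketch of the underlying behaviour, using libc's qsort instead of MultiNest: an exception raised inside a ctypes callback is only reported as "Exception ignored on calling ctypes callback function", the callback returns 0 to the C caller, and the C code just carries on.)

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library('c'))  # Linux/macOS

@ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p, ctypes.c_void_p)
def compare(a, b):
    # like sys.exit(1) in interrupt_handler: the exception cannot
    # propagate through the C stack frame and is swallowed by ctypes
    raise SystemExit(1)

values = (ctypes.c_int * 4)(3, 1, 4, 2)
libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int), compare)
print('qsort returned despite the exception:', list(values))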

JohannesBuchner commented 3 years ago

These were added because interrupts can also occur while the Fortran code is executing.

segasai commented 3 years ago

I've just tested a case with 10000 live points and 1000 dimensions, to force spending a lot of time inside the Fortran code, and tried to interrupt it. I'm still getting the same result as before:

 Exception ignored on calling ctypes callback function: <function run.<locals>.loglike at 0x7f295fe29d30>
Traceback (most recent call last):
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 257, in loglike
    def loglike(cube, ndim, nparams, lnew, nullcontext):
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 97, in interrupt_handler
    sys.exit(1)
SystemExit: 1

And again, nothing is actually interrupted...

JohannesBuchner commented 3 years ago

try MPI parallelisation instead of dask

JohannesBuchner commented 3 years ago

I guess dask runs pymultinest.run in threads/processes while your main program remains the master program. dask would then send wrapped exceptions from the threads back to the main program. But Python exceptions raised in the callback functions cannot propagate up through the Fortran code.

segasai commented 3 years ago

> try MPI parallelisation instead of dask

Thank you, I am aware of MPI but it is simply the wrong tool when trying to do a very large number of independent inference problems.

JohannesBuchner commented 3 years ago

I typically launch a new process for each data set, with make or xargs. On clusters you can use the batch system.

Perhaps a workaround is to exec a new process in each dask thread?
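Roughly like this (a sketch; fit_one.py is a hypothetical script that runs a single pymultinest fit):

import subprocess
import sys

def run_one(i):
    # each dask task launches a fresh interpreter, so pymultinest
    # runs in that process's main thread with working signal handling
    subprocess.run([sys.executable, 'fit_one.py', str(i)], check=True)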

If you need a python solution, ultranest should be able to handle this better.

segasai commented 3 years ago

> I guess dask runs pymultinest.run in threads/processes while your main program remains the master program. dask would then send wrapped exceptions from the threads back to the main program. But Python exceptions raised in the callback functions cannot propagate up through the Fortran code.

Dask is simply a launcher for a large number of heterogeneous jobs. But the "exception wrapping" is not relevant here.

Right now pymultinest will simply refuse to run inside a thread in any environment (dask or not), because setting a signal handler is not allowed outside the main thread.

Since it seems to me that the handler is a complete no-op (at least on the tests/systems I checked), I think it can be removed. But it is possible that I am missing something.

If you are not convinced / don't want to change anything, that is fine by me; I will just use a locally modified version.

JohannesBuchner commented 3 years ago

Ah, I see what you mean now. There must have been a change in ctypes's behaviour.

PRs are always welcome.

It would be good if we can stop the process somehow.

JohannesBuchner commented 3 years ago

Here is a potential solution by changing the interrupt_handler:

def interrupt_handler(recvsignal, frame):
    sys.stderr.write('ERROR: Interrupt received: Terminating\n')
    # send SIGTERM to ourselves: its default action terminates the
    # whole process, instead of raising a Python exception that
    # ctypes would swallow inside the Fortran call
    os.kill(os.getpid(), signal.SIGTERM)

JohannesBuchner commented 3 years ago

What does the error look like for setting signal handlers in dask threads?

Is there any reason you want threads instead of processes with dask (it supports several backends afaik)?

segasai commented 3 years ago

Thanks for taking a look at the root of the issue.

import time
from numpy import pi, cos
from dask.distributed import Client
from pymultinest.solve import solve

def myprior(cube):
    return cube * 10 * pi

def myloglike(cube):
    # emulate a long-running likelihood
    time.sleep(5)
    chi = (cos(cube * 2.) + 1).sum()
    return (2. + chi)

def func(a):
    # number of dimensions our problem has
    parameters = ["x", "y"]
    n_params = 10000  # len(parameters)
    # name of the output files
    prefix = "chains/3-"

    # run MultiNest
    result = solve(LogLikelihood=myloglike,
                   Prior=myprior,
                   n_live_points=100000,
                   n_dims=n_params,
                   outputfiles_basename=prefix,
                   verbose=True)

if __name__ == '__main__':
    C = Client()
    for i in range(10):
        C.submit(func, i)

and this simple configuration

dask-scheduler --scheduler-file=sched.json & dask-worker --scheduler-file=sched.json &

JohannesBuchner commented 3 years ago

Seems like Client is implicitly creating a LocalCluster with processes=False by default. Try processes=True? https://docs.dask.org/en/latest/setup/single-distributed.html
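i.e. something along these lines (a sketch of the suggestion, not a verified fix; the keyword arguments are forwarded to the implicitly created LocalCluster):

from dask.distributed import Client

# one single-threaded worker process per task slot
client = Client(processes=True, n_workers=4, threads_per_worker=1)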

segasai commented 3 years ago

Not in my case; I'm providing a scheduler file. In fact, in practice I'm starting dask-workers on tens of nodes of a cluster. And even if I start them with

$ dask-worker --scheduler-file=sched.json --nthreads=1 

the tasks still run inside a thread, because dask-worker uses futures.ThreadPoolExecutor: https://github.com/dask/distributed/blob/a0d60e7b2ad16304128c7989a111539a4d80ed38/distributed/worker.py#L631

segasai commented 3 years ago

One option to fix my issue, which seems to work, is to put the signal.signal() call inside if threading.active_count() == 1: (that still keeps signal.signal as a no-op, though).
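Roughly (a sketch of that guard; a more direct check would be threading.current_thread() is threading.main_thread()):

import os
import signal
import sys
import threading

def interrupt_handler(recvsignal, frame):
    sys.stderr.write('ERROR: Interrupt received: Terminating\n')
    os.kill(os.getpid(), signal.SIGTERM)

# only register the handler when no other threads are running, which
# implies we are in the main thread; in a dask worker thread this is
# skipped instead of raising "ValueError: signal only works in main thread"
if threading.active_count() == 1:
    signal.signal(signal.SIGINT, interrupt_handler)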

JohannesBuchner commented 3 years ago

Would putting C.shutdown() into the signal handler work?

https://stackoverflow.com/questions/50919227/is-it-possible-to-shutdown-a-dask-distributed-cluster-given-a-client-instance

segasai commented 3 years ago

I don't think it solves anything. IMO there are multiple issues/questions here that potentially require solving.

JohannesBuchner commented 3 years ago

I pushed what I mentioned in https://github.com/JohannesBuchner/PyMultiNest/issues/178#issuecomment-773948089 as a commit, which addresses the second point.

segasai commented 3 years ago

I've created a PR #179 to not use signal handlers in threaded environments

segasai commented 3 years ago

Thanks for merging the PR!