JohannesBuchner / PyMultiNest

Pythonic Bayesian inference and visualization for the MultiNest Nested Sampling Algorithm and PyCuba's cubature algorithms.
http://johannesbuchner.github.io/PyMultiNest/

signal handler in run.py #178

Closed segasai closed 3 years ago

segasai commented 3 years ago

Hi,

While trying to run multiple instances of pymultinest with dask, I noticed that, because dask executes tasks in worker threads, it is unfortunately incompatible with the signal handling done by pymultinest (leading to the "ValueError: signal only works in main thread" error). I then looked into whether the signal.signal() handlers in pymultinest can be safely commented out.
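(For reference, a minimal sketch, independent of dask and pymultinest, that reproduces the restriction: CPython refuses to install a signal handler from any thread other than the main one.)

import signal
import threading

def register_handler():
    try:
        # essentially what pymultinest's run() does unconditionally
        signal.signal(signal.SIGINT, lambda sig, frame: None)
    except ValueError as e:
        # "ValueError: signal only works in main thread"
        print('registration failed:', e)

t = threading.Thread(target=register_handler)
t.start()
t.join()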

And I think at this point the handlers don't actually do anything. I.e. if I add time.sleep(5) to the pymultinest_demo.py likelihood to emulate a long-running likelihood, then a Ctrl-C interruption produces the following exception

Exception ignored on calling ctypes callback function: <function run.<locals>.loglike at 0x7ff3d1b68d30>
Traceback (most recent call last):
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 260, in loglike
    return LogLikelihood(cube, ndim, nparams, lnew)
  File "/home/skoposov/soft/PyMultiNest/pymultinest/solve.py", line 56, in SafeLoglikelihood
    l = float(LogLikelihood(a))
  File "pymultinest_demo.py", line 21, in myloglike
    time.sleep(5)
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 97, in interrupt_handler
    sys.exit(1)
SystemExit: 1

but it will NOT interrupt the processing. If I comment out the signal.signal() calls completely, I get basically the same behaviour:

KeyboardInterrupt: 
^CException ignored on calling ctypes callback function: <function run.<locals>.loglike at 0x7f2564dd2d30>
Traceback (most recent call last):
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 260, in loglike
    return LogLikelihood(cube, ndim, nparams, lnew)
  File "/home/skoposov/soft/PyMultiNest/pymultinest/solve.py", line 56, in SafeLoglikelihood
    l = float(LogLikelihood(a))
  File "pymultinest_demo.py", line 21, in myloglike
    time.sleep(5)

So it seems to me that the signal handler is not useful. I don't quite know whether it is even possible to make the whole thing interruptible; https://stackoverflow.com/questions/14707049/allowing-ctrl-c-to-interrupt-a-python-c-extension didn't enlighten me.
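(A minimal sketch of the underlying behaviour, using libc's qsort instead of MultiNest: an exception raised inside a ctypes callback is only reported as "Exception ignored on calling ctypes callback function", the callback returns 0 to the C caller, and the C code just carries on.)

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library('c'))  # Linux/macOS

@ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p, ctypes.c_void_p)
def compare(a, b):
    # like sys.exit(1) in interrupt_handler: the exception cannot
    # propagate through the C stack frame and is swallowed by ctypes
    raise SystemExit(1)

values = (ctypes.c_int * 4)(3, 1, 4, 2)
libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int), compare)
print('qsort returned despite the exception:', list(values))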

JohannesBuchner commented 3 years ago

These were added because interrupts can also occur while the Fortran code is executing.

segasai commented 3 years ago

I've just tested a case with 10000 live points and 1000 dimensions, to force spending a lot of time inside the Fortran code, and tried to interrupt it. I'm still getting the same result as before:

 Exception ignored on calling ctypes callback function: <function run.<locals>.loglike at 0x7f295fe29d30>
Traceback (most recent call last):
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 257, in loglike
    def loglike(cube, ndim, nparams, lnew, nullcontext):
  File "/home/skoposov/soft/PyMultiNest/pymultinest/run.py", line 97, in interrupt_handler
    sys.exit(1)
SystemExit: 1

And again, nothing is actually interrupted...

JohannesBuchner commented 3 years ago

try MPI parallelisation instead of dask

JohannesBuchner commented 3 years ago

I guess dask runs pymultinest.run in threads/processes while your main program remains the master program. dask would then send wrapped exceptions from the threads back to the main program. But Python exceptions raised in the callback functions cannot propagate up through the Fortran code.

segasai commented 3 years ago

> try MPI parallelisation instead of dask

Thank you, I am aware of MPI but it is simply the wrong tool when trying to do a very large number of independent inference problems.

JohannesBuchner commented 3 years ago

I typically launch a new process for each data set, with make or xargs. On clusters you can use the batch system.

Perhaps a workaround is to exec a new process in each dask thread?
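Roughly like this (a sketch; fit_one.py is a hypothetical script that runs a single pymultinest fit):

import subprocess
import sys

def run_one(i):
    # each dask task launches a fresh interpreter, so pymultinest
    # runs in that process's main thread with working signal handling
    subprocess.run([sys.executable, 'fit_one.py', str(i)], check=True)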

If you need a python solution, ultranest should be able to handle this better.

segasai commented 3 years ago

> I guess dask runs pymultinest.run in threads/processes while your main program remains the master program. dask would then send wrapped exceptions from the threads back to the main program. But Python exceptions raised in the callback functions cannot propagate up through the Fortran code.

Dask is simply a launcher for a large number of heterogeneous jobs. But the "exception wrapping" is not relevant here.

Right now pymultinest will simply refuse to run inside a thread in any environment (dask or not), because setting a signal handler is not allowed outside the main thread.

Since it seems to me that the handler is a complete no-op (at least on the tests/systems I checked), I think it can be removed. But it is possible that I am missing something.

If you are not convinced / don't want to change anything, that is fine by me; I will just use a locally modified version.

JohannesBuchner commented 3 years ago

Ah, I see what you mean now. There must have been a change in ctypes's behaviour.

PRs are always welcome.

It would be good if we can stop the process somehow.

JohannesBuchner commented 3 years ago

Here is a potential solution by changing the interrupt_handler:

def interrupt_handler(recvsignal, frame):
    sys.stderr.write('ERROR: Interrupt received: Terminating\n')
    # send SIGTERM to ourselves: its default action terminates the
    # whole process, instead of raising a Python exception that
    # ctypes would swallow inside the Fortran call
    os.kill(os.getpid(), signal.SIGTERM)

JohannesBuchner commented 3 years ago

What does the error look like for setting signal handlers in dask threads?

Is there any reason you want threads instead of processes with dask (it supports several backends afaik)?

segasai commented 3 years ago

Thanks for taking a look at the root of the issue.

import time
from numpy import pi, cos
from dask.distributed import Client
from pymultinest.solve import solve

def myprior(cube):
    return cube * 10 * pi

def myloglike(cube):
    # emulate a long-running likelihood
    time.sleep(5)
    chi = (cos(cube * 2.) + 1).sum()
    return (2. + chi)

def func(a):
    # number of dimensions our problem has
    parameters = ["x", "y"]
    n_params = 10000  # len(parameters)
    # name of the output files
    prefix = "chains/3-"

    # run MultiNest
    result = solve(LogLikelihood=myloglike,
                   Prior=myprior,
                   n_live_points=100000,
                   n_dims=n_params,
                   outputfiles_basename=prefix,
                   verbose=True)

if __name__ == '__main__':
    C = Client()
    for i in range(10):
        C.submit(func, i)

and this simple configuration

dask-scheduler --scheduler-file=sched.json & dask-worker --scheduler-file=sched.json &

JohannesBuchner commented 3 years ago

Seems like Client is implicitly creating a LocalCluster with processes=False by default. Try processes=True? https://docs.dask.org/en/latest/setup/single-distributed.html
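i.e. something along these lines (a sketch of the suggestion, not a verified fix; the keyword arguments are forwarded to the implicitly created LocalCluster):

from dask.distributed import Client

# one single-threaded worker process per task slot
client = Client(processes=True, n_workers=4, threads_per_worker=1)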

segasai commented 3 years ago

Not in my case; I'm providing a scheduler file. In fact, in practice I'm starting dask-workers on tens of nodes of a cluster. And even if I start them with

$ dask-worker --scheduler-file=sched.json --nthreads=1 

the tasks still run inside a thread, because dask-worker uses futures.ThreadPoolExecutor: https://github.com/dask/distributed/blob/a0d60e7b2ad16304128c7989a111539a4d80ed38/distributed/worker.py#L631

segasai commented 3 years ago

One option to fix my issue, which seems to work, is to put the signal.signal() call inside if threading.active_count() == 1: (that still keeps signal.signal as a no-op, though).
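Roughly (a sketch of that guard; a more direct check would be threading.current_thread() is threading.main_thread()):

import os
import signal
import sys
import threading

def interrupt_handler(recvsignal, frame):
    sys.stderr.write('ERROR: Interrupt received: Terminating\n')
    os.kill(os.getpid(), signal.SIGTERM)

# only register the handler when no other threads are running, which
# implies we are in the main thread; in a dask worker thread this is
# skipped instead of raising "ValueError: signal only works in main thread"
if threading.active_count() == 1:
    signal.signal(signal.SIGINT, interrupt_handler)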

JohannesBuchner commented 3 years ago

Would putting C.shutdown() into the signal handler work?

https://stackoverflow.com/questions/50919227/is-it-possible-to-shutdown-a-dask-distributed-cluster-given-a-client-instance

segasai commented 3 years ago

I don't think it solves anything. IMO there are multiple issues/questions here that potentially require solving.

JohannesBuchner commented 3 years ago

I pushed what I mentioned in https://github.com/JohannesBuchner/PyMultiNest/issues/178#issuecomment-773948089 as a commit, which addresses the second point.

segasai commented 3 years ago

I've created a PR #179 to not use signal handlers in threaded environments

segasai commented 3 years ago

Thanks for merging the PR!