fzeiser closed this issue 4 years ago
This is a good question. The core of GAMBIT does not make use of OpenMP threads itself, but internally in the various Bits you can use OpenMP threading. (We currently don't in OMBit, though.) I'm not sure what happens if a backend tries to use OpenMP -- it would certainly be interesting to test if it works, or if (and how) it fails.
As of now, the gledeli backend uses OpenMP (through ompy) here: https://github.com/anderkve/gambit_np/blob/c811eaef81eb286cd65bf0f70c16876d370b53c8/Backends/installed/gledeli/1.0/gledeli/lnlike_firstgen.py#L72-L74
Actually, it depends on the compilation settings of ompy whether it uses OpenMP there, but it has the possibility to use it (it was difficult to get OpenMP running on Macs). The way I run GAMBIT now, I get 8 OpenMP threads, and I can confirm that they are used on the machine I run it on.
They were not used before I added nld_T_product, so they are attributable to the backend.
Starting GAMBIT
----------
WARNING! Running in SERIAL (no MPI) mode! Recompile with -DWITH_MPI=1 for MPI parallelisation
----------
Running with 8 OpenMP threads per MPI process (set by the environment variable OMP_NUM_THREADS).
Cool! Assuming that the results you got with the 8 OpenMP threads are as you would expect when running outside of GAMBIT, the next question is whether this still works when we run with MPI-parallelised scanning in GAMBIT.
You can give it a try by rebuilding GAMBIT with -DWITH_MPI=True in the cmake command, and then launching GAMBIT with something like mpiexec -np <number of MPI processes> ./gambit -rf yaml_files/ombit_demo.yaml
I just have to wait for the current run to finish, which will take some time. Then I can check whether it still works (though the test is not necessarily very smart, as only one of the likelihood functions is OpenMP capable).
Assuming that the results you got with the 8 OpenMP threads are as you would expect running outside of GAMBIT
What do you mean by this?
Ah, just that the result you get for a given parameter point with GAMBIT + GLEDELi is the same as you would get running GLEDELi as a stand-alone tool outside of a GAMBIT scan. I'm sure it is, though -- just wanted to be sure that it's working as expected before moving on to including MPI. (I've seen some weird threading/memory bugs in my days...)
I'm sure it is, though -- just wanted to be sure that it's working as expected before moving on to including MPI. (I've seen some weird threading/memory bugs in my days...)
Scary! I will check that, too. I realized I can just create one more VM to run GAMBIT on and check this simultaneously ;P.
When compiling GAMBIT without MPI support, GAMBIT + GLEDELi gives the same results as the same parameters run through GLEDELi directly. Checked against a parameter set from the random scanner. [No MPI; OpenMP: 8 cores]
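The consistency check described above can be sketched as follows. This is a minimal stand-alone illustration, not the real GLEDELi API: the helper name, the input values, and the tolerance are all assumptions for illustration.

```python
import math

# Hypothetical helper: compare the log-likelihood GAMBIT logs for a
# parameter point with the value from a stand-alone GLEDELi evaluation
# of the same point. Both inputs are plain floats read from the logs.
def likelihoods_match(lnlike_gambit, lnlike_standalone, rel_tol=1e-9):
    return math.isclose(lnlike_gambit, lnlike_standalone, rel_tol=rel_tol)

print(likelihoods_match(-1234.567890, -1234.567890))  # → True
print(likelihoods_match(-1234.567890, -1234.567000))  # → False
```

A tight relative tolerance is deliberate here: threading or memory bugs often show up as small, irreproducible differences rather than wildly wrong values.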
Now I've tried to recompile it with MPI and I run into some other errors. I get a segmentation fault at the end of the run (after GAMBIT has finished successfully!). I had to modify the system quite a bit to be able to get the MPI compilation running. After all these modifications, the segmentation fault appears both for compilation with and without MPI support. :/
[The calculations seem to run just fine before that; I get the logs of each MPI process etc., so I'll assume for now that things work fine and report the results under this assumption.]
Assuming that MPI works fine and that the error is related to something else(?), the calculations also match with MPI. I tried the same as above, taking some parameters from OMBit.log_2 and feeding them to gledeliBE.py. It returned the same likelihoods.
This is very good news! :)
The segfault at the end sounds like a known issue that came up a month or two ago in GAMBIT. I'm pretty sure it was due to some MPI-related object in GAMBIT not being properly deleted before the MPI machinery is shut down. I'll have a look at if/how that was fixed and try to port that to our code. But it doesn't actually affect anything, so we can just keep on going with MPI.
Thanks for checking this out. Actually, on a longer run (with MultiNest instead of random), I encountered the segmentation fault during the run. Maybe it's the same problem. I hope so ;). Tired of fixing hiccups. But that's part of the game, I guess.
Do you have any "standard setup"? Like a docker image, or another "usual setup" you run GAMBIT with and know that it is as stable as possible there?
[...]
Nested Sampling ln(Z): -8321.759285
Importance Nested Sampling ln(Z): -4748.853431 +/- 0.999989
Acceptance Rate: 0.228910
Replacements: 10900
Total Samples: 47617
Nested Sampling ln(Z): -7937.874406
Importance Nested Sampling ln(Z): -4520.457825 +/- 0.999989
[gambit-np-mpi:21573] *** Process received signal ***
[gambit-np-mpi:21573] Signal: Segmentation fault (11)
[gambit-np-mpi:21573] Signal code: Invalid permissions (2)
[gambit-np-mpi:21573] Failing at address: 0x7f4d36951000
[gambit-np-mpi:21573] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f4d51a95390]
[gambit-np-mpi:21573] [ 1] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__posterior_MOD_pos_samp+0x3a4d)[0x7f4d1f3f3bb8]
[gambit-np-mpi:21573] [ 2] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__nested_MOD_clusterednest+0x21382)[0x7f4d1f427076]
[gambit-np-mpi:21573] [ 3] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__nested_MOD_nestsample+0x10b7)[0x7f4d1f42c0b5]
[gambit-np-mpi:21573] [ 4] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__nested_MOD_nestrun+0x1970)[0x7f4d1f42de31]
[gambit-np-mpi:21573] [ 5] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(run+0x25a)[0x7f4d1f42e0eb]
[gambit-np-mpi:21573] [ 6] /home/ubuntu/gambit_np/ScannerBit/lib/libscanner_multinest_3.11.so(_ZN60__gambit_plugin_multinest__t__scanner__v__3_11___namespace__11PLUGIN_MAINEv+0x1146)[0x7f4d1f664ee1]
[gambit-np-mpi:21573] [ 7] ./gambit[0x97de3d]
[gambit-np-mpi:21573] [ 8] ./gambit[0x979ee6]
[gambit-np-mpi:21573] [ 9] ./gambit[0x974a87]
[gambit-np-mpi:21573] [10] ./gambit[0x6a4de3]
[gambit-np-mpi:21573] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f4d4f1af830]
[gambit-np-mpi:21573] [12] ./gambit(_start+0x29)[0x65f6f9]
[gambit-np-mpi:21573] *** End of error message ***
rank 4: Tried to synchronise for shutdown (attempt 11) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
rank 5: Tried to synchronise for shutdown (attempt 11) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
rank 6: Tried to synchronise for shutdown (attempt 11) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
rank 7: Tried to synchronise for shutdown (attempt 11) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
rank 7: Tried to synchronise for shutdown (attempt 21) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 21573 on node gambit-np-mpi exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I will try to reproduce this
ToDo:
I get a segmentation fault at the end of the run (After GAMBIT has finished successfully!).
I can at least reproduce this segfault at the end, both with multinest and the random scanner. Interestingly it happens even when I only use the cout printer. Will keep investigating...
Two questions related to the MPI problems:
Does running mpiexec -np 2 ./gambit -rf yaml_files/spartan.yaml work for you? For me this example works with MPI, with the hdf5 printer, and with all three main scanners (random, de, multinest). (Note that you need to delete the folder runs/spartan in between attempts.)
Currently it seems that gledeli imports pymultinest via ompy. This seems to be causing problems for me. Is it necessary for gledeli to pull in pymultinest via ompy, or can we avoid this?
Currently it seems that gledeli imports pymultinest via ompy. This seems to be causing problems for me. Is it necessary for gledeli to pull in pymultinest via ompy, or can we avoid this?
It is not necessary, it's a side effect of how ompy is written. It's on the list of things to change, so now it'll just have to change sooner ;P.
Ah, cool. Thanks. :)
Here is a monkeypatch until I have rewritten things properly: bb80610 on the branch fix/mock_pymultinest. It just mocks pymultinest instead of importing it. Could you try to run with it? (The VM with GAMBIT just crashed/doesn't want to restart, so I need to fix this first.)
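The idea behind such a monkeypatch can be sketched as follows. This is a minimal stand-alone illustration, not the actual commit; the stubbed `run` attribute is an assumption about what ompy touches.

```python
import sys
import types

# Register a dummy module under the name "pymultinest" *before* anything
# imports ompy, so that ompy's "import pymultinest" resolves to this stub
# (via the sys.modules cache) instead of the real, MPI-linked package.
mock = types.ModuleType("pymultinest")
mock.run = lambda *args, **kwargs: None  # assumed entry point; does nothing
sys.modules["pymultinest"] = mock

import pymultinest  # resolves to the stub registered above
print(pymultinest.run() is None)  # → True
```

Because Python checks sys.modules first on every import, any later `import pymultinest` anywhere in the process (including inside ompy) gets the stub, so the real library's MPI initialisation is never triggered.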
Thanks, that was quick! :)
That helped a lot. Now I see the following:
Simply by running diagnostic commands such as ./gambit scanners or ./gambit backends, I get a segfault at the end. This does not happen if I deactivate gledeliBE by commenting it out from config/backend_locations.yaml.default. And it does happen if I replace gledeliBE.py with a completely empty dummy Python file.
So in conclusion, the segfault after GAMBIT finishes is related to Pybind11 and the closing of the Python interpreter (which only runs if there is at least one Python backend connected) at the end of a GAMBIT run, and has nothing to do with the hdf5 printer or with gledeli. It is probably the same problem that I saw emails about on the GAMBIT email lists today. So this will go away when it's fixed in GAMBIT (or in Pybind11) and we re-connect our gambit_np development with the main GAMBIT repo.
Now, ignoring the segfault at the end, I can successfully run mpiexec -np 2 ./gambit -rf yaml_files/ombit_demo.yaml (reduced to a quick 1D scan for speed) with all the following combinations of scanner + printer:
So this seems very promising :)
Perfect. Let's keep this open until we have found out whether GLEDELi should be included via NUCBit / NUCLEARBit in GAMBIT or whether we fork GAMBIT.
I can now run the sample (mocking pymultinest) with the system described in #18. Thanks a lot!
However, in a longer run, I get the same error as above (a different memory address, of course). Let's see what will happen once we implement the changes.
@anderkve Can you reproduce the error? mpiexec -np 2 ./gambit -rf yaml_files/ombit_demo.yaml on ea30b8ce65998b5b5a1530d49eded6053e58fcd7 with ombit_demo.yaml.txt (note that I had to rename the file for the upload here). The error comes after some time, maybe ~30 min.
Trying to reproduce the error now. Quick question: the yaml file is set up to run the differential evolution scanner. Is that on purpose, or should I be running multinest instead? Above you say
However, in a longer run, I get the same error as above (a different memory address, of course).
which points to multinest:
[gambit-np-mpi:21573] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f4d51a95390]
[gambit-np-mpi:21573] [ 1] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__posterior_MOD_pos_samp+0x3a4d)[0x7f4d1f3f3bb8]
[gambit-np-mpi:21573] [ 2] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__nested_MOD_clusterednest+0x21382)[0x7f4d1f427076]
Quick question: the yaml file is set up to run the differential evolution scanner. Is that on purpose, or should I be running multinest instead?
Sorry, I must have uploaded nonsense! :/ Changes to the above:
# printer: cout
# use_scanner: de
use_scanner: multinest
Running with mpiexec -np 2 ./gambit -rf yaml_files/ombit_demo.yaml, the differential evolution scan finished successfully in ~15 minutes on my laptop. No segfault, except the known one at the end (after GAMBIT has finished successfully!). Will try with multinest next.
Another quick question: what compiler versions did you use to build GAMBIT + the scanners? I'm currently testing with gcc/g++/gfortran version 8.4.0
(The multinest scan has been going for ~40 mins)
I used gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 [same version for the other compilers]. The scan ran into a problem after ~0.5-2 h, if I recall correctly.
No error yet? I guess I will recompile with gcc 8.X then, because I just reran and got an issue again.
Nope, no error so far. Has been running for almost 2.5 hours.
Btw, I'm running with export OMP_NUM_THREADS=1. I just noticed that your original stack trace points to libpthread, so perhaps this matters for some reason.
OMP_NUM_THREADS=1 should not affect it, since GAMBIT already fixes this, but I'll try it before I recompile.
[...]
Starting GAMBIT
----------
Running in MPI-parallel mode with 2 processes
----------
Running with 1 OpenMP threads per MPI process (set by the environment variable OMP_NUM_THREADS).
[...]
Hm, interesting. On my system GAMBIT automatically uses the maximum number of OpenMP threads unless I explicitly set OMP_NUM_THREADS=1 in that terminal session.
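That behaviour matches the usual OpenMP default: if OMP_NUM_THREADS is unset, the runtime falls back to the number of available cores. A minimal Python sketch of that rule (an illustration of the convention, not GAMBIT's actual code):

```python
import multiprocessing
import os

def effective_omp_threads(env=None):
    """Mimic the spirit of omp_get_max_threads(): honour OMP_NUM_THREADS
    if it is set, otherwise fall back to the number of CPU cores."""
    env = os.environ if env is None else env
    value = env.get("OMP_NUM_THREADS")
    return int(value) if value else multiprocessing.cpu_count()

print(effective_omp_threads({"OMP_NUM_THREADS": "1"}))  # → 1
print(effective_omp_threads({"OMP_NUM_THREADS": "8"}))  # → 8
```

This is why the two systems behave differently: on a machine where the variable is exported in the shell, every GAMBIT run inherits it; otherwise each run defaults to all cores.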
I've scrapped the system and can't seem to reproduce my error now, regardless of whether I use gcc(...) 7 or 8. Sorry for bothering you, and thanks for trying this!
Well that's great news! :) My multinest scan has been running for ~4 hours now. Currently the change in ln(Z) between updates is ~0.5, so I guess it's not too far away from being done...
Would you be interested in the results? I.e. should I let it finish? Or do you want to run a bigger scan yourself? Currently the hdf5 file contains ~1M samples.
Uh, no, here the hiccup came back for mpiexec -np 4 with compilers of version 7.5. Version 8.4 is still running, as is version 7 with mpiexec -np 2. I'll update you after the weekend on success/fail.
Would you be interested in the results? I need to change the priors slightly; I have the results for the current priors from when I ran without MPI. But thanks for the effort!
The issue might be something independent of GAMBIT! I have installed GAMBIT on one of NREC's (normal) VMs. They are overcommitted, I think also in memory, so this might explain why I get "hiccups" at seemingly arbitrary times. I'll try to look up more on this and close this issue unless the problem comes up again.
Uh, no, here the hiccup came back for mpiexec -np 4 with compilers of version 7.5.
Ah, too bad. Did the stack trace look the same?
The issue might be something independent of GAMBIT! I have installed GAMBIT on one of NREC's (normal) VMs. They are overcommitted, I think also in memory, so this might explain why I get "hiccups" at seemingly arbitrary times.
Ah OK. That might be it.
You may want to try running through the GNU debugger gdb. That can often give a more detailed stack trace. In case you haven't used it before: gdb --args ./gambit -rf yaml_files/ombit_demo.yaml. This gives you the gdb prompt. Type r to start running GAMBIT. When it has crashed, type bt to see the stack trace.
Check whether ompy runs with OpenMP internally -- and whether it should do so, as GAMBIT already runs with OpenMP