fzeiser closed this issue 4 years ago
This is a good question. The core of GAMBIT does not make use of OpenMP threads itself, but internally in the various Bits you can use OpenMP threading. (We currently don't in OMBit, though.) I'm not sure what happens if a backend tries to use OpenMP -- it would certainly be interesting to test if it works, or if (and how) it fails.
As of now, the gledeli backend uses OpenMP (through ompy) here: https://github.com/anderkve/gambit_np/blob/c811eaef81eb286cd65bf0f70c16876d370b53c8/Backends/installed/gledeli/1.0/gledeli/lnlike_firstgen.py#L72-L74
Actually, it depends on the compilation settings of ompy whether it uses OpenMP there, but it has the possibility to use it (it was difficult to get OpenMP running on Macs). The way I run GAMBIT now, I get 8 OpenMP threads, and I can confirm that they are used on the machine I run it on.
They were not used before I added nld_T_product, so they are attributable to the backend.
Starting GAMBIT
----------
WARNING! Running in SERIAL (no MPI) mode! Recompile with -DWITH_MPI=1 for MPI parallelisation
----------
Running with 8 OpenMP threads per MPI process (set by the environment variable OMP_NUM_THREADS).
Cool! Assuming that the results you got with the 8 OpenMP threads are as you would expect when running outside of GAMBIT, the next question is whether this still works when we run with MPI-parallelised scanning in GAMBIT.
You can give it a try by rebuilding GAMBIT with -DWITH_MPI=True in the cmake command, and then launching GAMBIT with something like mpiexec -np <number of MPI processes> ./gambit -rf yaml_files/ombit_demo.yaml
I just have to wait for the current run to finish, which will take some time. Then I can check whether it still works (though the test is not necessarily very smart, as only one of the likelihood functions is OpenMP capable).
Assuming that the results you got with the 8 OpenMP threads are as you would expect running outside of GAMBIT
What do you mean by this?
Ah, just that the result you get for a given parameter point with GAMBIT + GLEDELi is the same as you would get running GLEDELi as a stand-alone tool outside of a GAMBIT scan. I'm sure it is, though -- just wanted to be sure that it's working as expected before moving on to including MPI. (I've seen some weird threading/memory bugs in my days...)
I'm sure it is, though -- just wanted to be sure that it's working as expected before moving on to including MPI. (I've seen some weird threading/memory bugs in my days...)
Scary! I will check that, too. I realized I can just create one more VM to run GAMBIT on and check this simultaneously ;P.
When compiling GAMBIT without MPI support, GAMBIT + GLEDELi gives the same results as the same parameters run through GLEDELi directly. Checked against a parameter set from the random scanner. [No MPI; OpenMP: 8 cores]
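The consistency check described above can be sketched as follows. This is a minimal stand-alone illustration, not the real GLEDELi API: the helper name, the input values, and the tolerance are all assumptions for illustration.

```python
import math

# Hypothetical helper: compare the log-likelihood GAMBIT logs for a
# parameter point with the value from a stand-alone GLEDELi evaluation
# of the same point. Both inputs are plain floats read from the logs.
def likelihoods_match(lnlike_gambit, lnlike_standalone, rel_tol=1e-9):
    return math.isclose(lnlike_gambit, lnlike_standalone, rel_tol=rel_tol)

print(likelihoods_match(-1234.567890, -1234.567890))  # → True
print(likelihoods_match(-1234.567890, -1234.567000))  # → False
```

A tight relative tolerance is deliberate here: threading or memory bugs often show up as small, irreproducible differences rather than wildly wrong values.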
Now I've tried to recompile it with MPI and I run into some other errors. I get a segmentation fault at the end of the run (after GAMBIT has finished successfully!). I had to modify the system quite a bit to be able to get the MPI compilation running. After all these modifications, the segmentation fault appears both for compilation with and without MPI support. :/
[The calculations seem to run just fine before that; I get the logs of each MPI process etc., so I'll assume for now that things work fine and report the results under this assumption.]
Assuming that MPI works fine and that the error is related to something else(?), the calculations also match with MPI. I tried the same as above, taking some parameters from OMBit.log_2 and feeding them to gledeliBE.py. It returned the same likelihoods.
This is very good news! :)
The segfault at the end sounds like a known issue that came up a month or two ago in GAMBIT. I'm pretty sure it was due to some MPI-related object in GAMBIT not being properly deleted before the MPI machinery is shut down. I'll have a look at if/how that was fixed and try to port that to our code. But it doesn't actually affect anything, so we can just keep on going with MPI.
Thanks for checking this out. Actually, on a longer run (with MultiNest instead of random), I encountered the segmentation fault during the run. Maybe it's the same problem. I hope so ;). Tired of fixing hiccups. But that's part of the game, I guess.
Do you have any "standard setup"? Like a docker image, or another "usual setup" you run GAMBIT with and know that it is as stable as possible there?
[...]
Nested Sampling ln(Z): -8321.759285
Importance Nested Sampling ln(Z): -4748.853431 +/- 0.999989
Acceptance Rate: 0.228910
Replacements: 10900
Total Samples: 47617
Nested Sampling ln(Z): -7937.874406
Importance Nested Sampling ln(Z): -4520.457825 +/- 0.999989
[gambit-np-mpi:21573] *** Process received signal ***
[gambit-np-mpi:21573] Signal: Segmentation fault (11)
[gambit-np-mpi:21573] Signal code: Invalid permissions (2)
[gambit-np-mpi:21573] Failing at address: 0x7f4d36951000
[gambit-np-mpi:21573] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f4d51a95390]
[gambit-np-mpi:21573] [ 1] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__posterior_MOD_pos_samp+0x3a4d)[0x7f4d1f3f3bb8]
[gambit-np-mpi:21573] [ 2] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__nested_MOD_clusterednest+0x21382)[0x7f4d1f427076]
[gambit-np-mpi:21573] [ 3] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__nested_MOD_nestsample+0x10b7)[0x7f4d1f42c0b5]
[gambit-np-mpi:21573] [ 4] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__nested_MOD_nestrun+0x1970)[0x7f4d1f42de31]
[gambit-np-mpi:21573] [ 5] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(run+0x25a)[0x7f4d1f42e0eb]
[gambit-np-mpi:21573] [ 6] /home/ubuntu/gambit_np/ScannerBit/lib/libscanner_multinest_3.11.so(_ZN60__gambit_plugin_multinest__t__scanner__v__3_11___namespace__11PLUGIN_MAINEv+0x1146)[0x7f4d1f664ee1]
[gambit-np-mpi:21573] [ 7] ./gambit[0x97de3d]
[gambit-np-mpi:21573] [ 8] ./gambit[0x979ee6]
[gambit-np-mpi:21573] [ 9] ./gambit[0x974a87]
[gambit-np-mpi:21573] [10] ./gambit[0x6a4de3]
[gambit-np-mpi:21573] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f4d4f1af830]
[gambit-np-mpi:21573] [12] ./gambit(_start+0x29)[0x65f6f9]
[gambit-np-mpi:21573] *** End of error message ***
rank 4: Tried to synchronise for shutdown (attempt 11) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
rank 5: Tried to synchronise for shutdown (attempt 11) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
rank 6: Tried to synchronise for shutdown (attempt 11) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
rank 7: Tried to synchronise for shutdown (attempt 11) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
rank 7: Tried to synchronise for shutdown (attempt 21) but failed. Will now fast-forward through 1000 iterations in an attempt to 'unlock' possible MPI deadlocks with the scanner.
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 21573 on node gambit-np-mpi exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I will try to reproduce this
ToDo:
I get a segmentation fault at the end of the run (After GAMBIT has finished successfully!).
I can at least reproduce this segfault at the end, both with multinest and the random scanner. Interestingly it happens even when I only use the cout printer. Will keep investigating...
Two questions related to the MPI problems:
Does running mpiexec -np 2 ./gambit -rf yaml_files/spartan.yaml work for you? For me this example works with MPI, with the hdf5 printer, and with all three main scanners (random, de, multinest). (Note that you need to delete the folder runs/spartan in between attempts.)
Currently it seems that gledeli imports pymultinest via ompy. This seems to be causing problems for me. Is it necessary for gledeli to pull in pymultinest via ompy, or can we avoid this?
Currently it seems that gledeli imports pymultinest via ompy. This seems to be causing problems for me. Is it necessary for gledeli to pull in pymultinest via ompy, or can we avoid this?
It is not necessary, it's a side effect of how ompy is written. It's on the list of things to change, so now it'll just have to change sooner ;P.
Ah, cool. Thanks. :)
Here is a monkeypatch until I have rewritten things properly: bb80610 on the branch fix/mock_pymultinest. It just mocks pymultinest instead of importing it. Could you try to run with it? (The VM with GAMBIT just crashed/doesn't want to restart, so I need to fix this first.)
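The idea behind such a monkeypatch can be sketched as follows. This is a minimal stand-alone illustration, not the actual commit; the stubbed `run` attribute is an assumption about what ompy touches.

```python
import sys
import types

# Register a dummy module under the name "pymultinest" *before* anything
# imports ompy, so that ompy's "import pymultinest" resolves to this stub
# (via the sys.modules cache) instead of the real, MPI-linked package.
mock = types.ModuleType("pymultinest")
mock.run = lambda *args, **kwargs: None  # assumed entry point; does nothing
sys.modules["pymultinest"] = mock

import pymultinest  # resolves to the stub registered above
print(pymultinest.run() is None)  # → True
```

Because Python checks sys.modules first on every import, any later `import pymultinest` anywhere in the process (including inside ompy) gets the stub, so the real library's MPI initialisation is never triggered.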
Thanks, that was quick! :)
That helped a lot. Now I see the following:
Simply by running diagnostic commands such as ./gambit scanners or ./gambit backends, I get a segfault at the end. This does not happen if I deactivate gledeliBE by commenting it out from config/backend_locations.yaml.default. And it does happen if I replace gledeliBE.py with a completely empty dummy Python file.
So in conclusion, the segfault after GAMBIT finishes is related to Pybind11 and the closing of the Python interpreter (which only runs if there is at least one Python backend connected) at the end of a GAMBIT run, and has nothing to do with the hdf5 printer or with gledeli. It is probably the same problem that I saw emails about on the GAMBIT email lists today. So this will go away when it's fixed in GAMBIT (or in Pybind11) and we re-connect our gambit_np development with the main GAMBIT repo.
Now, ignoring the segfault at the end, I can successfully run mpiexec -np 2 ./gambit -rf yaml_files/ombit_demo.yaml (reduced to a quick 1D scan for speed) with all the following combinations of scanner + printer:
So this seems very promising :)
Perfect. Let's keep this open until we have found out whether GLEDELi should be included via NUCBit / NUCLEARBit in GAMBIT or whether we fork GAMBIT.
I can now run the sample (mocking pymultinest) with the system described in #18. Thanks a lot!
However, in a longer run, I get the same error as above (a different memory address, of course). Let's see what will happen once we implement the changes.
@anderkve Can you reproduce the error? mpiexec -np 2 ./gambit -rf yaml_files/ombit_demo.yaml on ea30b8ce65998b5b5a1530d49eded6053e58fcd7 with ombit_demo.yaml.txt (note that I had to rename the file for the upload here). The error comes after some time, maybe ~30 min.
Trying to reproduce the error now. Quick question: the yaml file is set up to run the differential evolution scanner. Is that on purpose, or should I be running multinest instead? Above you say
However, in a longer run, I get the same error as above (a different memory address, of course).
which points to multinest:
[gambit-np-mpi:21573] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f4d51a95390]
[gambit-np-mpi:21573] [ 1] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__posterior_MOD_pos_samp+0x3a4d)[0x7f4d1f3f3bb8]
[gambit-np-mpi:21573] [ 2] /home/ubuntu/gambit_np/ScannerBit/installed/multinest/3.11/libnest3.so(__nested_MOD_clusterednest+0x21382)[0x7f4d1f427076]
Quick question: the yaml file is set up to run the differential evolution scanner. Is that on purpose, or should I be running multinest instead?
Sorry, I must have uploaded nonsense! :/ Changes to the above:
# printer: cout
# use_scanner: de
use_scanner: multinest
Running with mpiexec -np 2 ./gambit -rf yaml_files/ombit_demo.yaml, the differential evolution scan finished successfully in ~15 minutes on my laptop. No segfault, except the known one at the end (after GAMBIT has finished successfully!). Will try with multinest next.
Another quick question: what compiler versions did you use to build GAMBIT + the scanners? I'm currently testing with gcc/g++/gfortran version 8.4.0
(The multinest scan has been going for ~40 mins)
I used gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 [same version for the other compilers]. The scan ran into a problem after ~0.5-2 h, if I recall correctly.
No error yet? I guess I will recompile with gcc 8.X then, because I just reran and got an issue again.
Nope, no error so far. Has been running for almost 2.5 hours.
Btw, I'm running with export OMP_NUM_THREADS=1. I just noticed that your original stack trace points to libpthread, so perhaps this matters for some reason.
OMP_NUM_THREADS=1 should not affect it, since GAMBIT already fixes this, but I'll try it before I recompile.
[...]
Starting GAMBIT
----------
Running in MPI-parallel mode with 2 processes
----------
Running with 1 OpenMP threads per MPI process (set by the environment variable OMP_NUM_THREADS).
[...]
Hm, interesting. On my system GAMBIT automatically uses the maximum number of OpenMP threads unless I explicitly set OMP_NUM_THREADS=1 in that terminal session.
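That behaviour matches the usual OpenMP default: if OMP_NUM_THREADS is unset, the runtime falls back to the number of available cores. A minimal Python sketch of that rule (an illustration of the convention, not GAMBIT's actual code):

```python
import multiprocessing
import os

def effective_omp_threads(env=None):
    """Mimic the spirit of omp_get_max_threads(): honour OMP_NUM_THREADS
    if it is set, otherwise fall back to the number of CPU cores."""
    env = os.environ if env is None else env
    value = env.get("OMP_NUM_THREADS")
    return int(value) if value else multiprocessing.cpu_count()

print(effective_omp_threads({"OMP_NUM_THREADS": "1"}))  # → 1
print(effective_omp_threads({"OMP_NUM_THREADS": "8"}))  # → 8
```

This is why the two systems behave differently: on a machine where the variable is exported in the shell, every GAMBIT run inherits it; otherwise each run defaults to all cores.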
I've scrapped the system and can't seem to reproduce my error now, regardless of whether I use gcc(...) 7 or 8. Sorry for bothering you, and thanks for trying this!
Well that's great news! :) My multinest scan has been running for ~4 hours now. Currently the change in ln(Z) between updates is ~0.5, so I guess it's not too far away from being done...
Would you be interested in the results? I.e. should I let it finish? Or do you want to run a bigger scan yourself? Currently the hdf5 file contains ~1M samples.
Uh, no, here the hiccup came back for mpiexec -np 4 with compilers of version 7.5. Version 8.4 is still running, as is version 7 with mpiexec -np 2. I'll update you after the weekend on success/fail.
Would you be interested in the results? I need to change the priors slightly; I have the results for the current priors from when I ran without MPI. But thanks for the effort!
The issue might be something independent of GAMBIT! I have installed GAMBIT on one of NREC's (normal) VMs. They are overcommitted, I think also in memory, so this might explain why I get "hiccups" at seemingly arbitrary times. I'll try to look up more on this and close this issue unless the problem comes up again.
Uh, no, here the hiccup came back for mpiexec -np 4 with compilers of version 7.5.
Ah, too bad. Did the stack trace look the same?
The issue might be something independent of GAMBIT! I have installed GAMBIT on one of NREC's (normal) VMs. They are overcommitted, I think also in memory, so this might explain why I get "hiccups" at seemingly arbitrary times.
Ah OK. That might be it.
You may want to try running through the GNU debugger gdb. That can often give a more detailed stack trace. In case you haven't used it before: gdb --args ./gambit -rf yaml_files/ombit_demo.yaml. This gives you the gdb prompt. Type r to start running GAMBIT. When it has crashed, type bt to see the stack trace.
Check whether ompy runs with OpenMP internally -- and whether it should do so, as GAMBIT already runs with OpenMP