cctbx / cctbx_project

Computational Crystallography Toolbox
https://cci.lbl.gov/docs/cctbx

XFEL GUI updates #992

Closed: phyy-nx closed this 4 months ago

phyy-nx commented 4 months ago

Series of updates from the most recent beamtime:

Note that commit 1f4c25c9b89b92aba5873d3f5d37ebaf5c749328 switches over to using libtbx.mpi4py everywhere in cctbx, including simtbx.
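
In practice the switch just means replacing direct mpi4py imports with the libtbx wrapper. An illustrative before/after (not a specific file from that commit):

# old: from mpi4py import MPI
from libtbx.mpi4py import MPI  # new: wrapper that can fall back to single-process mode when mpi4py is unavailable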

dermen commented 4 months ago

Interesting edge case on the Summit compute cluster:

$ jsrun -n 1 python test_mpi4py.py 
0 1

$ python test_mpi4py.py 
Error: OMPI_COMM_WORLD_RANK is not set in environment like it should be.
Error checking ibm license.

$ cat test_mpi4py.py 
from libtbx.mpi4py import MPI
COMM = MPI.COMM_WORLD
print(COMM.rank, COMM.size)

I would hope that from libtbx.mpi4py import MPI could catch the failure, but it seems not?

phyy-nx commented 4 months ago

I would hope that from libtbx.mpi4py import MPI could catch the failure, but it seems not?

libtbx.mpi4py is designed to emulate MPI on systems without it, so that the program can at least run as a single process. It is not designed to catch or fix a broken MPI.
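
For illustration, the fallback amounts to something like the sketch below. This is simplified and the names are hypothetical, not the actual libtbx.mpi4py code; the real emulator covers more of the MPI API and may detect a missing mpi4py differently:

# Simplified, hypothetical sketch of the fallback idea; not the actual
# libtbx.mpi4py implementation.
try:
    from mpi4py import MPI  # use real MPI when mpi4py imports cleanly
except ImportError:
    class _FakeComm(object):
        """Single-process stand-in for MPI.COMM_WORLD."""
        rank = 0
        size = 1
        def Barrier(self):
            pass
        def bcast(self, obj, root=0):
            return obj
        def gather(self, obj, root=0):
            return [obj]

    class _FakeMPI(object):
        COMM_WORLD = _FakeComm()

    MPI = _FakeMPI()

# Either way, client code sees the same interface:
COMM = MPI.COMM_WORLD
print(COMM.rank, COMM.size)

An mpi4py that imports but then fails at runtime, as on those login nodes, never reaches the fallback branch, which is why the errors above are not caught.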

That said, we have recently found ourselves, on more than one occasion, in a situation where MPI is broken on the login nodes of a given computing cluster but works fine on the compute nodes. For programs such as the XFEL GUI, this meant that if any import triggered by the GUI brought in mpi4py, the GUI would break on those login nodes. This PR includes three steps to mitigate this problem without addressing the underlying reason that MPI was broken on that cluster.

With these changes, your example could be rewritten as follows, similar to the new guard in the XFEL GUI, and it would run without warnings on a login node or when launched without jsrun:

import libtbx
# Disable real MPI before importing the wrapper; libtbx.mpi4py will then
# fall back to its single-process emulation.
libtbx.mpi_import_guard.disable_mpi = True
from libtbx.mpi4py import MPI
COMM = MPI.COMM_WORLD
print(COMM.rank, COMM.size)

But of course, MPI would be disabled.
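In that case libtbx.mpi4py should fall back to its single-process emulation, so the script would be expected to print 0 1 on a login node, matching the jsrun -n 1 output above but without real MPI behind it.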