ExaFEL / exafel_project

ExaFEL project to be included in CCTBX modules
https://exafel.github.io/docs

Mp big objects #14

Closed Baharis closed 1 year ago

Baharis commented 1 year ago

This PR eliminates mpi4py communication failures when sending a large volume of structure factors (>2 GB) between ranks. Without this PR, attempting to simulate PSII data up to 3.0 Å resolution using kpp_utils/LY99_batch.py results in an OverflowError or SystemError. With this PR, the structure factor dictionaries are sent value by value, which might slightly slow down the communication but prevents these errors. Modeling extremely large datasets might still fail at a later stage if a single rank does not have enough memory to store everything. The following table shows the approximate execution status and time when simulating 100 PSII frames on 4 tasks on a Perlmutter GPU node before and after the PR:

| Resolution | Before | After |
|---|---|---|
| 3.5 Å | Terminates successfully after 7 min | Terminates successfully after 8 min |
| 3.0 Å | Raises OverflowError and hangs after 8 min | Terminates successfully after 9 min |
| 1.5 Å | Raises SystemError and hangs after 30 min | Terminates successfully after 30 min |

This table documents only a small subset of the tests performed. For a PSII job completed on this branch, see directories /global/cfs/cdirs/m3562/users/dtchon/p20231/common/ensemble1/SPREAD/SIM/0simulate and /pscratch/sd/d/dtchon/psii_sim/8707131/.
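To illustrate the value-by-value idea described above, here is a minimal sketch. It is not the code from this PR. The helper names `send_dict_by_value` and `recv_dict_by_value`, the tag values, and the `LoopbackComm` stand-in are all hypothetical and exist only so that the example runs without an MPI launcher. The only assumption about the communicator is that it offers pickle-based `send(obj, dest, tag)` and `recv(source, tag)` methods, as mpi4py communicators do.

```python
from collections import deque
from typing import Any, Dict, Hashable


class LoopbackComm:
    """Hypothetical in-process stand-in for an MPI communicator (demo only)."""

    def __init__(self) -> None:
        self._queue: deque = deque()

    def send(self, obj: Any, dest: int, tag: int = 0) -> None:
        # A real communicator would pickle obj and transmit it to rank `dest`.
        self._queue.append(obj)

    def recv(self, source: int = 0, tag: int = 0) -> Any:
        # A real communicator would block until a matching message arrives.
        return self._queue.popleft()


def send_dict_by_value(comm: Any, data: Dict[Hashable, Any], dest: int) -> None:
    """Send the key list first, then every value as a separate message."""
    keys = list(data)
    comm.send(keys, dest=dest, tag=0)
    for key in keys:
        # Each value is its own message, so no single message holds the whole dict.
        comm.send(data[key], dest=dest, tag=1)


def recv_dict_by_value(comm: Any, source: int) -> Dict[Hashable, Any]:
    """Rebuild the dictionary from the key list and the per-value messages."""
    keys = comm.recv(source=source, tag=0)
    return {key: comm.recv(source=source, tag=1) for key in keys}


# Demo with the stand-in communicator; the dictionary contents are made up.
comm = LoopbackComm()
amplitudes = {"channel_0": [1.0, 2.0], "channel_1": [3.0]}
send_dict_by_value(comm, amplitudes, dest=0)
received = recv_dict_by_value(comm, source=0)
```

With mpi4py, the same two helpers could presumably be called with `MPI.COMM_WORLD` in place of `LoopbackComm()`, with the sender and receiver running on different ranks. Each individual value must still fit in a single message, and the receiving rank still needs enough memory for the reassembled dictionary.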

Summary of individual changes:

Baharis commented 1 year ago

Note: When writing this PR, I assumed that this code will never be run under Python 2 and, consequently, used some Python 3.6+ syntax (f-strings, type hints), which improves readability.

Baharis commented 1 year ago

As mentioned in #15, I realized that I need to revert all logging-related changes because the log must be machine-readable, whereas I made it human-readable.

Baharis commented 1 year ago

Converting this PR to a draft in order to restrict all introduced changes strictly to MPI updates. I will convert it back and request review when all the cleaning is done.

Baharis commented 1 year ago

I had to resolve some issues while rebasing and force-pushed afterwards, but the branch now appears to be in the desired state. Reopening for review.

Baharis commented 1 year ago

Reviewed and accepted by Nick for a squash-merge.