Closed Baharis closed 1 year ago
Note: When writing this PR I assumed that this code will never be run using Python 2 and, consequently, included some Python 3.6+ syntax (f-strings, type hints), which improves readability.
As mentioned in #15, I noticed that I will need to revert all changes associated with logging because the log must be computer-readable, whereas I made it human-readable.
Converting this PR to a draft in order to restrict all introduced changes strictly to MPI updates. I will convert it back and request review when all the cleaning is done.
I had to resolve some issues while rebasing and force-pushed afterwards, but it looks like the branch is now in a desirable state. Reopening for review.
Reviewed and accepted by Nick for a squash-merge.
This PR eliminates `mpi4py` communication issues when trying to send large amounts of structure factors (>2 GB) between ranks. Without this PR, attempting to simulate PSII data up to 3.0 Angstrom resolution using `kpp_utils/LY99_batch.py` results in an `OverflowError` or `SystemError`. After this PR, the structure factor dictionaries are sent value-by-value, which might slightly slow down the communication but prevents said errors. Attempting to model extremely large datasets might still fail due to memory issues at a later stage if there is not enough memory to store everything on a single rank.

The following table shows approximate execution status and time when simulating 100 PSII frames on 4 tasks on a Perlmutter GPU node before and after the PR:

- `OverflowError` and hangs after 8 min
- `SystemError` and hangs after 30 min

This table documents only a small subset of performed tests. For a PSII job finished on this branch, see directories `/global/cfs/cdirs/m3562/users/dtchon/p20231/common/ensemble1/SPREAD/SIM/0simulate` and `/pscratch/sd/d/dtchon/psii_sim/8707131/`.

Summary of individual changes:
- Use `bcast_large_dict` to broadcast `sfall_info` separately from `transmitted_info` and add it afterwards;
- Add `bcast_large_dict` and `collect_large_dict` to broadcast and gather+reconstruct dictionaries by passing their items one-by-one;
- Use `collect_large_dict` to gather all `sfall_channels` from individual ranks.
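
The item-by-item idea behind the changes above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: only the names `bcast_large_dict` and `collect_large_dict` come from the PR, while the signatures, the `SerialComm` fallback, and the merge rule (ranks are assumed to hold disjoint keys) are assumptions made for the example. The point is that each pickled message then carries a single value rather than the whole dictionary, so each one is far more likely to stay under the size limit.

```python
"""Sketch: move a large dict between MPI ranks one item at a time."""

try:
    from mpi4py import MPI
    COMM = MPI.COMM_WORLD
except ImportError:  # allow the sketch to run without mpi4py installed
    COMM = None


class SerialComm:
    """Hypothetical single-rank stand-in used only when mpi4py is missing."""

    def Get_rank(self):
        return 0

    def bcast(self, obj, root=0):
        return obj

    def gather(self, obj, root=0):
        return [obj]


if COMM is None:
    COMM = SerialComm()


def bcast_large_dict(comm, data=None, root=0):
    """Broadcast `data` from `root`, sending one value per message."""
    rank = comm.Get_rank()
    # All ranks need the same key sequence to issue matching bcast calls.
    keys = comm.bcast(list(data) if rank == root else None, root=root)
    received = {}
    for key in keys:
        value = data[key] if rank == root else None
        received[key] = comm.bcast(value, root=root)
    return received


def collect_large_dict(comm, data, root=0):
    """Gather per-rank dicts on `root` key by key; returns None elsewhere.

    Assumption: ranks hold disjoint keys. If a key repeats, the value from
    the highest rank wins.
    """
    rank = comm.Get_rank()
    all_keys = comm.gather(list(data), root=root)
    union = None
    if rank == root:
        # Ordered union of the keys reported by every rank.
        union = list(dict.fromkeys(k for keys in all_keys for k in keys))
    union = comm.bcast(union, root=root)
    collected = {}
    for key in union:
        # Send a (present, value) pair so that None stays a legal value.
        pairs = comm.gather((key in data, data.get(key)), root=root)
        if rank == root:
            for present, value in pairs:
                if present:
                    collected[key] = value
    return collected if rank == root else None


if __name__ == "__main__":
    rank = COMM.Get_rank()
    # Placeholder payloads; real structure factor data would be much larger.
    sfall_info = {"ch0": [1.0, 2.0], "ch1": [3.0]} if rank == 0 else None
    sfall_info = bcast_large_dict(COMM, sfall_info, root=0)
    local = {f"channel_{rank}": [float(rank)] * 3}
    merged = collect_large_dict(COMM, local, root=0)
    if rank == 0:
        print(sorted(sfall_info), sorted(merged))
```

Run under several ranks (for example with `mpirun -n 4 python sketch.py`, where `sketch.py` is a placeholder file name), every rank ends up with the same `sfall_info`, and rank 0 holds the merged dictionary. The cost is one collective call per key, which matches the slight slowdown mentioned in the description.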