Closed: jmstone closed 3 months ago
In GitLab by @jfields7 on Jun 25, 2024, 16:25
requested review from @jmstone216 and @dradice
In GitLab by @jmstone216 on Jun 26, 2024, 08:41
This is a useful fix, and I am happy to merge. However, why not just have rank 0 measure the wall clock time and save the `MPI_Allreduce` (and yet another all-to-all communication)? Is there a reason why rank 0 would measure a significantly smaller time than the other ranks? Since it generally has more work to do, I would think it consistently accumulates the most time.
In GitLab by @jfields7 on Jun 26, 2024, 10:56
added 1 commit
In GitLab by @jfields7 on Jun 26, 2024, 10:59
Sure, I think that makes sense. I've replaced the `MPI_Allreduce` with an `MPI_Bcast` from rank 0. Perlmutter is down for maintenance at the moment, but an MPI test on CPUs suggested it works the same as before.
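As a rough, MPI-free sketch of the difference between the two approaches (hypothetical helper names; the real change lives in `Driver::UpdateWallClock()`): both collectives leave every rank holding the same `elapsed_time`, which is all the termination check needs — the broadcast just uses one root measurement instead of a reduction over all ranks.

```python
# MPI-free analogue of the two synchronization strategies discussed above.
# Each list element stands for one rank's locally measured wall time.

def allreduce_max(local_times):
    # Analogue of MPI_Allreduce(..., MPI_MAX, ...): every rank receives
    # the maximum time measured on any rank.
    m = max(local_times)
    return [m] * len(local_times)

def bcast_from_root(local_times, root=0):
    # Analogue of MPI_Bcast from rank 0: every rank receives rank 0's
    # time, ignoring its own measurement.
    return [local_times[root]] * len(local_times)

times = [30.02, 30.01, 29.98]  # here rank 0 measured the longest time

# Either way, all ranks agree on a single elapsed_time, so the while
# loop in Driver::Execute terminates (or continues) on every rank.
print(allreduce_max(times))    # [30.02, 30.02, 30.02]
print(bcast_from_root(times))  # [30.02, 30.02, 30.02]
```

The two only differ if rank 0 measures a noticeably smaller time than some other rank, which the discussion above argues is unlikely.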
In GitLab by @jmstone216 on Jun 26, 2024, 12:06
approved this merge request
In GitLab by @jmstone216 on Jun 26, 2024, 12:06
mentioned in commit ccaf7ffdad25c5da593b2a8ba51552c949a8ab44
In GitLab by @jfields7 on Jun 25, 2024, 16:25
Merges wall-clock-fix -> master
There's been a consistent problem with the code not always successfully quitting when the wall clock timer runs out, particularly if output or restart files are needed. This is frustrating for large runs, especially if the last checkpoint is several hours before the end of the run.
On further investigation, the issue appears to be related to how the wall clock timer is updated inside `Driver::Execute`: each MPI rank has its own wall clock, and there is no guarantee (a) that these timers are initialized at exactly the same time or (b) that they are updated at the same time. Though the times should be approximately equal, it is entirely possible for one or more ranks to measure a shorter wall time than the rest, especially if initialization takes somewhat longer on some ranks. Consequently, the `while` loop may terminate on some ranks but not on others, leaving the code hanging indefinitely until the process is killed.

This merge request adds a new function, `Driver::UpdateWallClock()`, which measures the time and calls `MPI_Allreduce` to ensure that the latest time measured across all ranks is assigned to `elapsed_time`.

Admittedly, there is no way to prove conclusively that this fixes the stalling issue, but I ran five trials of a torus test designed to run for 30 minutes on 2 nodes on Perlmutter. The single trial without the fix stalled and had to be killed 15 minutes after the wall clock limit should have triggered a shutdown. The four trials with the fix quit successfully less than a minute after the limit was reached. I'm happy to run more tests if desired.
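To make the failure mode concrete, here is a small MPI-free simulation (hypothetical code, not the actual AthenaK implementation) of why per-rank timers can deadlock the loop while a shared maximum cannot:

```python
# Simulate the while-loop termination check in Driver::Execute for a
# handful of ranks, with and without synchronizing the elapsed time.

def ranks_exiting(elapsed_per_rank, limit, synchronize):
    """Return the set of ranks whose loop condition says 'stop'."""
    if synchronize:
        # Analogue of MPI_Allreduce with MPI_MAX: every rank sees the
        # same (latest) elapsed time.
        shared = max(elapsed_per_rank)
        elapsed_per_rank = [shared] * len(elapsed_per_rank)
    return {r for r, t in enumerate(elapsed_per_rank) if t >= limit}

# Rank 2 initialized late, so its timer reads slightly less than the rest.
elapsed = [30.01, 30.02, 29.98, 30.03]  # minutes
limit = 30.0

print(ranks_exiting(elapsed, limit, synchronize=False))  # {0, 1, 3}
print(ranks_exiting(elapsed, limit, synchronize=True))   # {0, 1, 2, 3}
```

Without synchronization, rank 2 keeps iterating while the others exit, so any collective call inside the loop blocks forever — the hang described above. With the shared maximum, every rank makes the same decision.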