Closed: jmstone closed 3 months ago
In GitLab by @jfields7 on Jun 25, 2024, 16:25
requested review from @jmstone216 and @dradice
In GitLab by @jmstone216 on Jun 26, 2024, 08:41
This is a useful fix, and I am happy to merge. However, why not just have rank 0 measure the wall clock time and save the `MPI_Allreduce` (and yet another all-to-all communication)? Is there a reason why rank 0 would measure a significantly smaller time than the other ranks? Since it generally has more work to do, I would think it consistently accumulates the most time.
In GitLab by @jfields7 on Jun 26, 2024, 10:56
added 1 commit
In GitLab by @jfields7 on Jun 26, 2024, 10:59
Sure, I think that makes sense. I've replaced the `MPI_Allreduce` with an `MPI_Bcast` from rank 0. Perlmutter is down for maintenance at the moment, but an MPI test on CPUs suggested it works the same as before.
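As a rough, MPI-free sketch of the difference between the two approaches (hypothetical helper names; the real change lives in `Driver::UpdateWallClock()`): both collectives leave every rank holding the same `elapsed_time`, which is all the termination check needs — the broadcast just uses one root measurement instead of a reduction over all ranks.

```python
# MPI-free analogue of the two synchronization strategies discussed above.
# Each list element stands for one rank's locally measured wall time.

def allreduce_max(local_times):
    # Analogue of MPI_Allreduce(..., MPI_MAX, ...): every rank receives
    # the maximum time measured on any rank.
    m = max(local_times)
    return [m] * len(local_times)

def bcast_from_root(local_times, root=0):
    # Analogue of MPI_Bcast from rank 0: every rank receives rank 0's
    # time, ignoring its own measurement.
    return [local_times[root]] * len(local_times)

times = [30.02, 30.01, 29.98]  # here rank 0 measured the longest time

# Either way, all ranks agree on a single elapsed_time, so the while
# loop in Driver::Execute terminates (or continues) on every rank.
print(allreduce_max(times))    # [30.02, 30.02, 30.02]
print(bcast_from_root(times))  # [30.02, 30.02, 30.02]
```

The two only differ if rank 0 measures a noticeably smaller time than some other rank, which the discussion above argues is unlikely.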
In GitLab by @jmstone216 on Jun 26, 2024, 12:06
approved this merge request
In GitLab by @jmstone216 on Jun 26, 2024, 12:06
mentioned in commit ccaf7ffdad25c5da593b2a8ba51552c949a8ab44
In GitLab by @jfields7 on Jun 25, 2024, 16:25
Merges wall-clock-fix -> master
There's been a consistent problem with the code not always successfully quitting when the wall clock timer runs out, particularly if output or restart files are needed. This is frustrating for large runs, especially if the last checkpoint is several hours before the end of the run.
On further investigation, the issue appears to be related to how the wall clock timer is updated inside `Driver::Execute`: each MPI rank has its own wall clock, and there is no guarantee (a) that these timers are initialized at exactly the same time or (b) that they are updated at the same time. Though the times should be approximately equal, it is entirely possible for one or more ranks to measure a shorter wall time than the rest, especially if initialization takes somewhat longer on some ranks. Consequently, the `while` loop may terminate on some ranks but not on others, leaving the code hanging indefinitely until the process is killed.

This merge request adds a new function, `Driver::UpdateWallClock()`, which measures the time and calls `MPI_Allreduce` to ensure that the latest time measured across all ranks is assigned to `elapsed_time`.

Admittedly, there is no way to prove conclusively that this fixes the stalling issue, but I ran five trials of a torus test designed to run for 30 minutes on 2 nodes on Perlmutter. The single trial without the fix stalled and had to be killed 15 minutes after the wall clock limit should have triggered a shutdown. The four trials with the fix quit successfully less than a minute after the limit was reached. I'm happy to run more tests if desired.
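To make the failure mode concrete, here is a small MPI-free simulation (hypothetical code, not the actual AthenaK implementation) of why per-rank timers can deadlock the loop while a shared maximum cannot:

```python
# Simulate the while-loop termination check in Driver::Execute for a
# handful of ranks, with and without synchronizing the elapsed time.

def ranks_exiting(elapsed_per_rank, limit, synchronize):
    """Return the set of ranks whose loop condition says 'stop'."""
    if synchronize:
        # Analogue of MPI_Allreduce with MPI_MAX: every rank sees the
        # same (latest) elapsed time.
        shared = max(elapsed_per_rank)
        elapsed_per_rank = [shared] * len(elapsed_per_rank)
    return {r for r, t in enumerate(elapsed_per_rank) if t >= limit}

# Rank 2 initialized late, so its timer reads slightly less than the rest.
elapsed = [30.01, 30.02, 29.98, 30.03]  # minutes
limit = 30.0

print(ranks_exiting(elapsed, limit, synchronize=False))  # {0, 1, 3}
print(ranks_exiting(elapsed, limit, synchronize=True))   # {0, 1, 2, 3}
```

Without synchronization, rank 2 keeps iterating while the others exit, so any collective call inside the loop blocks forever — the hang described above. With the shared maximum, every rank makes the same decision.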