I really like the idea here, but there's a fundamental challenge. As MPI is specified, we can't actually be confident that anything after a call to MPI_Finalize actually runs. I'll take a deeper look at whether there's a nicer way we can do this.
Ok, I've looked at this a little bit more, and here's my tentative suggestion: rather than having all of the non-0 processes finalize and exit after the catchment simulation is complete, have them all call MPI_Ibarrier, followed by a loop of MPI_Test on the Ibarrier request and sleep for a decent time slice (e.g. 100 ms) until it's complete. Rank 0 would only enter the barrier after completing routing.
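For illustration only, a minimal sketch of that pattern (the function name and the 100 ms interval are placeholders, not ngen's actual code):

```cpp
#include <mpi.h>
#include <chrono>
#include <thread>

// Non-blocking barrier plus a sleepy poll loop: a rank that is done with its
// catchments waits here without pinning a core the way a busy-polling
// MPI_Barrier does. Rank 0 would only call this after routing has completed.
void wait_at_barrier(MPI_Comm comm)
{
    MPI_Request request;
    MPI_Ibarrier(comm, &request);

    int complete = 0;
    while (!complete) {
        MPI_Test(&request, &complete, MPI_STATUS_IGNORE);
        if (!complete) {
            // Yield the core for a decent time slice between polls.
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }
}
```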
Is that something you want to try to implement? If not, I can find some time to work up an alternative patch. In the latter case, do you mind if I push it to your branch for this PR?
Ultimately, we expect to more deeply integrate t-route with parallelization directly into ngen with BMI. Until that's implemented, though, it's pretty reasonable to find expedient ways to improve performance.
Yeah I'll give it a go :) I might not get to it for another day or so but I'm happy to do it
This version works, in so far as the waiting MPI processes no longer max out the CPU, but it's tricky to say what I'm actually measuring in terms of performance since this went through https://github.com/NOAA-OWP/t-route/pull/795. I need to run a test on some data with more than 25 timesteps, as the difference with and without this fix is <1s (~9s vs ~10s).
I also wasn't sure whether manager->finalize(); should go before the barrier or after it. If the routing were using MPI, would we need to wait for all ranks to reach the post-routing barrier before calling finalize?
Testing this on some longer runs (~6500 catchments, CFE + Noah-OWP, for 6 months) shows that the speedup isn't as dramatic now that the routing has that multiprocessing change in it.
Finished routing
NGen top-level timings:
NGen::init: 21.621
NGen::simulation: 537.511
NGen::routing: 837.387
real 23m17.312s
user 1836m41.492s
sys 102m31.267s
NGen::init: 21.6584
NGen::simulation: 529.272
NGen::routing: 783.689
real 22m15.342s
user 640m16.918s
sys 164m18.954s
In summary: it makes routing ~1.3x-1.4x faster. Apologies for the inconsistent performance testing; I was using a terrible t-route configuration. I've adjusted it some more to run as fast as I can get it, then reran some larger tests.
- t-route run standalone in NGIAB (python -m nwm_routing manually)
- t-route run via ngen in NGIAB using ngen master
- t-route run via ngen in NGIAB using this PR
2024-09-04 23:27:24,582 - root - INFO - [__main__.py:340 - main_v04]: ************ TIMING SUMMARY ************
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:341 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:342 - main_v04]: Network graph construction: 20.63 secs, 17.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:349 - main_v04]: Forcing array construction: 29.57 secs, 24.55 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:356 - main_v04]: Routing computations: 59.15 secs, 49.09 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:363 - main_v04]: Output writing: 10.99 secs, 9.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:370 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:371 - main_v04]: Total execution time: 120.33999999999999 secs
Finished routing
NGen top-level timings:
NGen::init: 103.262
NGen::simulation: 234.654
NGen::routing: 120.578
and that init time can be reduced to 26 seconds with this: https://github.com/NOAA-OWP/ngen/compare/master...JoshCu:ngen:open_files_first_ask_questions_later
Recording this here for reference: I've been doing other testing, and the first MPI barrier after model execution, before the routing, also has the same issue. On a single machine, when a rank completes its simulation it polls as fast as possible, pinning that core at 100%, which results in my machine lowering the CPU clock speed. It's only slower because of the reduction in clock speed, not because other processes are unable to use the cores (as with t-route), so it's a <5% slowdown. It's a marginal difference on a desktop, and I'm guessing this wouldn't be a problem on servers with fixed-clock-speed CPUs. Might save some electricity though!
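Assuming that pre-routing synchronization point is a plain MPI_Barrier (I haven't confirmed that against the code), the same Ibarrier/Test/sleep pattern sketched earlier could stand in for it so idle ranks sleep between polls instead of spinning, e.g.:

```cpp
#include <mpi.h>
#include <chrono>
#include <thread>

// Illustrative stand-in for the post-simulation, pre-routing barrier.
// In the setup described above, the blocking barrier busy-polls and pins
// the waiting core at 100%; this version checks a non-blocking barrier
// and sleeps between checks.
void pre_routing_sync(MPI_Comm comm)
{
    MPI_Request request;
    MPI_Ibarrier(comm, &request);
    int complete = 0;
    while (!complete) {
        MPI_Test(&request, &complete, MPI_STATUS_IGNORE);
        if (!complete)
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}
```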
Improve ngen-parallel Performance with T-route
Problem Statement
When running ngen-parallel with -n near or equal to the core count, T-route's performance is severely degraded. T-route doesn't parallelize well with MPI, and moving the finalize step to before the routing frees up CPU resources for T-route to use different parallelization strategies.
Changes Made
The t-route change is semi-related: it converts a for loop that consumes the majority of t-route's execution time to use multiprocessing. That performance improvement doesn't work while the MPI wait is consuming every CPU cycle.
Performance Results
Testing setup: ~6500 catchments, 24 timesteps, dual Xeon 28C 56T total (same as #843)
The 96-process run was performed on a different 96-core machine; the 56-process runs were all performed on a 56-core machine.
Explanation
Future Work
Testing
Performance Visualization
Perf flamegraph of the entire ngen run (unpatched), should be interactive if downloaded
Additional Notes
Added a return 0; right after finalize for mpi_rank != 0. I thought finalize would kill/return the subprocesses, but without explicitly returning them I got segfaults. Recommendations seem to be that as little as possible should be performed after calling finalize, but seeing as all computation after that point is done by one rank, can I just return from the others?
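A minimal sketch of the shape being described, assuming a simplified main(); the routing call and the manager comment are placeholders, not ngen's actual code:

```cpp
#include <mpi.h>

// Simplified sketch only: placeholder names, not ngen's real main().
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int mpi_rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

    // ... catchment simulation on every rank, plus whatever post-simulation
    //     synchronization the branch uses (blocking or Ibarrier-based) ...

    if (mpi_rank != 0) {
        // manager->finalize();  // per-rank cleanup, as discussed above
        MPI_Finalize();
        // Return immediately so these ranks never fall through into the
        // routing/output code below; falling through is what produced the
        // segfaults, and the MPI standard guarantees very little about
        // code that runs after MPI_Finalize anyway.
        return 0;
    }

    // Rank 0 only: run t-route routing and write outputs, then finalize.
    // run_routing();  // placeholder
    MPI_Finalize();
    return 0;
}
```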
Next Steps