I really like the idea here, but there's a fundamental challenge. As MPI is specified, we can't actually be confident that anything after a call to MPI_Finalize actually runs. I'll take a deeper look at whether there's a nicer way we can do this.
Ok, I've looked at this a little bit more, and here's my tentative suggestion: rather than having all of the non-0 processes finalize and exit after the catchment simulation is complete, have them all call MPI_Ibarrier, followed by a loop of MPI_Test on the Ibarrier request and sleep for a decent time slice (e.g. 100 ms) until it's complete. Rank 0 would only enter the barrier after completing routing.
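For illustration only, a minimal sketch of that pattern (the function name and the 100 ms interval are placeholders, not ngen's actual code):

```cpp
#include <mpi.h>
#include <chrono>
#include <thread>

// Non-blocking barrier plus a sleepy poll loop: a rank that is done with its
// catchments waits here without pinning a core the way a busy-polling
// MPI_Barrier does. Rank 0 would only call this after routing has completed.
void wait_at_barrier(MPI_Comm comm)
{
    MPI_Request request;
    MPI_Ibarrier(comm, &request);

    int complete = 0;
    while (!complete) {
        MPI_Test(&request, &complete, MPI_STATUS_IGNORE);
        if (!complete) {
            // Yield the core for a decent time slice between polls.
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }
}
```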
Is that something you want to try to implement? If not, I can find some time to work up an alternative patch. In the latter case, do you mind if I push it to your branch for this PR?
Ultimately, we expect to more deeply integrate t-route with parallelization directly into ngen with BMI. Until that's implemented, though, it's pretty reasonable to find expedient ways to improve performance.
Yeah I'll give it a go :) I might not get to it for another day or so but I'm happy to do it
This version works, in so far as the waiting MPI processes no longer max out the CPU, but it's tricky to say what I'm actually measuring in terms of performance since this went through https://github.com/NOAA-OWP/t-route/pull/795. I need to run a test on some data with more than 25 timesteps, as the difference with and without this fix is <1s (~9s vs ~10s).
I also wasn't sure whether manager->finalize(); should go before the barrier or after it. If the routing were using MPI, would we need to wait for all ranks to reach the post-routing barrier before calling finalize?
Testing this on some longer runs (~6500 catchments, CFE + Noah-OWP, for 6 months) shows that the speedup isn't as dramatic now that the routing has that multiprocessing change in it.
Finished routing
NGen top-level timings:
NGen::init: 21.621
NGen::simulation: 537.511
NGen::routing: 837.387
real 23m17.312s
user 1836m41.492s
sys 102m31.267s
NGen::init: 21.6584
NGen::simulation: 529.272
NGen::routing: 783.689
real 22m15.342s
user 640m16.918s
sys 164m18.954s
In summary: it makes routing ~1.3x-1.4x faster. Apologies for the inconsistent performance testing; I was using a terrible t-route configuration. I've adjusted it some more to run as fast as I can get it, then reran some larger tests.
- t-route run standalone in NGIAB (python -m nwm_routing manually)
- t-route run via ngen in NGIAB using ngen master
- t-route run via ngen in NGIAB using this PR
2024-09-04 23:27:24,582 - root - INFO - [__main__.py:340 - main_v04]: ************ TIMING SUMMARY ************
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:341 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:342 - main_v04]: Network graph construction: 20.63 secs, 17.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:349 - main_v04]: Forcing array construction: 29.57 secs, 24.55 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:356 - main_v04]: Routing computations: 59.15 secs, 49.09 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:363 - main_v04]: Output writing: 10.99 secs, 9.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:370 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:371 - main_v04]: Total execution time: 120.33999999999999 secs
Finished routing
NGen top-level timings:
NGen::init: 103.262
NGen::simulation: 234.654
NGen::routing: 120.578
and that init time can be reduced to 26 seconds with this: https://github.com/NOAA-OWP/ngen/compare/master...JoshCu:ngen:open_files_first_ask_questions_later
Recording this here for reference: I've been doing other testing, and the first MPI barrier after model execution, before the routing, also has the same issue. On a single machine, when a rank completes its simulation it polls as fast as possible, pinning that core at 100%, which results in my machine lowering the CPU clock speed. It's only slower because of the reduction in clock speed, not because other processes are unable to use the cores (as with t-route), so it's a <5% slowdown. It's a marginal difference on a desktop, and I'm guessing this wouldn't be a problem on servers with fixed-clock-speed CPUs. Might save some electricity though!
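Assuming that pre-routing synchronization point is a plain MPI_Barrier (I haven't confirmed that against the code), the same Ibarrier/Test/sleep pattern sketched earlier could stand in for it so idle ranks sleep between polls instead of spinning, e.g.:

```cpp
#include <mpi.h>
#include <chrono>
#include <thread>

// Illustrative stand-in for the post-simulation, pre-routing barrier.
// In the setup described above, the blocking barrier busy-polls and pins
// the waiting core at 100%; this version checks a non-blocking barrier
// and sleeps between checks.
void pre_routing_sync(MPI_Comm comm)
{
    MPI_Request request;
    MPI_Ibarrier(comm, &request);
    int complete = 0;
    while (!complete) {
        MPI_Test(&request, &complete, MPI_STATUS_IGNORE);
        if (!complete)
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}
```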
Improve ngen-parallel Performance with T-route
Problem Statement
When running ngen-parallel with -n near or equal to the core count, T-route's performance is severely degraded. T-route doesn't parallelize well with MPI, and moving the finalize step to before the routing frees up CPU resources for T-route to use different parallelization strategies.
Changes Made
The t-route change is semi-related: it converts a for loop that consumes the majority of t-route's execution time to use multiprocessing. That performance improvement doesn't work while the MPI wait is consuming every CPU cycle.
Performance Results
Testing setup: ~6500 catchments, 24 timesteps, dual Xeon 28C 56T total (same as #843)
The 96-process run was performed on a different 96-core machine; the 56-process runs were all performed on a 56-core machine.
Explanation
Future Work
Testing
Performance Visualization
Perf flamegraph of the entire ngen run (unpatched), should be interactive if downloaded
Additional Notes
Added a return 0; right after finalize for mpi_rank != 0. I thought finalize would kill/return the subprocesses, but without explicitly returning them I got segfaults. Recommendations seem to be that as little as possible should be performed after calling finalize, but seeing as all computation after that point is done by one rank, can I just return from the others?
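A minimal sketch of the shape being described, assuming a simplified main(); the routing call and the manager comment are placeholders, not ngen's actual code:

```cpp
#include <mpi.h>

// Simplified sketch only: placeholder names, not ngen's real main().
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int mpi_rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

    // ... catchment simulation on every rank, plus whatever post-simulation
    //     synchronization the branch uses (blocking or Ibarrier-based) ...

    if (mpi_rank != 0) {
        // manager->finalize();  // per-rank cleanup, as discussed above
        MPI_Finalize();
        // Return immediately so these ranks never fall through into the
        // routing/output code below; falling through is what produced the
        // segfaults, and the MPI standard guarantees very little about
        // code that runs after MPI_Finalize anyway.
        return 0;
    }

    // Rank 0 only: run t-route routing and write outputs, then finalize.
    // run_routing();  // placeholder
    MPI_Finalize();
    return 0;
}
```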
Next Steps