NOAA-OWP / t-route

Tree-based hydrologic and hydraulic routing

speed up MC calculations by rearranging space/time sequence #526

Open awlostowski-noaa opened 2 years ago

awlostowski-noaa commented 2 years ago

The current configuration of MC reach conducts sequential routing calculations inside three nested loops. The outermost loop iterates over timesteps, the intermediate loop marches through each reach of the network (upstream to downstream), and the innermost loop iterates over each segment in the reach. This loop nest is written in Cython. Inside it, a Fortran subroutine is called to compute a single-segment, single-timestep routing calculation.

Here is a cartoon of what happens in mc_reach.pyx

t = 1
while t <= nsteps:
    for r in network_reaches:
        for s in reach_segments:
            muskingum(params)  # Fortran kernel: one segment, one timestep
    t += 1

I think we can make this faster by re-organizing the loop structure and pushing the time and segment iterations to Fortran. Here is a cartoon of a prototype implementation I am trying out in a development version of mc_reach:

for r in network_reaches:

    # muskingum_new is a Fortran subroutine that computes routing for
    # ALL segments in a single reach for ALL timesteps of the simulation.
    muskingum_new(nsteps, reach_segments, params)

This approach will drastically reduce the number of times we need to traverse the Cython-Fortran interface - from once per segment per timestep to once per reach - by doing more looping in Fortran and less in Cython.

Initial testing is showing promise. A 1,000-timestep MC channel-only simulation on a network with 11,141 segments takes 32 seconds with the existing implementation (triple-nested loop) and 29 seconds with the newly proposed structure. Not a huge speed-up, but certainly a nice proof of concept that justifies further investigation. Moreover, the prototype of the new approach produces bit-for-bit the same answer as the existing approach.

One difficulty associated with implementing the newly proposed structure is passing arrays of data from Cython to Fortran. For example, we now need to pass a 2D array of lateral inflow data (nsegs x nsteps) from Cython to Fortran, as opposed to the single floating-point value in the existing implementation. Similarly, we now need to pass 1D arrays of channel parameters (nsegs), rather than single floating-point values. It took me a while to understand that C and Fortran have different ways of representing arrays in memory (row-major versus column-major), but I was able to figure all of that out. What I am struggling with now is optimizing my code so that I pass arrays back and forth in the most efficient way possible. I think some careful optimization can yield far greater boosts in efficiency, but that lies at or beyond the limits of my knowledge. Looking for help @groutr @hellkite500 @jameshalgren!
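To make the layout issue concrete, here is a minimal, self-contained illustration (array shapes here are arbitrary stand-ins, not t-route's actual data):

import numpy as np

# NumPy allocates C (row-major) order by default, while Fortran expects
# column-major, so a (nsegs x nsteps) array must be Fortran-contiguous
# before it crosses the interface.
nsegs, nsteps = 4, 10
qlat_c = np.zeros((nsegs, nsteps))            # C-ordered by default
print(qlat_c.flags["F_CONTIGUOUS"])           # False

# Either allocate in Fortran order up front...
qlat_f = np.zeros((nsegs, nsteps), order="F")
# ...or convert; asfortranarray copies only if the layout actually differs.
qlat_f2 = np.asfortranarray(qlat_c)
print(qlat_f.flags["F_CONTIGUOUS"], qlat_f2.flags["F_CONTIGUOUS"])  # True True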

Profiling shows that the vast majority of time spent in mc_reach.compute_network_structured is within the compute_reach_kernel function, which is good, because that is where all the physics action is happening. So, if we can optimize the structure of that function in my development branch, then it will be a boon for overall t-route MC performance. Here is how I've re-written that function for the newly proposed implementation.
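In rough outline, the restructured call looks something like this (an illustrative sketch with stand-in names and a dummy kernel body, not the actual branch code):

import numpy as np

def compute_reach_kernel(nsteps, qlat, params, out):
    # qlat, out: (nseg, nsteps) Fortran-ordered arrays
    # params:    (nseg,) per-segment channel parameters
    # The real kernel hands these off to the Fortran routine; this
    # stand-in just fills the buffer so the sketch runs.
    out[:, :] = qlat * params[:, None]

nseg, nsteps = 3, 5                                # toy reach dimensions
qlat = np.asfortranarray(np.ones((nseg, nsteps)))
params = np.array([0.5, 1.0, 2.0])
out = np.empty((nseg, nsteps), order="F")
compute_reach_kernel(nsteps, qlat, params, out)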

awlostowski-noaa commented 2 years ago

@BrianAvant-NOAA FYI

jameshalgren commented 2 years ago

@awlostowski-noaa First, let me restate how grateful and impressed I am with how you have carried this forward. If any of the following seems unhelpful, let me know.

At this point, with so much to shoehorn into the operational deployment schedule, it will matter to know just how much juice we are trying to squeeze out of this optimization. How much do we need to win back and by when? Do the optimizations need to apply in one particular mode or another?

For instance, if we can gain 5-10% from this optimization but have to completely scramble the code layout to get it figured out in a week or two, it might not be worth it, especially if the target is a 25%+ improvement.

We might consider porting more of the code to native C, and we might see about switching to an optimized compiler... has anyone successfully run T-route with the Intel compiler? Do we have access to pgcc to try?

awlostowski-noaa commented 2 years ago

@jameshalgren I don't see this as completely scrambling the code layout. The basics of the layout are preserved; we are just rethinking the looping structure. Moreover, we are already on the other side of that hill, insofar as I've already figured out how to pass arrays to the compute kernels and get the correct answers back. Also, we are already seeing a ~10% bump in performance, which is definitely worth pursuing further. I am not concerned about this derailing ongoing development - the changes are constrained to mc_reach and the MC-related modules called from there, so we can tinker in a dev branch without disrupting testing and other development fronts.

I will reiterate that while I am getting the correct answers and seeing some increase in performance, I think we can make my implementation more efficient. For example, I can't quite figure out how to build a q lateral array in such a way that, when passed to the Fortran compute kernel, it has the correct memory ordering. In light of this, I use a rather hacky approach that transposes the q lateral array depending on the number of segments in the reach:

import numpy as np

if nseg == 1:
    # single-segment reach: take a transposed view, which is already
    # Fortran-contiguous (no copy is made)
    lateral_flows_F = lateral_flows[:nseg, :].T
else:
    # multi-segment reach: force a Fortran-ordered copy of the slice
    lateral_flows_F = np.asfortranarray(lateral_flows[:nseg, :])

Surely there is a better way to do this that doesn't require creating a new variable and transposing an existing one.

Similarly, I can't quite figure out how to write results directly to the output buffer from the compute kernel. As it stands, I need to create a new output variable with Fortran-ordered memory to capture the return data, then copy those data over to the output buffer. All of that nonsense happens in here, specifically with the creation and use of the variable out.
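One idea I want to try, sketched here under an assumed buffer shape (not something I've wired into the branch yet): if the output buffer is C-ordered with time as the leading dimension, its transpose is a Fortran-contiguous view, so the kernel could write into it directly with no intermediate copy.

import numpy as np

# Sketch only, assuming a C-ordered output buffer shaped (nsteps, nseg);
# the real output buffer in mc_reach may be laid out differently.
nseg, nsteps = 3, 5
buf = np.zeros((nsteps, nseg))      # C-ordered output buffer
buf_T = buf.T                       # (nseg, nsteps) Fortran-contiguous view
assert buf_T.flags["F_CONTIGUOUS"]  # no copy was made

buf_T[:, :] = 1.0                   # a kernel writing here...
print(buf[0, :])                    # ...lands directly in the buffer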

jameshalgren commented 2 years ago

> How much do we need to win back and by when? Do the optimizations need to apply in one particular mode or another?

awlostowski-noaa commented 2 years ago

In terms of winning back - I suppose we need to think about performance relative to WRF-Hydro. I'll need to check with NCAR on how long a WRF-Hydro channel-only run takes. I recall t-route being 10-20% slower than WRF-Hydro, which means that this sort of optimization could be key.

If we focus optimizations on the MC reach computations, then they would apply to all parallel modes. Also, MC reaches make up the majority of any network.

EDIT

A 1-year CONUS channel-only WRF-Hydro run takes 4.5 hours. A 1-year CONUS channel-only t-route simulation takes 22 hours. It appears that we are not even in the ballpark in terms of competing with WRF-Hydro. It is hard to say how this time breaks down between I/O and compute. However, any gain in performance is always useful.

groutr commented 2 years ago

@awlostowski-noaa https://numpy.org/doc/stable/reference/generated/numpy.require.html#numpy.require might be a nicer way to accomplish the result of your if-statement.
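For example, a minimal sketch of that suggestion (shapes are stand-ins for the real arrays; note it normalizes the layout without the shape transpose of the nseg == 1 branch):

import numpy as np

# np.require returns the input unchanged if it already satisfies the
# requirements and copies only when it doesn't, replacing the if/else
# above with a single call.
nseg, nsteps = 4, 10
lateral_flows = np.zeros((nseg + 2, nsteps))
lateral_flows_F = np.require(lateral_flows[:nseg, :],
                             dtype=np.float64,
                             requirements=["F"])
assert lateral_flows_F.flags["F_CONTIGUOUS"]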