T-route MC solutions do not match bit-for-bit across parallel settings

awlostowski-noaa commented 2 years ago

By-subnetwork parallel simulations do not produce bit-for-bit identical results when compared to by-network parallel or serial simulations. Moreover, changing the subnetwork_target_size setting across by-subnetwork parallel computations produces different answers. The lack of bit-for-bit parity across internal computational settings is concerning because 1) it points to a deeply rooted memory leakage and 2) we cannot optimize subnetwork_target_size for algorithm performance without suffering small changes in the final answer.

All of our testing indicates that this issue is somehow related to the sequencing of computations among subnetworks. We can produce bit-for-bit matching between serial, by-network, and by-subnetwork simulations if we set subnetwork_target_size to be so large that no subnetworks are actually created.

So far, all of our testing indicates that differences in routed flows are imperceptible when hydrographs are plotted. Additional testing is needed to see if this holds up at CONUS and regional scales.

Current behavior

Very small floating-point differences between t-route solutions generated by by-subnetwork parallel and serial computations.
Errors only arise in reaches downstream of two "offnetwork-upstream" reaches.
At first, errors manifest as small differences in velocity and/or depth solutions at a single timestep and segment. Spatially and temporally subsequent solutions are then affected because one segment/timestep's erroneous solution becomes another's erroneous input.
We can completely mitigate the error by setting the subnetwork_target_size to be large enough that no subnetworks are created, making the computation sequence logically identical to serial or by-network. This is evidence that network breaking is somehow at the core of the issue.
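
One way to locate the first diverging timestep and segment, as described above, is to scan two sets of results in time order and stop at the first value that is not bitwise identical. The sketch below is only an illustration: `first_divergence` is a hypothetical helper, the nested-list layout (timesteps, then segments) is an assumption rather than t-route's actual output structure, and the sample velocities are made up apart from the final pair, which is taken from the log in this issue.

```python
def first_divergence(run_a, run_b):
    """Return (timestep, segment, a, b) for the first exact mismatch, else None.

    Both arguments are lists of timesteps, each a list of per-segment floats
    (a hypothetical layout chosen for this sketch).
    """
    for t, (row_a, row_b) in enumerate(zip(run_a, run_b)):
        for s, (a, b) in enumerate(zip(row_a, row_b)):
            # float.hex() represents a finite value exactly, so this is an
            # exact comparison rather than a tolerance test.
            if a.hex() != b.hex():
                return t, s, a, b
    return None


# Illustrative velocities: rows are timesteps, columns are segments.
serial = [[0.2170, 0.2175], [0.2176, 0.21768188]]
by_subnetwork = [[0.2170, 0.2175], [0.2176, 0.21768187]]

print(first_divergence(serial, by_subnetwork))
```

Everything after the reported location may already be contaminated by the first mismatch, so only that first hit is informative for debugging.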

Here are some logging results from inside of mc_reach.pyx. They show the inputs and returns from two calls to compute_reach_kernel, one by-subnetwork and one serial. The specific timestep and segment location of these results were cherry-picked to isolate the first instance, in a small example network, of the solutions diverging.


# the results below are from printf(%.8g) statements within mc_reach.pyx

************************************ 
***** BY-SUBNETWORK *****
************************************ 
***** INPUTS *****
q_lateral: 0.020014783
initial flow: 0.062396437
initial depth: 0.099372163
upstream flows: 0.045564443
previous upstream flows: 0.045564443
***** RETURNS *****
flow: 0.06365142
velocity: 0.21768187 <---------------
depth: 0.10055685    <---------------

==========================================

************************************ 
***** SERIAL *****
************************************ 
***** INPUTS *****
q_lateral: 0.020014783
initial flow: 0.062396437
initial depth: 0.099372163
upstream flows: 0.045564443
previous upstream flows: 0.045564443
***** RETURNS *****
flow: 0.06365142
velocity: 0.21768188 <---------------
depth: 0.10055684    <---------------



## Expected behavior
Bit-for-bit matching solutions across all parallel schemes and configurations.

ping @groutr @hellkite500 @donaldwj

donaldwj commented 2 years ago

Because of how floating-point numbers work, you can get slight changes like this if you change the order in which they are accumulated.

For example

S1 = (F1 + ... + F1000)

will not always equal

S2 = (F1 + ... + F500) + (F501 + ... + F1000)

even though the same 1000 floats are summed. I assume the creation of subnetworks changes which partial sums are calculated? If so, this could be the cause of the observed behavior.
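
A minimal Python sketch of this effect, using ten copies of 0.1 in place of the 1000 floats. The values and the `naive_sum` helper are illustrative only and are not taken from t-route. An explicit loop is used instead of the built-in `sum`, because recent Python versions apply a compensated algorithm to float sums.

```python
def naive_sum(values):
    """Accumulate left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1] * 10  # stand-ins for F1..F10

# S1: a single running sum over all values.
s1 = naive_sum(values)

# S2: two partial sums that are combined afterwards.
s2 = naive_sum(values[:5]) + naive_sum(values[5:])

print(repr(s1), repr(s2))
print("S1 == S2:", s1 == s2)
```

The two results differ only in the last bit, which is the same order of disagreement as the velocity and depth values in the log.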

donaldwj commented 2 years ago

In general, you must choose a tolerance (delta) for comparing floating-point numbers, where two numbers are treated as equal if

|S1 - S2| < delta
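
As a rough sketch, a tolerance check of this kind can be applied to the velocity and depth values from the log. The relative tolerance of 1e-6 is an arbitrary choice for illustration. The ULP count additionally assumes the kernel works in single precision, which this thread does not confirm; `float32_bits` and `ulp_distance32` are hypothetical helpers written for this example.

```python
import math
import struct


def float32_bits(x):
    """Round x to single precision and return its bit pattern as an integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def ulp_distance32(a, b):
    """Number of float32 steps separating two positive values."""
    return abs(float32_bits(a) - float32_bits(b))


# Values copied from the log: (by-subnetwork, serial).
pairs = {
    "velocity": (0.21768187, 0.21768188),
    "depth": (0.10055685, 0.10055684),
}

REL_TOL = 1e-6  # hypothetical tolerance, chosen only for illustration

for name, (subnet, serial) in pairs.items():
    close = math.isclose(subnet, serial, rel_tol=REL_TOL)
    ulps = ulp_distance32(subnet, serial)
    print(name, "within tolerance:", close, "float32 steps apart:", ulps)
```

A relative tolerance is used here because flows, velocities, and depths can span several orders of magnitude, and a single absolute delta would be too strict for large values or too loose for small ones.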

jameshalgren commented 2 years ago

@donaldwj Good comments. Algebraically equivalent computations are not necessarily floating-point equivalent. The key concern here is that, for (apparently) exactly the same inputs, the same deterministic algorithm is producing slightly different results. Notwithstanding the caveats mentioned here, I think one could expect that the algorithm should do exactly the same thing in each of these cases.
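
One caveat on "apparently": inputs that print identically at eight significant digits are not guaranteed to be bitwise identical. The sketch below uses Python doubles as a stand-in, since the thread does not say whether the kernel variables are single or double precision. It takes one of the logged input values and its nearest neighbor and shows that `%.8g` cannot tell them apart, while a round-trip format or a hexadecimal dump can.

```python
import math

a = 0.045564443             # an upstream flow value as printed in the log
b = math.nextafter(a, 1.0)  # the next representable double above a (Python 3.9+)

print("%.8g" % a, "%.8g" % b)    # eight digits: the two values look identical
print("%.17g" % a, "%.17g" % b)  # 17 significant digits separate any two doubles
print(a.hex(), b.hex())          # hexadecimal form is exact
```

For single-precision values, nine significant digits are enough to round-trip; in C, `printf` with `%a` prints the exact hexadecimal form.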

NOAA-OWP / t-route

T-route MC solutions do not match bit-for-bit across parallel settings #504
