NOAA-OWP / t-route

Tree based hydrologic and hydraulic routing

T-route MC solutions do not match bit-for-bit across parallel settings #504

Open awlostowski-noaa opened 2 years ago

awlostowski-noaa commented 2 years ago

By-subnetwork parallel simulations do not produce bit-for-bit identical results when compared to by-network parallel or serial simulations. Moreover, changing the subnetwork_target_size setting changes the answers produced by by-subnetwork parallel computations. The lack of bit-for-bit parity across internal computational settings is concerning because 1) it points to a deeply rooted memory leakage and 2) we cannot optimize subnetwork_target_size for algorithm performance without suffering small changes in the final answer.

All of our testing points to this issue being related to how computations are sequenced among subnetworks. We can produce bit-for-bit matching between serial, by-network, and by-subnetwork simulations if we set subnetwork_target_size so large that no subnetworks are actually created.

So far, all of our testing indicates that differences in routed flows are imperceptible when hydrographs are plotted. Additional testing is needed to see if this holds up at CONUS and regional scales.

Current behavior

==========================================
SERIAL

INPUTS
    q_lateral: 0.020014783
    initial flow: 0.062396437
    initial depth: 0.099372163
    upstream flows: 0.045564443
    previous upstream flows: 0.045564443

RETURNS
    flow: 0.06365142
    velocity: 0.21768188 <----------------
    depth: 0.10055684 <----------------

Expected behavior

Bit-for-bit matching solutions across all parallel schemes and configurations.

ping @groutr @hellkite500 @donaldwj

donaldwj commented 2 years ago

Because of how floating-point numbers work, you can get slight changes like this if you change the order in which they are accumulated.

For example

S1 = (F1 + ... + F1000)

will not always equal

S2 = (F1 + ... + F500) + (F501 + ... + F1000)

even though the same 1000 floats are summed. I assume the creation of subnetworks changes which partial sums are calculated? If so, this could be the cause of the observed behavior.
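
As a minimal sketch of the point above (not t-route code), the following sums the same four floats once in a single pass and once as two partial sums. The helper names `sequential_sum` and `chunked_sum` and the input values are hypothetical, chosen only to make the rounding difference easy to see.

```python
def sequential_sum(values):
    """Add the values left to right in a single pass."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Sum each chunk on its own, then add up the partial sums."""
    total = 0.0
    for start in range(0, len(values), chunk_size):
        total += sequential_sum(values[start:start + chunk_size])
    return total


# Illustrative values only; these are not taken from a t-route run.
values = [1.0, 1e16, -1e16, 1.0]

s1 = sequential_sum(values)   # ((1.0 + 1e16) + -1e16) + 1.0
s2 = chunked_sum(values, 2)   # (1.0 + 1e16) + (-1e16 + 1.0)

print("single pass :", s1)
print("two chunks  :", s2)
print("bit-for-bit :", s1 == s2)
```

With IEEE 754 double precision, the single pass gives 1.0 and the two-chunk grouping gives 0.0, while the exact sum is 2.0. Real routing values would differ far less, but the mechanism is the same: regrouping the additions changes where rounding happens.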

donaldwj commented 2 years ago

In general, you must choose a tolerance (delta) for comparing floating-point numbers, where two numbers are treated as equal if

|S1 - S2| < delta
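
A tolerance check of this kind could be sketched as follows. The function name `within_delta`, the value of `delta`, and the second flow value are all assumptions for illustration; only the serial flow comes from the output posted in this issue.

```python
def within_delta(s1, s2, delta):
    """Return True when s1 and s2 differ by less than delta in absolute value."""
    return abs(s1 - s2) < delta


serial_flow = 0.06365142     # flow reported by the serial run in this issue
parallel_flow = 0.06365143   # hypothetical by-subnetwork result, not measured
delta = 1e-6                 # assumed tolerance; pick one suited to the units

print("bit-for-bit equal:", serial_flow == parallel_flow)
print("equal within delta:", within_delta(serial_flow, parallel_flow, delta))
```

Taking the absolute value matters: without it, any result where S1 is smaller than S2 would pass regardless of how far apart the two are. Python's standard library also provides `math.isclose`, which supports relative as well as absolute tolerances.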

jameshalgren commented 2 years ago

@donaldwj Good comments. Algebraically equivalent computations are not necessarily floating-point equivalent. The key concern here is that, for (apparently) exactly the same inputs, the same deterministic algorithm is producing slightly different results. Notwithstanding the caveats mentioned here, I think one could expect the algorithm to do exactly the same thing in each of these cases.