NOAA-OWP / topmodel

Extending TOPMODEL, a rainfall-runoff model, to BMI (Basic Model Interface).

Buffer overwrite in topmodel initialization #36

Closed hellkite500 closed 10 months ago

hellkite500 commented 1 year ago

The following loop will cause out of bound writes on Q when initialized via BMI:

https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L642

BMI allocates Q to hold nstep doubles, and it sets nstep=1 just prior to this allocation:

https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/bmi_topmodel.c#L282

However, the topmodel init function iterates and writes to Q num_delay times, where num_delay is determined by the value of tch

https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L577

which is computed from the distance_from_outlet and the chvdt parameters.

https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L564C34-L564C34

The distance_from_outlet is a dynamic input read from a file

https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L496C69-L496C69

The allocation of Q MUST therefore be at least num_delay elements in size.

Similarly, and perhaps contradictorily, the next loop in init iterates over num_time_delay_histo_ords (https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L647) and writes to Q on each of those iterations at a computed index of (*num_delay)+i.

This would imply that num_time_delay_histo_ords <= num_delay is a necessary (but perhaps not sufficient) condition; otherwise you would have the same potential overwrite problem. This relationship isn't clear, and it definitely isn't validated or enforced, as num_time_delay_histo_ords is also computed from tch, which, as noted above, is derived from a dynamic input read from a file.

So even if Q were allocated to be num_delay in size, it would still be quite possible to overwrite the buffer by crafting a distance_from_outlet input that makes num_time_delay_histo_ords sufficiently larger than num_delay.
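A minimal standalone sketch of the size mismatch described above (illustrative values and function layout only, not the actual topmodel/BMI code):

    /* Sketch of the mismatch: Q is sized from nstep, but the init loops index by
       counts derived from user input. */
    #include <stdio.h>

    int main(void) {
        int nstep = 1;                     /* bmi_topmodel.c hard-codes nstep = 1 before allocating Q */
        int allocated = nstep;             /* per the text above, Q is sized to hold nstep doubles */

        /* both counts derive from tch, which comes from the user-supplied
           distance_from_outlet values (and chv, dt) */
        int num_delay = 5;
        int num_time_delay_histo_ords = 4;

        /* the init loops are 1-based and inclusive: loop 1 writes Q[1..num_delay],
           loop 2 writes Q[num_delay+1 .. num_delay+num_time_delay_histo_ords] */
        int highest_index = num_delay + num_time_delay_histo_ords;

        printf("Q holds %d doubles, but init writes Q[1]..Q[%d]\n", allocated, highest_index);
        if (highest_index >= allocated)
            printf("=> out-of-bounds writes on Q\n");
        return 0;
    }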

fred-ogden commented 1 year ago

Good catch. Q is hard-coded as an array of length 2500 in the Fortran code.

fred-ogden commented 1 year ago

The C program for Topmodel includes:

fscanf(input_fptr,"%d %lf",nstep,dt);

where that line of the input file is:

950 1.0

Seems like a bug in the BMI init function. Q should not be allocated until we know nstep.
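A rough sketch of that ordering, assuming the same "nstep dt" input line is available to the BMI init; the function name and error handling are made up for the sketch, not the actual bmi_topmodel.c code:

    #include <stdio.h>
    #include <stdlib.h>

    /* Read the "nstep dt" line first, then size Q from it. */
    int read_nstep_and_alloc_Q(FILE *input_fptr, int *nstep, double *dt, double **Q) {
        if (fscanf(input_fptr, "%d %lf", nstep, dt) != 2)
            return -1;                                      /* malformed line; expected e.g. "950 1.0" */
        *Q = calloc((size_t)*nstep + 1, sizeof(double));    /* +1 leaves index 0 spare for 1-based use */
        return (*Q != NULL) ? 0 : -1;
    }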

hellkite500 commented 1 year ago

> Seems like a bug in the BMI init function. Q should not be allocated until we know nstep.

Even then, there needs to be some validation of the relationship between nstep and the user-defined values in distance_from_outlet from which num_delay and num_time_delay_histo_ords are derived; otherwise an input could always be crafted that causes this buffer overflow.
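For example, a check along these lines could run after num_delay and num_time_delay_histo_ords are computed but before the init loops write to Q. This is a sketch only: the function name is made up and it assumes Q is sized for indices 0..nstep:

    #include <stdio.h>

    /* Hypothetical guard against the derived counts exceeding the Q allocation. */
    int validate_routing_sizes(int nstep, int num_delay, int num_time_delay_histo_ords) {
        int highest_index = num_delay + num_time_delay_histo_ords;  /* last Q index the init loops write */
        if (highest_index > nstep) {
            fprintf(stderr,
                    "num_delay (%d) + num_time_delay_histo_ords (%d) = %d exceeds nstep (%d); "
                    "check distance_from_outlet, chv, and dt\n",
                    num_delay, num_time_delay_histo_ords, highest_index, nstep);
            return -1;
        }
        return 0;
    }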

fred-ogden commented 1 year ago

Yes. I had a version that I was looking at the other day where I checked something like this:

if(num_time_delay_histo_ords>10) { printf("num_time_delay_histo_ords>10. Keith hard coded the maximum as 10. Change your input parameters and try again.\n"); exit(9); }

The best solution, of course, would be to allocate the array *tch after calculating num_time_delay_histo_ords. Ben?
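One shape that allocate-once-the-size-is-known approach could take, as a sketch; the helper name is made up, and it assumes num_delay and num_time_delay_histo_ords have already been derived from tch:

    #include <stdlib.h>

    /* Hypothetical helper: allocate the routing arrays only after the counts are
       known, sized to cover every index the init loops write (1-based). */
    int alloc_routing_arrays(int num_delay, int num_time_delay_histo_ords,
                             double **Q, double **time_delay_histogram) {
        *Q = calloc((size_t)num_delay + (size_t)num_time_delay_histo_ords + 1, sizeof(double));
        *time_delay_histogram = calloc((size_t)num_time_delay_histo_ords + 1, sizeof(double));
        if (*Q == NULL || *time_delay_histogram == NULL) {
            free(*Q);
            free(*time_delay_histogram);
            *Q = *time_delay_histogram = NULL;
            return -1;
        }
        return 0;
    }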

Ben-Choat commented 1 year ago

Looking at the code, I see that the number of values provided for distance_from_outlet should be equal to the number of channels (see here: https://github.com/Ben-Choat/topmodel/blob/MemAlloc/src/topmodel.c#L602).

I've added a few lines to enforce this relationship and error out if it is not met. Next I plan on identifying all arrays where the array size is ultimately dependent on num_channels, and allocating those arrays within the refactored init() functions as the size becomes known.
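A sketch of that kind of check (illustrative only; it assumes dist_from_outlet is already allocated with at least num_channels+1 elements for 1-based use, and the function name is made up):

    #include <stdio.h>

    /* Hypothetical reader: expects exactly num_channels values and errors out if
       the file holds fewer. */
    int read_dist_from_outlet(FILE *fp, int num_channels, double *dist_from_outlet) {
        for (int j = 1; j <= num_channels; j++) {
            if (fscanf(fp, "%lf", &dist_from_outlet[j]) != 1) {
                fprintf(stderr, "expected %d dist_from_outlet values but only read %d\n",
                        num_channels, j - 1);
                return -1;
            }
        }
        return 0;
    }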

Ben-Choat commented 1 year ago

@hellkite500, looking here (https://github.com/Ben-Choat/topmodel/blob/MemAlloc/src/topmodel.c#L749) you can see that the size of Q must equal int(num_time_delay_histo_ords) + int(num_delay). Both of those variables are floats, and the loops use inclusive bounds (e.g., <= num_delay). Q is assigned for however many ordinates are associated with num_delay, and then additional values are assigned for however many ordinates are associated with num_time_delay_histo_ords.

Conceptually, num_delay captures the discharge delay related to the first or 'main' channel. num_time_delay_histo_ords captures the discharge delay associated with the other channels. It is quite possible (and I'm observing) that num_delay = 0 in instances where dist_from_outlet[1] is small relative to channel velocity.
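Putting illustrative numbers on the sizing described above (made-up counts, just to show the inclusive, 1-based indexing):

    #include <stdio.h>

    int main(void) {
        double num_delay = 3.0;                  /* floats in topmodel; truncated when used as indices */
        double num_time_delay_histo_ords = 4.0;
        /* loop 1 writes Q[1]..Q[3]; loop 2 writes Q[3+1]..Q[3+4], i.e. Q[4]..Q[7] */
        int highest = (int)num_delay + (int)num_time_delay_histo_ords;
        printf("highest Q index touched: %d, so Q needs at least %d doubles (index 0 unused)\n",
               highest, highest + 1);
        return 0;
    }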

Ben-Choat commented 1 year ago

Just documenting an observation. After enforcing that the number of values provided for dist_from_outlet is equal to num_channels, I observed an interesting and related behavior. This behavior is also present on the unedited master branch. When the number of values provided for dist_from_outlet == num_channels, the program breaks when freeing Q in the Finalize() function. However, if the number of values provided is larger or smaller than num_channels, it completes as expected.

Ben-Choat commented 1 year ago

This keeps getting weirder:

To test whether any element of Q beyond the first is actually needed, I commented out several lines of code that assign values to Q[>1] and time_delay_histogram[>1]. I also commented out lines that calculated values such as a1, a2, and sumar, which also never appear to be used, other than for printing information about the time_delay_histogram, which is never used except for its first element.

Tests confirmed this to be the case. I've only tested on one setup, but the tests suggest that indeed, at each time step, no value of Q except for Q[1] ever needs to be calculated. The same is true of time_delay_histogram; only time_delay_histogram[1] is ever used.

So, we could actually allocate Q as a 1-element array (or 2 elements if you count index 0) and just never assign the extra, unused values to it. This is very unsatisfying, but editing the code to actually use values of Q[>1] would fundamentally change output from topmodel, which I believe is beyond the scope of what we want to do within NGEN.

I compared the code seen here (https://github.com/Ben-Choat/topmodel/blob/testQ/src/topmodel.c) with the master branch output.
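A minimal sketch of the workaround described above (illustrative only, not taken from the linked branch):

    #include <stdlib.h>

    /* Hypothetical minimal allocation: index 0 unused, Q[1] holds the step's discharge. */
    double *alloc_Q_minimal(void) {
        return calloc(2, sizeof(double));
    }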

Ben-Choat commented 1 year ago

Editing the code so that only the first element of Q (i.e., Q[1]) is used fixed the memory allocation issue related to Q, and the results match those from the master branch. I can now edit the chv variable however I'd like without error.

It also corrected the error I documented a couple of comments up related to freeing Q in the Finalize() function.

This feels a bit hacky, but it solves the issue. I'm wondering if any of the newer versions of topmodel corrected how Q is calculated. I see there was a dynamic-topmodel release and a version for R.

hellkite500 commented 1 year ago

I would think that after discharge for the next timestep is calculated, the leftover discharge from the previous timestep would be added to it, but this never occurs.

I'm starting to see what is going on here. Since the total water is temporally disaggregated based on the unit hydrograph, for any given time step t you have to be able to account for t+num_ordinate times so that the delayed water from t is available at t+1.

If you don't do this, then I think there is a mass balance bug here, as you observed. We will lose the delayed contribution at future time steps.

I think I can fix this appropriately, but I need to do a little sketching of the system and find the correct way to represent a single time step of Q as multiple time steps of Q.
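One way to carry those delayed contributions forward is a small rolling buffer of future discharge, with each step's routed runoff convolved into it using the unit-hydrograph ordinates. This is a sketch of the idea under assumed names and a fixed maximum ordinate count, not a drop-in fix for topmodel.c:

    #include <string.h>

    #define MAX_ORDS 64                 /* assumed upper bound, just for the sketch */

    typedef struct {
        int    num_ords;                /* number of unit-hydrograph ordinates (<= MAX_ORDS) */
        double ords[MAX_ORDS];          /* ordinates; they should sum to 1 so mass is conserved */
        double future[MAX_ORDS];        /* future[0] is the discharge released at the current step */
    } delay_buffer;

    /* Spread this step's routed runoff over the coming steps, return the discharge
       due now, then shift the buffer forward one step so delayed water reappears later. */
    double route_one_step(delay_buffer *b, double runoff) {
        for (int i = 0; i < b->num_ords; i++)
            b->future[i] += runoff * b->ords[i];
        double q_now = b->future[0];
        memmove(b->future, b->future + 1, (MAX_ORDS - 1) * sizeof(double));
        b->future[MAX_ORDS - 1] = 0.0;
        return q_now;
    }

As long as the ordinates sum to 1, every unit of runoff fed in at step t is returned in full over the following num_ords steps, so nothing is lost.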

Ben-Choat commented 1 year ago

That is exactly how I'd interpret things as well. I'm wondering about the implications of changing output from topmodel. For example, if someone compared output from our corrected version against the uncorrected version, which has been used for decades, the discharge results would be different.

I'm happy to give it a go if we want to move in that direction.

A couple of notes/thoughts: