hellkite500 closed this issue 10 months ago.
Good catch. Q is hard-coded as an array of length 2500 in the Fortran code.
The C program for Topmodel includes:
fscanf(input_fptr,"%d %lf",nstep,dt);
where that line of the input file is:
950 1.0
Seems like a bug in the BMI init function. Q should not be allocated until we know nstep.
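As a rough sketch of that idea (illustrative only; the function name and error handling are made up, but input_fptr, nstep, dt, and Q follow the snippet quoted above), the allocation could simply be deferred until after the fscanf:

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: read nstep and dt first (the "950 1.0" line above), then size Q
 * from nstep instead of allocating it before nstep is known. */
static double *alloc_Q_after_nstep(FILE *input_fptr, int *nstep, double *dt)
{
    if (fscanf(input_fptr, "%d %lf", nstep, dt) != 2) {
        fprintf(stderr, "failed to read nstep and dt\n");
        return NULL;
    }
    /* +1 because topmodel indexes these arrays from 1. */
    double *Q = calloc((size_t)*nstep + 1, sizeof(double));
    if (Q == NULL)
        fprintf(stderr, "failed to allocate Q for %d steps\n", *nstep);
    return Q;
}
```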
Even then, there needs to be some validation of the relationship between nstep and the user-defined values in distance_from_outlet, which derive num_delay and num_time_delay_histo_ords; otherwise an input could always be crafted that causes this buffer overflow.
Yes. I had a version that I was looking at the other day where I checked something like this:
if(num_time_delay_histo_ords>10) { printf("num_time_delay_histo_ords>10. Keith hard coded the maximum as 10. Change your input parameters and try again.\n"); exit(9); }
The best solution, of course, would be to allocate the array *tch after calculating num_time_delay_histo_ords. Ben?
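For what it's worth, allocating the routing arrays only once the ordinate counts are known might look roughly like this sketch (not the actual init code; the function name is made up, and it assumes num_delay and num_time_delay_histo_ords have already been computed):

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: size the routing arrays from the computed ordinate counts instead
 * of a hard-coded maximum.  The +1 accounts for 1-based indexing. */
static int alloc_routing_arrays(int num_delay, int num_time_delay_histo_ords,
                                double **Q, double **time_delay_histogram)
{
    *Q = calloc((size_t)num_delay + (size_t)num_time_delay_histo_ords + 1,
                sizeof(double));
    *time_delay_histogram =
        calloc((size_t)num_time_delay_histo_ords + 1, sizeof(double));
    if (*Q == NULL || *time_delay_histogram == NULL) {
        printf("failed to allocate routing arrays\n");
        return 9;
    }
    return 0;
}
```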
Looking at the code, I see that the number of values provided for distance_from_outlet should be equal to the number of channels (see https://github.com/Ben-Choat/topmodel/blob/MemAlloc/src/topmodel.c#L602).
I've added a few lines to enforce this relationship and error out if it is not met. Next I plan on identifying all arrays whose size ultimately depends on num_channels and allocating those arrays within the refactored init() functions as the size becomes known.
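Such a guard could be as simple as checking the fscanf return value while reading the per-channel records, along these lines (a sketch only; the pairing with a cumulative-area column, and the names cum_area and subcat_fptr, are assumptions about the subcat file layout):

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: require exactly num_channels dist_from_outlet values and error
 * out if the file runs short.  Catching extra values would additionally
 * require knowing which record follows in the file. */
static void read_channel_records(FILE *subcat_fptr, int num_channels,
                                 double *cum_area, double *dist_from_outlet)
{
    for (int j = 1; j <= num_channels; j++) {
        if (fscanf(subcat_fptr, "%lf %lf",
                   &cum_area[j], &dist_from_outlet[j]) != 2) {
            printf("expected %d dist_from_outlet values but only read %d; "
                   "check num_channels and the subcat file.\n",
                   num_channels, j - 1);
            exit(9);
        }
    }
}
```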
@hellkite500, looking here (https://github.com/Ben-Choat/topmodel/blob/MemAlloc/src/topmodel.c#L749) you can see that the length of Q must equal int(num_time_delay_histo_ords) + int(num_delay). Both of those variables are floats, and the loops are defined using <= num_delay. Q is assigned for however many ordinates are associated with num_delay, then additional values are assigned for however many ordinates are associated with num_time_delay_histo_ords.
Conceptually, num_delay captures the discharge delay related to the first or 'main' channel. num_time_delay_histo_ords captures the discharge delay associated with the other channels. It is quite possible (and I'm observing) that num_delay = 0 in instances where dist_from_outlet[1] is small relative to channel velocity.
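Given that size relationship, a minimal runtime check could at least turn a silent overflow into a clean error. A sketch, where q_len is a hypothetical variable holding whatever length Q was actually allocated with:

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: verify Q is long enough for both routing loops before they run.
 * num_delay may legitimately be 0 (see above), so only the sum matters;
 * the +1 accounts for 1-based indexing. */
static void check_Q_length(int q_len, double num_delay,
                           double num_time_delay_histo_ords)
{
    int required = (int)num_delay + (int)num_time_delay_histo_ords + 1;
    if (q_len < required) {
        printf("Q holds %d values but routing needs %d "
               "(num_delay=%d, num_time_delay_histo_ords=%d)\n",
               q_len, required, (int)num_delay,
               (int)num_time_delay_histo_ords);
        exit(9);
    }
}
```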
Just documenting an observation. After enforcing that the number of values provided for dist_from_outlet is equal to num_channels, I observed an interesting and related behavior, which is also present on the unedited master branch. When the number of values provided for dist_from_outlet == num_channels, the program breaks when freeing Q in the Finalize() function. However, if the number of values provided is larger or smaller than num_channels, it completes as expected.
This keeps getting weirder:
To test whether any element of Q beyond Q[1] is actually needed, I commented out several lines of code that assign values to Q[>1] and time_delay_histogram[>1]. I also commented out lines that calculate values such as a1, a2, and sumar, which also never appear to be used other than for printing information about the time_delay_histogram, which itself is never used except for its first element.
Tests confirmed this to be the case. I've only tested one setup, but the tests suggest that, at each time step, no value of Q except Q[1] ever needs to be calculated. The same is true of time_delay_histogram: only time_delay_histogram[1] is ever used.
So we could actually allocate Q as a 1-element array (or 2 if you count index 0) and just never assign the extra unused values to it. This is very unsatisfying, but editing the code to make use of Q[>1] would fundamentally change output from topmodel, which I believe is beyond the scope of what we want to do within NGEN.
I compared output from the code here (https://github.com/Ben-Choat/topmodel/blob/testQ/src/topmodel.c) against output from the master branch.
Editing the code so that only the first element of Q (i.e., Q[1]) is used fixed the memory-allocation issue related to Q, and the results match those from the master branch. I can now edit the chv variable however I'd like without error.
It also corrected the error I documented a couple of comments up related to freeing Q in the Finalize() function.
This feels a bit hacky, but it solves the issue. I'm wondering if any of the newer versions of topmodel corrected how Q is calculated; I see there was a dynamic-topmodel release and a version for R.
I would think that after discharge for the next timestep is calculated, the leftover discharge from the previous timestep would be added to it, but this never occurs.
I'm starting to see what is going on here. Since the total water is temporally disaggregated based on the unit hydrograph, for any given time step t you have to be able to account for t + num_ordinate times, so that the delayed water from t is available at t+1.
If you don't do this, then I think there is a mass balance bug here, as you observed. We will lose the delayed contribution at future time steps.
I think I can fix this appropriately, but I need to do a little sketching of the system to find the correct way to represent a single time step of Q as multiple time steps of Q.
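For reference, a conventional way to carry the delayed water forward is to convolve each step's routed inflow into a rolling buffer whose head is the current step's discharge. The sketch below is illustrative only; the names are made up, and topmodel's existing arrays would need to be mapped onto it:

```c
/* Sketch: spread each step's inflow over the next num_ordinates steps using
 * the unit-hydrograph weights, so delayed water from step t is still
 * available at t+1, t+2, ... */
typedef struct {
    int     num_ordinates;  /* length of the unit hydrograph        */
    double *uh_weights;     /* ordinate weights, summing to 1.0     */
    double *delayed;        /* rolling buffer of future discharge   */
} Router;

double route_one_step(Router *r, double inflow)
{
    /* Distribute this step's inflow across current and future ordinates. */
    for (int i = 0; i < r->num_ordinates; i++)
        r->delayed[i] += inflow * r->uh_weights[i];

    /* The head of the buffer is this step's discharge. */
    double q_now = r->delayed[0];

    /* Shift the buffer so the remaining delayed water is seen next step. */
    for (int i = 0; i < r->num_ordinates - 1; i++)
        r->delayed[i] = r->delayed[i + 1];
    r->delayed[r->num_ordinates - 1] = 0.0;

    return q_now;
}
```

Whether something like this, or writing a single step of Q into multiple future elements of the existing array, is the better fit is exactly the sketching mentioned above.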
That is exactly how I'd interpret things as well. I'm wondering about the implications of changing output from topmodel. For example, if someone compared output from our corrected version against the uncorrected version, which has been used for decades, the discharge results would differ.
I'm happy to give it a go if we want to move in that direction.
A couple of notes/thoughts:
The following loop will cause out of bound writes on Q when initialized via BMI: https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L642
BMI allocates Q to hold nstep doubles, which it sets to nstep=1 just prior to this allocation: https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/bmi_topmodel.c#L282
However, the topmodel init function iterates and writes to Q num_delay times, where num_delay is determined by the value of tch (https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L577), which is computed from the distance_from_outlet and chvdt parameters (https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L564C34-L564C34).
The distance_from_outlet is a dynamic input read from a file: https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L496C69-L496C69
The allocation of Q MUST be the same size as num_delay.
Similarly, and perhaps contradictorily, the next loop in init iterates over num_time_delay_histo_ords (https://github.com/NOAA-OWP/topmodel/blob/9452b7b391a1ea7ad9e079ebde05135b12fe0527/src/topmodel.c#L647) and writes to Q for each of these iterations, with a computed index offset of (*num_delay)+i.
This would imply that num_time_delay_histo_ords <= num_delay is a necessary (but perhaps not sufficient) condition, otherwise you would have the same potential overwrite problem. This relationship isn't clear, and definitely isn't validated or enforced, as num_time_delay_histo_ords is also computed from tch, which as noted above is a dynamic input read from an input file.
So even if Q were allocated to be num_delay in size, it would be quite possible to overwrite the buffer by crafting an input to dist_from_outlet that allowed num_time_delay_histo_ords to be sufficiently larger than num_delay.
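To make the write pattern explicit, assuming the loop structure described above (this is a paraphrase, not the literal topmodel code): the first loop touches Q[1..num_delay], and the second touches Q[num_delay+1 .. num_delay+num_time_delay_histo_ords] via the (*num_delay)+i offset, so the smallest safe allocation is:

```c
#include <stddef.h>

/* Sketch of the minimum safe length for Q implied by the two loops above.
 * The +1 is because index 0 exists but is never written (1-based indexing),
 * and the ordinate counts are stored as floating point, hence the casts. */
static size_t min_Q_length(double num_delay, double num_time_delay_histo_ords)
{
    return (size_t)num_delay + (size_t)num_time_delay_histo_ords + 1;
}
```

Anything shorter than this, including the nstep=1 allocation made by the BMI init, can be overrun by a suitable combination of distance_from_outlet, chv, and dt.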