errors (and memory over-writing?) in gsw_geo_strf_dyn_height

efiring commented 7 years ago

The following test program is inspired by https://github.com/TEOS-10/GSW-Python/issues/21, in which erroneous results, and sometimes memory corruption, are appearing in the GSW-Python. The test case is very simple: uniform salinity and temperature, and uniform pressure increments. In the C test program below, supply the number of points, the minimum pressure, and the pressure increment as command-line arguments. Different combinations yield different oddities. Here is an example (on OS X):

(test) efiring@manini2:~/work/programs/TEOS-10/github/c_tests$ ./gsw_test 10 0 0.5
pressure, salinity, temp., dyn height
 0.00  35.0  10.0     0.000
 0.50  35.0  10.0    -0.006
 1.00  35.0  10.0    -0.012
 1.50  35.0  10.0    -0.018
 2.00  35.0  10.0    -0.024
 2.50  35.0  10.0    -0.030
 3.00  35.0  10.0    -0.036
 3.50  35.0  10.0    -0.043
 4.00  35.0  10.0    -0.049
 4.50  35.0  10.0    -0.049

--the end--

The anomaly here is that the last two dynamic heights are identical. Now, here is more obvious problem:

(test) efiring@manini2:~/work/programs/TEOS-10/github/c_tests$ ./gsw_test 10 1 0.1
pressure, salinity, temp., dyn height
 1.00  35.0  10.0    -0.012
 1.10  35.0  10.0    -0.013
 1.20  35.0  10.0    -0.015
 1.30  35.0  10.0    -4.630
 1.40  35.0  10.0    -4.630
 1.50  35.0  10.0    -4.630
 1.60  35.0  10.0    -4.630
 1.70  35.0  10.0    -4.224
 1.80  35.0  10.0    -4.630
 1.90  35.0  10.0    -4.630

And here is the test code:

#include <stdio.h>
#include <stdlib.h>
#include "gswteos-10.h"

int main(int argc, char **argv)
{
    double *sa, *ct, *p, *dh;
    int n;
    double p_ref=0.0;
    double min_p, del_p;
    int i;

    if (argc != 4)
    {
        printf("Usage: gsw_test np min_p del_p\n");
        printf("(number of pressures, minimum pressure, pressure increment)\n");
        exit(0);
    }
    n = atoi(argv[1]);
    min_p = atof(argv[2]);
    del_p = atof(argv[3]);

    sa = malloc(n*sizeof(double));
    ct = malloc(n*sizeof(double));
    p = malloc(n*sizeof(double));
    dh = malloc(n*sizeof(double));

    for (i=0; i<n; i++)
    {
        sa[i] = 35;
        ct[i] = 10;
        p[i] = i * del_p + min_p;
        dh[i] = 0.0/0.0;  /* NaN; ensure the initial value doesn't matter */
    }

    dh = gsw_geo_strf_dyn_height(sa, ct, p, p_ref, n, dh);

    if (dh == NULL)
    {
        printf("dh is NULL\n");
        exit(-1);
    }
    printf("pressure, salinity, temp., dyn height\n");
    for (i=0; i<n; i++)
    {
        printf("%5.2f  %4.1f  %4.1f   %7.3f\n", p[i], sa[i], ct[i], dh[i]);
    }
    printf("\n--the end--\n");

    free(sa);
    free(ct);
    free(p);
    free(dh);
    return 0;
}

hylandg commented 7 years ago

Eric,

I believe this problem stems from the setting of the variable max_dp_i in gsw_geo_strf_dyn_height (which controls the interpolation resolution) ...

/*
!--------------------------------------------------------------------------
!  This max_dp_i is the limit we choose for the evaluation of specific
!  volume in the pressure integration.  That is, the vertical integration
!  of specific volume with respect to pressure is perfomed with the pressure
!  increment being no more than max_dp_i (the default value being 1 dbar).
!--------------------------------------------------------------------------
*/
        max_dp_i = 1.0;

In your examples p ranges over only a few dbar (not even 1 dbar in the 2nd example) - I'm not an oceanographer so I can't comment on whether the examples are realistic or sensible. But I'm sure if you reduced the value of max_dp_i to 0.1 or even 0.01 then you would get a valid result.

Not sure if you could setup gsw_geo_strf_dyn_height to adjust max_dp_i dynamically, or perhaps pass in an optional resolution argument that is controlled by the user.

efiring commented 7 years ago

We could fiddle with max_dp_i, but I think this would be papering over a fundamentally buggy chunk of code. It is complex and hard to follow, and clearly incorrect in its results. Consider two more examples, with output edited for brevity:

(test) efiring@manini2:~/work/programs/TEOS-10/github/c_tests$ ./gsw_test 100 1 0.49
pressure, salinity, temp., dyn height
 1.00  35.0  10.0    -0.012
 1.49  35.0  10.0    -0.018
 1.98  35.0  10.0    -0.024
 2.47  35.0  10.0    -0.030
 ...
48.04  35.0  10.0    -0.586
48.53  35.0  10.0    -0.592
49.02  35.0  10.0    -0.598
49.51  35.0  10.0       nan

--the end--

So with 100 pts at 0.49 m spacing, the last point is not being set and thus retains the nan with which I initialized the output array.

Now change the spacing to 0.48 m:

(test) efiring@manini2:~/work/programs/TEOS-10/github/c_tests$ ./gsw_test 100 1 0.48
pressure, salinity, temp., dyn height
 1.00  35.0  10.0   -133269548249424573629195078951908830507382728504350874455861918032857911116381028352.000
 1.48  35.0  10.0   -197238931409148353873638583244466784137397989455446116957874762065909119034222182400.000
 1.96  35.0  10.0   -197238931409148353873638583244466784137397989455446116957874762065909119034222182400.000
 2.44  35.0  10.0   -197238931409148353873638583244466784137397989455446116957874762065909119034222182400.000
 2.92  35.0  10.0   -197238931409148353873638583244466784137397989455446116957874762065909119034222182400.000
 3.40  35.0  10.0   -197238931409148353873638583244466784137397989455446116957874762065909119034222182400.000
 3.88  35.0  10.0   -197238931409148353873638583244466784137397989455446116957874762065909119034222182400.000
...

And so on, all the way to the end. And I could provide other pathological cases, with different behavior.

I haven't tried, it, but yes, I'm sure that with a different value of max_dp_i the result would be different--but the algorithm is still broken. If this is supposed to support arbitrary "bottle" spacing, including duplicates and near-duplicates, then I doubt there is any sane way to handle max_dp_i such that we can have confidence in the results. In very simple and quick visual checking I haven't found a pathology with dp > 1, so perhaps all the errors occur with smaller dp.

One more example:

(test) efiring@manini2:~/work/programs/TEOS-10/github/c_tests$ ./gsw_test 50 1 0.4
pressure, salinity, temp., dyn height
 1.00  35.0  10.0    -0.012
 1.40  35.0  10.0    -0.017
 1.80  35.0  10.0    -0.022
 2.20  35.0  10.0    -0.027
 2.60  35.0  10.0    -0.032
 ...
15.80  35.0  10.0    -0.192
16.20  35.0  10.0    -0.197
16.60  35.0  10.0    -0.202
17.00  35.0  10.0    -0.207
17.40  35.0  10.0    -2.789
17.80  35.0  10.0    -2.789
18.20  35.0  10.0    -2.789
18.60  35.0  10.0    -2.789
19.00  35.0  10.0    -2.789
19.40  35.0  10.0    -2.789
19.80  35.0  10.0    -2.789
20.20  35.0  10.0    -2.789
20.60  35.0  10.0    -2.789

--the end--

I haven't thought it through, but my suspicion is that the calculation needs to be simplified, and broken down into smaller, independent, testable steps. Maybe a little less clever and efficient, but comprehensible and correct. It is fundamentally a very simple calculation: numerical integration of specific volume anomaly with respect to pressure. The complexity comes from mixing this with the logically separate step of interpolating, and possibly extrapolating, the inputs, so that 1) pressure starts at 0 (which actually is not necessary unless p_ref=0), and 2) the integral can be evaluated at p_ref (so the value there can be subtracted out). 3) the numerical integration itself can be done on a finer grid than the original, or semi-analytically if splines are used, for example.

I suspect the interpolation step should be done first and independently, selecting an algorithm suitable to the dataset at hand (including the no-interpolation option, appropriate for most CTD data sets). Then the numerical integration step can be simple.

It's ripe for a complete re-design. Easy to say, hard to do.

efiring commented 7 years ago

Running ./gsw_test 100 1 0.48 on linux with -g -O0 I get a crash dump, and running with valgrind I get an invalid write followed by a slew of invalid reads. At least one of the clever index calculations is broken.

efiring commented 7 years ago

Even the simplest case, with the most direct code path, gsw_test 100 0 1, is wrong; valgrind shows it using an uninitialized value, and I see where that is. But that's the tip of the iceberg.

efiring commented 7 years ago

I'm making some progress, but I have a question: @PaulMBarker, what is the rationale for allowing p_ref < p_min? In this case you have to extrapolate, which is of course more dangerous than interpolating. Wouldn't it make sense to require the user to do that explicitly, with whatever method the user chooses? In other words, treat p_ref < p_min the same as p_ref > p_max. Then in a common case, say a CTD cast that starts at 4 dbar, you leave it to the user to supply an array in which the missing 2 dbar and 0 dbar values are present if the user really wants DH relative to the surface--which I think is rare, anyway.

PaulMBarker commented 7 years ago

I agree extrapolation is dangerous and I avoid it were ever possible. In this case I have included a surface extrapolation by setting all interpolated bottles that are shallower than the shallowest observed bottle to be equal to the shallowest bottle. What I put in there is a default, if the user wants to supply their own values they can by interpolating before sending the data to the programme.

The bulk of GSW was written for my own use and where possible we have tried to include best practice. We can only hold a users hand so far, and if they want to be that silly and set the reference value to be the surface then the responsibility lies with them.

dankelley commented 7 years ago

For a test case, I modified GSW-R (in a private copy, not pushed to github) to use Eric's new function named gsw_geo_strf_dyn_height_1 and got the following for the test values in the docs:

> library(gsw)
> SA <- c(34.7118, 34.8915, 35.0256, 34.8472, 34.7366, 34.7324)
> CT <- c(28.8099, 28.4392, 22.7862, 10.2262,  6.8272,  4.3236)
> p <- c(      10,      50,     125,     250,     600,    1000)
> p_ref <- 1000
> dh <- gsw_geo_strf_dyn_height(SA, CT, p, p_ref)
> dh
[1] 17.22678019811 14.86748072258 11.19092193877  7.90345487941
[5]  3.60817474406  0.00000000000

PS. the expected results are as follows (12 digit setting for display).

[1] 17.03920455777 14.66585378472 10.91286113692  7.56792883877
[5]  3.39352405557  0.00000000000

efiring commented 6 years ago

This is largely addressed by #23 (which includes and expands upon #20); if that is accepted, the additional step to take at some point will be deleting or completely replacing the original gsw_geo_strf_dyn_height.

efiring commented 6 years ago

Closed by #23. Adding check values, and possibly removing gsw_geo_strf_dyn_height can be done in another PR.

TEOS-10 / GSW-C

errors (and memory over-writing?) in gsw_geo_strf_dyn_height #19