Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License
511 stars 263 forks source link

Mandatory relaxed coordinate bound checking in 4.7.4? #1951

Closed ellenjohnson closed 2 years ago

ellenjohnson commented 3 years ago

software version: netcdf-c-4.7.4 OS: linux, gcc

Hi everyone! I just upgraded to 4.7.4 and see relaxed coordinate bound checking strikes again! I first noticed this with test failures in our test suite. I see you made the relaxed coordinate bound checking mandatory for all builds( https://github.com/Unidata/netcdf-c/issues/1519, https://github.com/Unidata/netcdf-c/pull/1561, and https://github.com/Unidata/netcdf-c/issues/1010).

We got around this when we upgraded from 4.6.1 to 4.7.3 by adding the flag -DENABLE_ZERO_LENGTH_COORD_BOUND=OFF when building netcdf. Ward recommended this, and it preserved our existing behavior in MATLAB – if users try to read data from an empty variable with an unlimited dimension, we catch the NC_EINVALCOORDS (-40) error, and we return an empty variable.

With 4.7.4 we no longer get the NC_EINVALCOORDS error, and return a 0x1 empty variable instead of a 0x0 empty variable. Looks like the build flag is no longer being honored in the CMake files. This is a compatibility issue for our users, as they may have legacy functions where they’re checking for an empty variable instead of a 0x1 empty variable. This happens whether we create a classic file or a netcdf-4 classic file.

Questions:

  1. I read through the comments in the related issues. Is coordinate bound checking only an issue if we use pnetCDF? Right now we don’t (but might in the future).
  2. Is there an easy way to still opt out of this (as we did with the build flag)? And even if so, what are the implications if we continue to opt out of the relaxed coordinate bound checking?
  3. I wrote a C program to verify nc_get_var_double is now returning zero status. But I'm puzzled it retrieves the empty variable as 0.0 – how does a user distinguish between whether the variable is indeed empty (no data was written to it), or whether it contains the actual 0.0 value that was added via nc_put_var?

Here's my C program. It creates a netcdf classic file with two dimensions: time (unlimited), and lon (fixed, size 10). It creates two variables: time (double) with one dimension (the unlimited time dimension), and x (int) with two dimensions (time and lon). I'm fuzzy on how to create the return variables for capturing output from reading from an empty variable with unlimited dimension, hopefully I got it right.

#include <stdio.h>
#include <string.h>
#include <netcdf.h>

#define FILENAME "testCoordBounds.nc"

int main()
{
    int ncid;
    int timeDimId;
    int lonDimId; 
    int timeVarId, xVarId, status;
    int timeRank = 1;
    int xRank = 2;
    int timeDims[timeRank]; // 1-dimension (rank = 1)
    int xDims[xRank]; // 2-dimensional (rank = 2)
    double timeVals[1];        
    int xVals[1][10];        

    // create netcdf classic file
    status = nc_create(FILENAME, NC_CLOBBER, &ncid);
    if(status != NC_NOERR) {
        printf("Could not create file.\n");
    }

    // define the time dimension (unlimited) 
    status = nc_def_dim(ncid, "time", NC_UNLIMITED, &timeDimId);
    if(status != NC_NOERR) {
        printf("Could not create timeDim.\n");
    }  

    // define the lon dimension (fixed) 
    status = nc_def_dim(ncid, "lon", 10, &lonDimId);
    if(status != NC_NOERR) {
        printf("Could not create lonDim.\n");
    }  

    // create time variable with one dimension - time, unlimited
    timeDims[0] = timeDimId;
    status = nc_def_var(ncid, "time", NC_DOUBLE, 1, timeDims, &timeVarId);
    if(status != NC_NOERR) {
        printf("Could not create timeVar.\n");
    }  

    // create the x variable with two dimensions - time (unlimited), and lon (fixed, 10) 
    xDims[0] = timeDimId;
    xDims[1] = lonDimId;

    status = nc_def_var(ncid, "x", NC_INT, 2, xDims, &xVarId);
    if(status != NC_NOERR) {
        printf("Could not create xVar.\n");
    }  

    // close the file 
    status = nc_close(ncid);
    printf("status after close = %d\n", status);

    // open file
    status = nc_open(FILENAME, NC_NOWRITE, &ncid);
    if(status != NC_NOERR) {
        printf("Could not open file.\n");
    }  

    // get the time variable id
    status = nc_inq_varid(ncid, "time", &timeVarId); 
    if(status != NC_NOERR) {
        printf("Could not get time id.\n");
    }  

    // get the x variable id
    status = nc_inq_varid(ncid, "x", &xVarId); 
    if(status != NC_NOERR) {
        printf("Could not get x id.\n");
    }  

    // get the time variable
    status = nc_get_var_double(ncid, timeVarId, &timeVals[0]); 
    if(status != NC_NOERR) {
        printf("Could not get time variable\n");
    }  
    printf("time variable = %f\n", timeVals[0]);

    // get the x variable
    status = nc_get_var_int(ncid, xVarId, &xVals[0][0]); 
    if(status != NC_NOERR) {
        printf("Could not get x variable\n");
    }  
    printf("x variable element 0 = %d\n", xVals[0]);
    printf("x variable element 1 = %d\n", xVals[1]);

    // close the file 
    status = nc_close(ncid);
    printf("status after close = %d\n", status);
    printf("End of test.\n\n");

    return 0;
}

And the output: status after close = 0 time variable = 0.000000 x variable element 0 = 1824139184 x variable element 1 = 1824139224 status after close = 0 End of test.

I'm also puzzled by why the value for the double variable (time) is 0.00, but the x variable (int) is garbage, unless that's a compiler thing.

Thank you!

edwardhartnett commented 3 years ago

THis will be more clear if you also provide an ncdump of the file.

Basically what seems to be the confusion here is the use of nc_get_var_double() instead of, for example, nc_get_vara_double(), in which you would see more clearly what is happening.

nc_get_var_double() gets the whole contents of the variable (i.e. all the data). But in the case of the NC_UNLIMITED data, what happens when you do an nc_get_var_double() before writing any data? There are zero records, so the nc_get_var_double() does nothing and returns NC_NOERR.

Similarly when you do nc_get_var_int() on the other variable. There is no data, so nothing in the file to read. Since you used nc_get_var_int(), netcdf-c tries to read all the data. It sees there is no data, so it does nothing and tells you it has read what you asked for, which is nothing. ;-)

The relaxation coord bound checking was a bug in the handling of parallel I/O (which may involve pnetcdf or HDF5.) When I originally wrote the code, I did not account for the fact that asking for zero items in a parallel I/O read is a perfectly valid read. This was fixed with an option in configure which turned the fix on and off for different builds, which was incorrect.

In your code above, what do you expect to happen that is not happening? If you were to try using nc_getvara() or nc_getvar1() would would get a more clear error, because instead of asking for "all the data" which could be no data, you are asking for a specific range of data. In this case, you will get an error because there is no data in the file.

Using nc_getvar*() functions with record data involves the trickiness of what happens when there is zero data records.

ellenjohnson commented 3 years ago

Hi Ed, thanks for your reply! Here's the ncdump output of the file:

ncdump testCoordBounds.nc
netcdf testCoordBounds {
dimensions:
        time = UNLIMITED ; // (0 currently)
        lon = 10 ;
variables:
        double time(time) ;
        int x(time, lon) ;
data:
}

I tried nc_get_vara_double as you suggested:

static size_t start[] = {0};
static size_t count[] = {1};
// stuff
status = nc_get_vara_double(ncid, timeVarId, start, count, &timeVals[0]);

And indeed I see the difference between the NC_NOERROR (0) when using nc_get_var_double and NC_EINVALCOORD (-40) with nc_get_vara_double.

=== get time variable using nc_get_var_double ===
0 return status after get time variable using nc_get_var_double!
time variable = 0.000000
=== get time variable using nc_get_vara_double ===
-40 return status after get time variable using nc_get_vara_double!
time variable = 0.000000

What you said about how nc_get_var_double returns NC_NOERR when there is no data to retrieve make sense. Plus you answered my question about what I expect to see happen by suggesting I compare the return status using var versus vara. (I was looking more at the value of the returned data of 0.0, but I guess there's no way to distinguish between an uninitialized double variable to hold the return data, versus retrieving a true value of 0.0 that was written to the file -- unless you examine the return value of the function). But I'm puzzled as to why getting a range of data (vara) returns the error -- there still is no data to retrieve (in the range), so there's nothing to read. Wouldn't you want to return NC_NOERR and return nothing as in the case of var*?

As for MATLAB, I'll update our code to handle the new mandatory relaxed coordinate bound feature, and I'll remove the CMake build flag -DENABLE_ZERO_LENGTH_COORD_BOUND=OFF since it's being ignored now during the build process.

Thanks!

edwardhartnett commented 2 years ago

I believe this issue can be closed.