Need to clarify a bit. Is the CDF5 file being accessed using the NC_PNETCDF mode flag or the NC_CDF5 mode flag?
Via the NC_CDF5 flag.
Do you mean that, silently, unexpected values appear in files on write (and vice versa for read), and no error codes were returned...?
My understanding of netCDF is that it first finds the number of contiguous requests (in file layout) from the arguments start and count, and then runs a loop to write one request at a time. I assume your case is a write of a contiguous chunk of size > 2 GB. I have never tried this before. It is likely some internal code needs to be adjusted to handle this kind of request.
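For illustration only (this is a sketch, not actual NCO or netCDF code), a whole-variable put of a billion doubles reduces to a single contiguous request through the vara interface, i.e. one ~8 GB write:

/* Sketch: a whole-variable write expressed as one start/count request. */
#include <netcdf.h>

#define NELEMS 1000000000UL   /* illustrative variable length */

int put_whole_var(int ncid, int varid, const double *buf)
{
    size_t start[1] = {0};
    size_t count[1] = {NELEMS};   /* the entire variable in one request */
    /* nc_put_var_double(ncid, varid, buf) reduces to this single call */
    return nc_put_vara_double(ncid, varid, start, count, buf);
}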
Yes, it would explain the problem my users are experiencing if requests for more than 2 GB from a CDF5 file do not return the requested amount. There is no problem doing this from a netCDF4 file. Again, this is just a hypothesis; I'm trying to track down a mysterious bug, and this could be it. Need confirmation.
I have now replicated workflows that strongly suggest there are undocumented differences between put/get results for large (4-8 GB) variables between CDF5 and NETCDF4 files. I don't have proof that this is a netCDF issue rather than an NCO issue, though @wkliao suggests that CDF5 get/put in the netCDF library was never tested for requests > 2 GB (and PnetCDF explicitly does not support single requests > 2 GB). This issue prevents the DOE ACME MPAS model runs at high resolution (that are archived in CDF5 format) from being properly analyzed, so it affects users now. Is anyone (@wkliao ?) interested in verifying whether CDF5 has put/get limits?
My circumstantial methods and evidence are that a billion doubles all equal to one do average to one for NETCDF4 data, and don't for CDF5 data...
ncap2 -5 -v -O -s 'one=1;defdim("dmn_1e9",1000000000);dta_1e9[dmn_1e9]=1.0;two=2;' ~/nco/data/in.nc ~/foo_big.nc5
ncap2 -4 -v -O --cnk_dmn=dmn_1e9,100000000 -s 'one=1;defdim("dmn_1e9",1000000000);dta_1e9[dmn_1e9]=1.0;two=2' ~/nco/data/in.nc ~/foo_big.nc4
ncwa -O ~/foo_big.nc5 ~/foo_avg.nc5
ncwa -O ~/foo_big.nc4 ~/foo_avg.nc4
ncks -H ~/foo_avg.nc5
ncks -H ~/foo_avg.nc4
which results in
zender@skyglow:~$ ncks -H ~/foo_avg.nc5
netcdf foo_avg {
variables:
double dta_1e9 ;
int one ;
int two ;
data:
dta_1e9 = 0.536870912 ;
one = 1 ;
two = 2 ;
} // group /
zender@skyglow:~$ ncks -H ~/foo_avg.nc4
netcdf foo_avg {
variables:
double dta_1e9 ;
int one ;
int two ;
data:
dta_1e9 = 1 ;
one = 1 ;
two = 2 ;
} // group /
The first ncap2 command above requires the latest NCO snapshot to support CDF5. Note that these commands both place a few other variables, named "one" and "two", around the large variable "dta_1e9" in the output file, so dta_1e9 is not the only variable. When dta_1e9 is the only variable, the CDF5-based workflow yields the correct answer! So, if my hypothesis is correct, CDF5 variables larger than some threshold size (possibly 2 GB?) are not written and/or read correctly when the nc_put/get_var() call is one request for the entire variable and there are other variables in the dataset.
The behavior is identical on netCDF 4.4.1.1 and today's daily snapshot of 4.5.1-development.
Since the CDF5 code in libsrc came from pnetcdf originally, it would not be surprising if the 2 GB limit snuck in. We are going to need a simple C program to test this. I tried but ran out of memory.
I tested a short netCDF program to mimic the I/O @czender described (the code is shown below). It creates 3 variables, namely var, one, and two. var is a 1D array of type double of 2^30 elements. Variables one and two are scalars. The 3 variables are first written to a new file and then read back to calculate the average. However, I could not reproduce the problem with this program.
@czender could you use this program and modify it close to what you are doing with the NCO operations?
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define DIM 1073741824
#define ERR {if(err!=NC_NOERR){printf("Error at line %d in %s: %s\n", __LINE__,__FILE__, nc_strerror(err));nerrs++;}}

int main(int argc, char *argv[])
{
    int err, nerrs=0, ncid, dimid, varid[3], int_buf1, int_buf2;
    size_t i;
    double *buf, avg=0.0;

    /* Create a CDF-5 file with one large 1-D double variable and two scalar ints */
    err = nc_create("test.nc", NC_CLOBBER|NC_CDF5, &ncid); ERR
    err = nc_def_dim(ncid, "dim", DIM, &dimid); ERR
    err = nc_def_var(ncid, "var", NC_DOUBLE, 1, &dimid, &varid[0]); ERR
    err = nc_def_var(ncid, "one", NC_INT, 0, NULL, &varid[1]); ERR
    err = nc_def_var(ncid, "two", NC_INT, 0, NULL, &varid[2]); ERR
    err = nc_set_fill(ncid, NC_NOFILL, NULL); ERR
    err = nc_enddef(ncid); ERR

    buf = (double*) malloc(DIM * sizeof(double));
    if (buf == NULL) { printf("malloc of %zu bytes failed\n", DIM * sizeof(double)); return 1; }
    for (i=0; i<DIM; i++) buf[i] = 1.0;

    /* Write the whole variable in a single request, then the two scalars */
    err = nc_put_var_double(ncid, varid[0], buf); ERR
    int_buf1 = 1;
    err = nc_put_var_int(ncid, varid[1], &int_buf1); ERR
    int_buf2 = 2;
    err = nc_put_var_int(ncid, varid[2], &int_buf2); ERR
    err = nc_close(ncid); ERR

    /* Reopen, read everything back, and compute the average */
    err = nc_open("test.nc", NC_NOWRITE, &ncid); ERR
    err = nc_inq_varid(ncid, "var", &varid[0]); ERR
    err = nc_inq_varid(ncid, "one", &varid[1]); ERR
    err = nc_inq_varid(ncid, "two", &varid[2]); ERR
    for (i=0; i<DIM; i++) buf[i] = 0.0;
    err = nc_get_var_double(ncid, varid[0], buf); ERR
    int_buf1 = int_buf2 = 0;
    err = nc_get_var_int(ncid, varid[1], &int_buf1); ERR
    err = nc_get_var_int(ncid, varid[2], &int_buf2); ERR
    err = nc_close(ncid); ERR

    printf("get var one = %d\n",int_buf1);
    printf("get var two = %d\n",int_buf2);
    for (i=0; i<DIM; i++) avg += buf[i];
    avg /= DIM;
    printf("avg = %f\n",avg);
    free(buf);
    return (nerrs > 0);
}
% ./tst_cdf5
get var one = 1
get var two = 2
avg = 1.000000
% ls -lh
total 8.1G
-rw------- 1 wkliao users 2.2K Aug 21 16:16 Makefile
-rw-r--r-- 1 wkliao users 8.1G Aug 21 16:29 test.nc
-rwxr-xr-x 1 wkliao users 1.1M Aug 21 16:27 tst_cdf5
-rw------- 1 wkliao users 1.8K Aug 21 16:27 tst_cdf5.c
% ncdump -h test.nc
netcdf test {
dimensions:
dim = 1073741824 ;
variables:
double var(dim) ;
int one ;
int two ;
}
Thanks @wkliao. This is a good starting point. I made a few changes to more closely follow the NCO code path. However, I still get the same results you do. Will keep trying...
@wkliao I notice something I do not understand about CDF5: I have a netCDF4 file with two variables whose on-disk sizes compute as 9 GB and 3 GB, respectively. By "computes as" I mean multiplying the dimension sizes times the size of NC_DOUBLE. And the netCDF4 (uncompressed) filesize is, indeed, 12 GB, as I expect. Yet when I convert that netCDF4 file to CDF5, the total size changes from 12 GB to 9 GB. In other words, it looks like CDF5 makes use of compression, or does not allocate wasted space for _FillValues, or something like that. If you understand what I'm talking about, please explain why CDF5 consumes less filespace than expected...
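For reference, by "computes as" I mean arithmetic like the following sketch (error checking omitted; the simplified type-size mapping is my own shortcut, not a netCDF API):

/* Sketch: expected uncompressed size of one variable = product of its
 * dimension lengths times the size of its external type. */
#include <netcdf.h>

static unsigned long long expected_var_bytes(int ncid, int varid)
{
    int ndims, dimids[NC_MAX_VAR_DIMS];
    nc_type xtype;
    unsigned long long sz;

    nc_inq_var(ncid, varid, NULL, &xtype, &ndims, dimids, NULL);
    sz = (xtype == NC_DOUBLE || xtype == NC_INT64) ? 8 : 4;  /* simplified type-size map */
    for (int i = 0; i < ndims; i++) {
        size_t len;
        nc_inq_dimlen(ncid, dimids[i], &len);
        sz *= len;
    }
    return sz;
}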
The file size should be 12GB. Do you have a test program that can reproduce this?
Uh oh. It's hard to wrap my head around all this. The mysterious issues only appear with huge files that are hard to manipulate. The issues when processing with NCO may be due to NCO, but in order to verify that I have to use another toolkit. CDO does not yet support CDF5. And nccopy either fails (with -V) to extract the variables I want, or (with -v) extracts all the variables, so I can't mimic the NCO workflow...
zender@skyglow:~$ nccopy -V one_dmn_rec_var,two_dmn_rec_var ~/nco/data/in.nc ~/foo.nc
NetCDF: Variable not found
Location: file nccopy.c; line 979
zender@skyglow:~$ nccopy -v one_dmn_rec_var,two_dmn_rec_var ~/nco/data/in.nc ~/foo.nc
zender@skyglow:~$ ncdump ~/foo.nc | m
netcdf foo {
dimensions:
dgn = 1 ;
bnd = 2 ;
lat = 2 ;
lat_grd = 3 ;
lev = 3 ;
rlev = 3 ;
ilev = 4 ;
lon = 4 ;
lon_grd = 5 ;
char_dmn_lng80 = 80 ;
char_dmn_lng26 = 26 ;
char_dmn_lng04 = 4 ;
date_dmn = 5 ;
fl_dmn = 3 ;
lsmlev = 6 ;
wvl = 2 ;
time_udunits = 3 ;
lon_T42 = 128 ;
lat_T42 = 64 ;
lat_times_lon = 8 ;
gds_crd = 8 ;
gds_ncd = 8 ;
vrt_nbr = 2 ;
lon_cal = 10 ;
lat_cal = 10 ;
Lon = 4 ;
Lat = 2 ;
time = UNLIMITED ; // (10 currently)
variables:
int date_int(date_dmn) ;
date_int:long_name = "Date (as array of ints: YYYY,MM,DD,HH,MM)" ;
float dgn(dgn) ;
dgn:long_name = "degenerate coordinate (dgn means degenerate, i.e., of size 1)" ;
...
Until I find another way to subset variables from a CDF5 file, I'm stuck. @WardF would you please instruct me how to use nccopy to subset certain variables from a CDF5 file? I think the above output demonstrates that nccopy (yes, the latest 4.5.x) has some breakage with the -v and -V options.
I will take a look at -v/-V and see what's going on. The original files are obviously quite large, I'll see if I can recreate this locally with a file on hand.
The in.nc file, on which the above commands were performed to demonstrate the nccopy -v/-V weirdness, is tiny. The same behavior should occur with any file you like.
@DennisHeimbigner This feels like an issue we've seen and (I thought) addressed, recently. Does this ring any bells for you? Maybe the fix is on a branch that I neglected to merge. Going to look now.
Ok. Similar issue, although the issue claims it is 64-bit offset only and this is not the case. I'll update the original issue.
I can copy in.nc (via nccopy) from classic to netCDF-4 classic; the commands Charlie outlines above work on this new file, but fail on the old one.
Found a code stanza in nccopy.c starting at line 1451. The comment seems of interest here.
/* For performance, special case netCDF-3 input or output file with record
 * variables, to copy a record-at-a-time instead of a
 * variable-at-a-time. */
/* TODO: check that these special cases work with -v option */
if(nc3_special_case(igrp, inkind)) {
    size_t nfixed_vars, nrec_vars;
    int *fixed_varids;
    int *rec_varids;
    NC_CHECK(classify_vars(igrp, &nfixed_vars, &fixed_varids, &nrec_vars, &rec_varids));
    NC_CHECK(copy_fixed_size_data(igrp, ogrp, nfixed_vars, fixed_varids)); // FAILURE IS DOWNSTREAM FROM HERE
    NC_CHECK(copy_record_data(igrp, ogrp, nrec_vars, rec_varids));
} else if (nc3_special_case(ogrp, outkind)) {
Have you tried to disable this optimization to see if it then starts working ok?
@wkliao This 9 GB CDF5 file contains two variables whose uncompressed sizes are 9 GB and 3 GB and so should require 12 GB of disk space to store. Inspection with ncdump/ncks shows there are data in both variables. When I convert it to netCDF4, the resulting file is, indeed, 12 GB. Can you tell me anything about whether the CDF5 file is legal, or corrupt, or when/where/how in the writing process it may have been truncated?
@DennisHeimbigner Yes I found the issue, it is unrelated to optimization (in terms of the nccopy issue, not what @czender has observed with file sizes). I'm working on a fix right now.
Ok, I think I have a fix for the nccopy -V/-v issue @czender references above. It is in the branch gh425. I have not run any regression tests yet so I can't say it's the fix, but I will pick it up tomorrow and continue. I can say that nccopy -V/-v is working as expected for this specific test case, however. Pushing out to github, will see what travis says.
Regarding the file size issue: first, all classic formats, including CDF-5, do no compression. One possibility for what you encountered is that, when this CDF-5 file was created, not all elements of the second variable were written and fill mode was turned off (if the file was created by a PnetCDF program, note that fill mode is off by default in PnetCDF). In that case, because the second variable was not fully written, the file size can be less than 12 GB.
There is a PnetCDF utility program called ncoffsets. The command "ncoffsets file.nc" prints the starting and ending file offsets of the individual variables defined in a classic file. When used on the above CDF-5 file, it should print an ending offset of 12 GB for the second variable, even though "ls -l" can still show a number less than that. Please give it a try and let me know.
@WardF please let me know when the nccopy fixes are in master so I can check whether nccopy and NCO give the same answers when subsetting huge CDF5 files.
@czender It is checked into master; I added a test based on the information you provided that failed before my fix and is passing now. Please let me know if you still encounter the issue!
@WardF thank you for fixing nccopy. It works for me. And it behaves identically to NCO when subsetting large CDF5 files: both NCO (all operators) and nccopy silently truncate CDF5 data. There are, to my knowledge, no other command-line subsetting utilities to test (since CDO does not yet support CDF5), though my prediction is that they too would truncate CDF5 files while subsetting. It sure seems like a netCDF library issue...unless nccopy and NCO both independently made the same programming error. This 57 GB CDF5 file contains two variables (mentioned way up in this thread) whose sizes are 9 GB and 3 GB, so their total size is 12 GB. When subset with NCO or nccopy, the resulting total filesize is only 9 GB. Some data from both variables is written, but clearly at least 25% of the total is not. @wkliao and I have been unable to reproduce this behavior with a simple C program, although as shown above I can easily demonstrate incorrect results when using ncap2 to create relatively simple CDF5 files in that size range. Once someone fixes nccopy or libnetcdf so that nccopy works correctly, please let me know, and I'll test whether the same fix works in NCO.
zender@skyglow:~$ nccopy -k 4 -V timeMonthly_avg_normalVelocity,timeMonthly_avg_vertVelocityTop mpaso.hist.am.timeSeriesStatsMonthly.0054-01-01.nc foo.nc4
zender@skyglow:~$ ls -l foo.nc4
-rw-r--r--. 1 zender zender 11922672598 Sep 5 16:51 foo.nc4
zender@skyglow:~$ nccopy -k 5 -V timeMonthly_avg_normalVelocity,timeMonthly_avg_vertVelocityTop mpaso.hist.am.timeSeriesStatsMonthly.0054-01-01.nc foo.nc5
zender@skyglow:~$ ls -l foo.nc5
-rw-r--r--. 1 zender zender 8908582756 Sep 5 16:53 foo.nc5
p.s. Sorry for the large test case, smaller ones have not been too helpful at reproducing the problem. I am at the CF2 workshop in Boulder until Friday, in case you want to say hi or grab a beer.
No problem, thanks @czender. I will investigate the cdf5/nccopy issue but in the meantime I need to get the rc3 out. There's a chance this issue may not be fixed for the final release, but we will document it if need be. I'm encouraged that a straight C program isn't causing an error. @wkliao have you tried to see if this behavior persists when copying CDF5 files generated by pnetcdf instead of the regular libnetcdf?
I'd prefer to have this fixed ahead of time but this release has languished for a while now and really needs to go out.
*really needs to go out, although of course not with show-stopping bugs. It is a harder balancing act than I expected, I'm finding, between "I need to fix all the bugs before there's a release" and "Known bugs don't matter".
@czender
p.s. Sorry for the large test case, smaller ones have not been too helpful at reproducing the problem. I am at the CF2 workshop in Boulder until Friday, in case you want to say hi or grab a beer.
I will be in the office tomorrow and would love to say hi; we are having our netcdf meeting tomorrow if you'd care to switch gears from CF2, briefly.
FWIW, the 57 GB test file was generated by PnetCDF (as its global metadata makes clear). Pinging @wkliao. The test file appears to be of the correct size to match the variables/dimensions it contains (I added a feature in the latest NCO snapshot that prints the expected RAM/disk size of uncompressed data in a file, accounting for subsets and hyperslabs). Anyway, the test file is the correct size, but the two-variable subset (created with nccopy or NCO) of that file is 25% smaller than expected. It is also true that I can generate apparently corrupt CDF5 files directly through the netCDF interface (without PnetCDF) using ncap2. Hence there is no evidence that suggests pnetcdf is/was at fault. It is only the netCDF access to this data that acts strangely.
What time and where is the netCDF meeting?
The time is TBD; it is usually scheduled at 1 but it is unclear if I can make it. We are trying to reschedule for 10 but waiting to hear from upstream dependencies. It is in the small meeting room at Unidata. I'll keep you in the loop when it is hammered out.
PR #478 should fix the problem. The PnetCDF utility ncoffsets detected the wrong variable sizes and offsets, which helped me find the bug. This PR passes the nccopy commands from @czender and the file size appears correct now. I also ran make check and all tests passed.
Thank you for fixing the bug @wkliao. Please characterize the datasets it affects, and whether reading, writing, or both are affected. Under what circumstances can users expect silently truncated subsetted CDF5 files? Variables of what size? Record variables? All variables? Is there a workaround? Until netCDF 4.5.x is released with this patch, what do we recommend with respect to generating or using CDF5 files? For example, the MPAS users spent a lot of their supercomputer allocation producing files that, given the inertia in getting new software installed at NCAR, NERSC, ALCF, etc., NCO will not be able to analyze for a few months. Should they switch future PIO-based simulations to use netCDF4? Is there a simple workaround that NCO can implement that will work with CDF5 on netCDF 4.4.x? Both users and developers need guidance.
@wkliao I'm interested in this information as well, so that we can include it in the 4.5.0 release documentation.
Hi, Charlie,
I have been thinking about your questions these days. I am not sure I am in the right position to answer the questions. Let me describe the cause of this bug first and maybe NetCDF folks have good suggestions.
The bug is caused by re-using, on CDF-5 files, the code that checks a variable's size against the maximum allowed for the CDF-1 and CDF-2 formats. For CDF-1 and CDF-2, there can be at most one large fixed-size variable and it has to be the last variable defined; for record variables, the same limit applies to a single record of a variable. Because CDF-5 lifts this limit, the checking code messes up the calculation of the starting file offsets for all variables defined after the large variable. "Large" here means a variable whose size is > 2^31-3 bytes for CDF-1 or > 2^32-3 bytes for CDF-2; see NetCDF Format Limitations.
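To illustrate the idea (a sketch only, not the actual libsrc code), the header writer assigns begin offsets cumulatively, so one mis-checked size shifts every variable defined after it:

/* Illustration only. Begin offsets are assigned cumulatively; if a large
 * variable's size is run through the CDF-1/2 check (limit CDF2_MAX below)
 * even for a CDF-5 file, every later variable's begin comes out wrong. */
#define CDF2_MAX 4294967293ULL   /* 2^32 - 3 bytes */

typedef struct { unsigned long long vsize, begin; } var_t;

static void assign_begins(var_t *vars, int nvars, unsigned long long hdr_end)
{
    unsigned long long off = hdr_end;
    for (int i = 0; i < nvars; i++) {
        vars[i].begin = off;
        off += vars[i].vsize;   /* next variable begins where this one ends */
    }
}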
I believe this bug affects writes only, but as suggested by @DennisHeimbigner we need a test program to check.
If NCO can detect the above situation from the netCDF version number, then I suggest it return an error message requiring version 4.5 or later. Or maybe NCO can create more than one file, making sure there is no more than one large variable per file, when 4.4.x or earlier is used.
Note that CDF-5 files produced by PnetCDF are not affected, so files created by applications that use PnetCDF directly, or netCDF through PnetCDF (is MPAS one of them?), will be valid CDF-5 files. However, they cannot be fed to netCDF 4.4.x to create another CDF-5 file, for example with nccopy. If NCO makes calls to the netCDF C API, NCO can use the NC_MPIIO/NC_PNETCDF flag to create a new CDF-5 file that will use the PnetCDF driver underneath.
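A minimal sketch of that creation path, assuming a parallel-enabled netCDF build (--enable-pnetcdf) and an MPI program; the function name is just for illustration:

/* Sketch: create a CDF-5 file through the PnetCDF driver. */
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

int create_cdf5_via_pnetcdf(const char *path, int *ncidp)
{
    /* NC_PNETCDF/NC_MPIIO ask for the PnetCDF driver underneath this classic-format file */
    return nc_create_par(path, NC_CLOBBER | NC_64BIT_DATA | NC_PNETCDF,
                         MPI_COMM_WORLD, MPI_INFO_NULL, ncidp);
}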
Thanks for the info @wkliao. I hope you continue to audit the CDF5 code because I have seen behavior that seems consistent with reads being affected (as well as writes). In particular, I have used NCO to create MD5 hashes of the same data in a valid CDF5 file (created by PnetCDF) and in an invalid CDF5 file (truncated as above). The MD5 hashes are the same. I can't explain how this could be true, though it seems to indicate there could be a bug in reading CDF5 files. You can check this yourself with commands like
ncks -D 2 --trd --md5_dgs -v timeMonthly_avg_normalVelocity,timeMonthly_avg_vertVelocityTop mpaso.hist.am.timeSeriesStatsMonthly.0054-01-01.nc | grep MD5
Unfortunately I'm not confident that using NC_MPIIO/NC_PNETCDF would help. NCO requests the entire variable at one time. If the variable size exceeds 2 GB, wouldn't that violate the PnetCDF request limit?
FYI. I wrote a short program to extract variable timeMonthly_avg_normalVelocity from a file to a new file. Command "diff" on the two new files (extracted from valid and invalid CDF5) reports they are different.
My short program was built with the latest v4.5.0-release-branch, commit 15263f7.
Here is a test program which is simple to modify to output in any netCDF format. CDF5 fails and netCDF4 succeeds. Note that the first CDF5 variable fails, while the second succeeds. One should probably also test that record variables work with any patch...
zender@aerosol:~$ cdf5 # NC_CDF5
total1 = 144115119087943425, expected1 = 576460751766552576
total2 = 576460751766552576, expected2 = 576460751766552576
avg1 = 134217663.750000
avg2 = 536870911.500000
zender@aerosol:~$ cdf5 # NC_NETCDF4
total1 = 576460751766552576, expected1 = 576460751766552576
total2 = 576460751766552576, expected2 = 576460751766552576
avg1 = 536870911.500000
avg2 = 536870911.500000
File cdf5.c:
// Purpose: Test behavior of large CDF5 files
// 20170821: Original test by Wei-keng Liao
// 20170909: Rewritten to expose CDF5 bug by Charlie Zender
// gcc -std=c99 -I/opt/local/include -o ~/bin/cdf5 ~/sw/c/cdf5.c -L/opt/local/lib -lnetcdf -lhdf5_hl -lhdf5 -lcurl

#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define DIM 1073741824
#define ERR {if(err!=NC_NOERR){printf("Error at line %d in %s: %s\n", __LINE__,__FILE__, nc_strerror(err));nerrs++;}}

int main(int argc, char *argv[])
{
    int err, nerrs=0, ncid, dimid, varid[2];
    size_t i;
    long long *buf1,*buf2;
    long long avg1,avg2;
    avg1=avg2=0LL;

    /* Uncomment exactly one nc_create() call to select the output format */
    //err = nc_create("cdf5.nc", NC_CLOBBER|NC_CDF5, &ncid); ERR;
    //err = nc_create("cdf5.nc", NC_CLOBBER|NC_64BIT_DATA, &ncid); ERR;
    err = nc_create("cdf5.nc", NC_CLOBBER|NC_NETCDF4, &ncid); ERR;
    err = nc_def_dim(ncid, "dim", DIM, &dimid); ERR;
    // csz 20170830 test record dimension
    // err = nc_def_dim(ncid, "dim", NC_UNLIMITED, &dimid); ERR;
    err = nc_def_var(ncid, "var1", NC_INT64, 1, &dimid, &varid[0]); ERR;
    err = nc_def_var(ncid, "var2", NC_INT64, 1, &dimid, &varid[1]); ERR;
    err = nc_set_fill(ncid, NC_NOFILL, NULL); ERR;
    err = nc_enddef(ncid); ERR;

    buf1 = (long long *) malloc(DIM * sizeof(long long));
    buf2 = (long long *) malloc(DIM * sizeof(long long));
    // 20170831 Write index-dependent values into array so truncation does not yield false-negative answer
    for (i=0; i<DIM; i++) buf1[i] = buf2[i] = i;
    err = nc_put_var_longlong(ncid, varid[0], buf1); ERR;
    err = nc_put_var_longlong(ncid, varid[1], buf2); ERR;
    err = nc_close(ncid); ERR;

    /* Reopen and read both variables back in single whole-variable requests */
    err = nc_open("cdf5.nc", NC_NOWRITE, &ncid); ERR;
    err = nc_inq_varid(ncid, "var1", &varid[0]); ERR;
    err = nc_inq_varid(ncid, "var2", &varid[1]); ERR;
    for (i=0; i<DIM; i++) buf1[i] = buf2[i] = 0LL;
    err = nc_get_var_longlong(ncid, varid[0], buf1); ERR;
    err = nc_get_var_longlong(ncid, varid[1], buf2); ERR;
    err = nc_close(ncid); ERR;

    for (i=0; i<DIM; i++) avg1 += buf1[i];
    for (i=0; i<DIM; i++) avg2 += buf2[i];
    printf("total1 = %lld, expected1 = %lld\n",avg1,(DIM-1LL)*DIM/2LL);
    printf("total2 = %lld, expected2 = %lld\n",avg2,(DIM-1LL)*DIM/2LL);
    printf("avg1 = %f\n",avg1*1.0/DIM);
    printf("avg2 = %f\n",avg2*1.0/DIM);
    free(buf1);
    free(buf2);
    return (nerrs > 0);
}
Charlie, did you post somewhere the original -- and failing -- subsetting command for the 57 GB pnetcdf file?
Yes, as per above:
zender@skyglow:~$ nccopy -k 4 -V timeMonthly_avg_normalVelocity,timeMonthly_avg_vertVelocityTop mpaso.hist.am.timeSeriesStatsMonthly.0054-01-01.nc foo.nc4
zender@skyglow:~$ ls -l foo.nc4
-rw-r--r--. 1 zender zender 11922672598 Sep 5 16:51 foo.nc4
zender@skyglow:~$ nccopy -k 5 -V timeMonthly_avg_normalVelocity,timeMonthly_avg_vertVelocityTop mpaso.hist.am.timeSeriesStatsMonthly.0054-01-01.nc foo.nc5
zender@skyglow:~$ ls -l foo.nc5
-rw-r--r--. 1 zender zender 8908582756 Sep 5 16:53 foo.nc5
I just found out that branch v4.5.0-release-branch does not include my fixes from #478. After merging #478 into master, the test program ran fine with the expected results.
@wkliao The main issue now is testing that your fix works across multiple platforms (32-bit, 64-bit, Windows, ARM, etc.) and formats (64-bit offset, etc.). I'm looking at that now, thanks all for your help with this!
@wkliao I need to know whether you think intercepting nc_put_var*() to split a single CDF5 write request for a data buffer larger than N into multiple write requests of buffers smaller than N would avoid this bug. And, if so, what is N? And do you still think that CDF5 reads are not affected?
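Concretely, the interception I have in mind would look something like this sketch (the chunk size below is only a placeholder; its proper value is the N I'm asking about):

/* Sketch: split one whole-variable write into requests no larger than
 * CHUNK elements, for a 1-D double variable of length nelem. */
#include <netcdf.h>

#define CHUNK (1073741824ULL / sizeof(double))   /* ~1 GiB of doubles; placeholder for N */

int put_var_chunked(int ncid, int varid, size_t nelem, const double *buf)
{
    int err = NC_NOERR;
    for (size_t start = 0; start < nelem && err == NC_NOERR; start += CHUNK) {
        size_t st[1] = {start};
        size_t ct[1] = {(nelem - start < CHUNK) ? (nelem - start) : CHUNK};
        err = nc_put_vara_double(ncid, varid, st, ct, buf + start);
    }
    return err;
}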
When running python tests against libnetcdf built with the patch from @wkliao, I see the following (on 64-bit systems only).
This will need to be sorted out before merging this fix in or saying that it 'fixes' the problem.
netcdf4-python version: 1.3.0
HDF5 lib version: 1.8.19
netcdf lib version: 4.5.1-development
numpy version 1.11.0
...............................foo_bar
.http://remotetest.unidata.ucar.edu/thredds/dodsC/testdods/testData.nc => /tmp/occookieKmOOvt
..............................................python: /home/tester/netcdf-c/libsrc/nc3internal.c:794: NC_endef: Assertion `ncp->begin_rec >= ncp->old->begin_rec' failed.
It's possible it is a problem with the python test; I'll ask our in-house python guys and see what they say :)
I've determined that the python test which is failing is tst_cdf5.py.
The test is as follows; does anything leap out?
def setUp(self):
    self.netcdf_file = FILE_NAME
    nc = Dataset(self.netcdf_file,'w',format='NETCDF3_64BIT_DATA')
    # create a 64-bit dimension
    d = nc.createDimension('dim',dimsize) # 64-bit dimension
    # create an 8-bit unsigned integer variable
    v = nc.createVariable('var',np.uint8,'dim')
    v[:ndim] = arrdata
    nc.close()
@czender
The bug appears when defining more than one large variable in a new file, so splitting a large put request into smaller ones will not fix the bug.
If you are developing a workaround in NCO, then I suggest checking the number of large variables and creating a new file that contains only one large variable, making sure the large variable is defined last.
I still believe the bug affects writes only, as the fixes I developed are in subroutines called only by the file header writer. However, it is better to have a test program to check.
@wkliao does "large" in your message above mean 2 GiB or 4 GiB or ...?
It is mentioned in one of my previous posts. Here it is, copied and pasted:
"Large" here means a variable whose size is > 2^31-3 bytes for CDF-1 or > 2^32-3 bytes for CDF-2; see NetCDF Format Limitations.
I don't understand. I'm talking about writing a CDF5 file with netCDF 4.4.x. Not CDF1 or CDF2. What is the largest variable I can safely write as the last variable in a CDF5 file?
Sorry. Let me re-phrase: when using netCDF 4.4.x to create a new CDF-5 file, the file can contain at most one large variable, and that variable must be defined last. A large variable here is one of size > 2^32-3 bytes.
To be honest, I really do not recommend a workaround for netCDF 4.4.x, because the above suggestion has never been fully tested. It is based only on my understanding of the root cause of the bug.
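If NCO does attempt a check anyway, it would amount to something like the sketch below (the per-variable byte sizes are assumed to be computed elsewhere; the threshold and the defined-last rule are as I described above):

/* Sketch: decide whether a planned CDF-5 layout is safe to write with
 * netCDF 4.4.x: at most one variable larger than 2^32-3 bytes, and it
 * must be the last variable defined. */
#define CDF5_SAFE_LIMIT 4294967293ULL   /* 2^32 - 3 bytes */

static int cdf5_layout_is_safe_44x(int nvars, const unsigned long long *vsize)
{
    int nlarge = 0, last_large = -1;
    for (int i = 0; i < nvars; i++) {
        if (vsize[i] > CDF5_SAFE_LIMIT) { nlarge++; last_large = i; }
    }
    /* safe if there is no large variable, or exactly one and it is defined last */
    return nlarge == 0 || (nlarge == 1 && last_large == nvars - 1);
}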
A user applying NCO arithmetic to large data files in the "new" CDF5 format is encountering issues that silently result in bad answers. I'm trying to track down the problem. I have not found netCDF documentation on CDF5 format limitations on this page, so one purpose of this issue is to request the addition of CDF5 limits there (assuming that's the right place for it). The table also needs reformatting.
This PnetCDF page says
Forgetting about MPI, and considering only serial netCDF environments, does the 2 GiB put/get request limit above apply to any of the netCDF file formats, including CDF5? NCO does not limit its request sizes, so I wonder if that could be the problem...