Add provenance information to netcdf-3 files

Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.

BSD 3-Clause "New" or "Revised" License

Add provenance information to netcdf-3 files #1272

Open DennisHeimbigner opened 5 years ago

DennisHeimbigner commented 5 years ago

re: https://github.com/Unidata/netcdf-c/issues/1263

Since there exist a number of native netcdf-3 writers for Java, Python, etc., it is well past time to add provenance to netcdf-3.

Dave-Allured commented 5 years ago

I prefer no provenance attribute for netcdf-3. My objections are increased complexity, confusion for naive users, and the loss of bit-identical reproducibility between netcdf library versions and netcdf APIs.

The only value I see in provenance is aiding diagnosis of files of unknown pedigree. My only reply to this is, try harder to know where your files come from.

This is one person's opinion. Sorry to be cranky. I admit there is a tradeoff between the merits and objections.

DennisHeimbigner commented 5 years ago

Good arguments. One reason that nc3 provenance should not be needed is that the specification of the file format is pretty good, with almost no ambiguities, so all nc3-generating software should conform to it. This is quite different from nc4, which is built on HDF5, and as we know HDF5 is not a fixed format: it changes over time. The original reason for this proposal was that we had a file that turned out to be malformed, and we wanted to track down the genesis of this file.

edhartnett commented 5 years ago

Good point about netcdf classic binary compatibility.

DennisHeimbigner commented 5 years ago

I am closing this issue since there seems to be no pressing need for it.

DennisHeimbigner commented 5 years ago

Ward and I were talking about this and a couple of things came up. First, we note that pnetcdf and CDF5 must figure into this, because if the file is created via parallel I/O, then the pnetcdf library is used instead of the normal netcdf library. This is information that could be useful to know when examining a file.

Second, we realized that we did not know exactly what "bit-identical reproducibility" means. What are some use cases? I do not believe that Unidata has ever promised bit-identical compatibility. It occurred to us that this issue might be better addressed by allowing nccopy or ncdump to generate checksums on the contents of variables.
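To make the checksum idea concrete, here is a minimal sketch in Python. It is not an existing nccopy or ncdump feature; the function name `variable_checksum`, the SHA-256 digest, and the big-endian double encoding are all assumptions chosen for illustration. The point is that the digest depends only on a variable's values, not on header layout, padding, or alignment in the file.

```python
import hashlib
import struct

def variable_checksum(values, fmt=">d"):
    """Hash the logical values of one variable, ignoring file layout.

    `values` is any iterable of numbers; `fmt` is a struct format for one
    element (big-endian double by default). Both are illustrative choices,
    not part of any netCDF API.
    """
    digest = hashlib.sha256()
    for value in values:
        digest.update(struct.pack(fmt, value))
    return digest.hexdigest()

# The same data, as it might be read back from two differently laid-out files.
temperature_a = [273.15, 280.0, 291.5]
temperature_b = [273.15, 280.0, 291.5]
temperature_c = [273.15, 280.0, 291.6]  # one value changed

same = variable_checksum(temperature_a) == variable_checksum(temperature_b)
different = variable_checksum(temperature_a) != variable_checksum(temperature_c)
```

A check like this would answer "do these two files hold the same data?" without requiring that the files themselves be identical.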

Dave-Allured commented 5 years ago

I use "bit-identical reproducible" to mean the same thing as "bit-for-bit reproducible" or "BFB reproducible" as used in these earlier discussions, and elsewhere. I assume this means that every last bit is identical when comparing two files. Do you prefer the BFB terminology?

https://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2008/msg00002.html
https://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2014/msg00050.html

wkliao commented 5 years ago

Regarding bit-for-bit reproducibility, I think neither NetCDF nor PnetCDF fully supports it. There are two issues: fill mode and padding.

NetCDF's default fill mode is NC_FILL, but NC_NOFILL is also allowed. When fill mode is disabled, the unwritten values of a variable in the file are undefined.

Padding is defined in the classic file format specification: padding in the file header must be NULL bytes, and padding in the data section must be the variable's fill value. NetCDF can be configured to enforce NULL-byte padding for the file header, but its current implementation does not support fill-value padding for variables. The same is true of PnetCDF.

Of course, if all alignments are disabled (via nc__enddef), padding is no longer an issue.
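The padding issue can be illustrated with a small sketch, assuming the 4-byte boundary used by the classic format. The helper `padded` is hypothetical, and 0x81 (that is, -127 as a signed byte) is assumed here to be the fill value of a byte variable. The same five data bytes yield two different byte sequences depending on which padding rule the writer applies.

```python
def padded(data, pad_byte):
    """Pad `data` to the next 4-byte boundary using `pad_byte`."""
    pad_len = -len(data) % 4  # 0 to 3 bytes needed to reach the boundary
    return data + bytes([pad_byte]) * pad_len

raw = bytes([1, 2, 3, 4, 5])      # five 1-byte values, so 3 bytes of padding
null_padded = padded(raw, 0x00)   # padding written as NULL bytes
fill_padded = padded(raw, 0x81)   # padding written as the assumed fill value

same_data = null_padded[:len(raw)] == fill_padded[:len(raw)]
same_bytes = null_padded == fill_padded
```

Here `same_data` holds while `same_bytes` does not: the variable's values agree, yet a byte-level comparison of the two outputs fails on the padding alone.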