willirath closed this issue 7 years ago.
For netCDF4/HDF5 files, there is the Fletcher checksum feature of the HDF5 library, which is exposed through the fletcher32 kwarg to createVariable. I am not sure if this achieves your goal, though.
This topic seems to come up from time to time. I've tried to collect all existing similar attempts and added a demo of what I understand as content-based hashing here: https://github.com/willirath/netcdf-hash
Given that, I think we should close this issue.
TLDR: Hashing netCDF files based on the contained data (and the physically meaningful attributes) would be nice.
Problem
I often have to deal with different versions of identical data and currently lack a way to efficiently version them or verify their integrity. A typical use case is raw ocean-model output written as netCDF3 and then converted to deflated netCDF4. The uncompressed data remain bitwise identical, while the container does not. Hence, hashing whole files is not a meaningful way to tell whether two files contain the same information.
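The effect can be illustrated with a toy example using only the standard library: the same payload wrapped in two different containers (here just different magic bytes standing in for the netCDF3 and HDF5 framing) hashes differently as a whole file but identically on its content:

```python
import hashlib

# Toy stand-ins: identical payload, different container framing.
payload = b"\x00\x01\x02\x03" * 256     # stands in for the variable data
file_a = b"CDF\x01" + payload           # netCDF3-style magic bytes (toy)
file_b = b"\x89HDF" + payload           # HDF5-style magic bytes (toy)

def file_hash(blob: bytes) -> str:
    """Hash the whole file, container included."""
    return hashlib.sha256(blob).hexdigest()

def content_hash(blob: bytes) -> str:
    """Hash only the payload, skipping the 4-byte toy header."""
    return hashlib.sha256(blob[4:]).hexdigest()
```

Here file_hash(file_a) and file_hash(file_b) differ, while content_hash agrees for both. Real netCDF containers differ in far more than a header, which is why a content-based hash has to go through the netCDF API rather than skip a fixed prefix.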
Comparing files is solved
There are tools like cdo's diff or nccmp which handle bitwise comparison of two given netCDF files.
But hashing is not?
I think, however, that it is also reasonable to want hash-based verification: a comparison tool needs both files at hand, while a hash can be recorded once and checked against any later copy.
Questions