Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

Content-based hashing of netCDF data #646

Status: Closed (willirath closed this issue 7 years ago)

willirath commented 7 years ago

TL;DR: Hashing netCDF files based on the contained data (and physically meaningful attributes) would be nice.

Problem

I often have to deal with different versions of identical data and currently lack a way to efficiently version them or verify their integrity. A typical use case is raw ocean-model output in netCDF3 that is then converted to deflated netCDF4. The undeflated contents remain bitwise identical, while the container does not. Hence, hashing whole files is not a meaningful way to tell whether two files contain the same information.
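To make the container-versus-content distinction concrete, here is a minimal sketch using only the Python standard library. It does not read real netCDF files: the 4-byte tags `RAW0`/`DFL0`, the helper `content_hash`, and the sample values are all hypothetical stand-ins, with `zlib` playing the role of netCDF4 deflation.

```python
import hashlib
import struct
import zlib

# Hypothetical stand-in for a variable's raw values (not real model output).
values = [0.0, 1.5, 2.25, 3.0] * 256
payload = struct.pack(f"<{len(values)}d", *values)

# Two toy "containers" holding the same payload: one raw, one deflated.
# The 4-byte tags are invented for this sketch, not a netCDF feature.
raw_container = b"RAW0" + payload
deflated_container = b"DFL0" + zlib.compress(payload, 6)


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def content_hash(container: bytes) -> str:
    """Hash the decoded payload instead of the bytes stored on disk."""
    tag, body = container[:4], container[4:]
    data = zlib.decompress(body) if tag == b"DFL0" else body
    return sha256_hex(data)


# Hashing whole containers distinguishes them although the data is identical.
file_hashes_differ = sha256_hex(raw_container) != sha256_hex(deflated_container)

# Hashing the decoded contents identifies them as the same data.
content_hashes_match = content_hash(raw_container) == content_hash(deflated_container)

print(file_hashes_differ, content_hashes_match)
```

With real files, the decoding step would be whatever the netCDF library returns when a variable is read, so the hash is taken over values rather than over the on-disk layout.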

Comparing files is solved

There are tools like cdo's diff or nccmp which handle bitwise comparison of two given netCDF files.

But hashing is not?

I think, however, that it is reasonable to do hash-based file verification.
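One possible shape for such a hash, as a rough sketch and not an existing tool: walk the variables in a fixed order, feed names, shapes, values, and selected attributes into a single digest, and skip bookkeeping attributes. The dictionary layout, the function `content_digest`, and the `IGNORED_ATTRS` set are assumptions made for illustration; which attributes count as physically meaningful is a judgment call, and all values are treated as 64-bit floats here.

```python
import hashlib
import struct

# Attribute names treated as bookkeeping rather than content.
# This choice is an assumption of the sketch, not an established convention.
IGNORED_ATTRS = {"history"}


def content_digest(variables: dict) -> str:
    """Digest a mapping name -> {"shape": tuple, "values": list, "attrs": dict}."""
    h = hashlib.sha256()
    for name in sorted(variables):  # fixed order, independent of file layout
        var = variables[name]
        h.update(name.encode("utf-8") + b"\0")
        h.update(repr(tuple(var["shape"])).encode("utf-8") + b"\0")
        # Values are packed as little-endian float64 for a canonical byte form.
        h.update(struct.pack(f"<{len(var['values'])}d", *var["values"]))
        for key in sorted(var["attrs"]):
            if key in IGNORED_ATTRS:
                continue
            h.update(f"{key}={var['attrs'][key]!r}".encode("utf-8") + b"\0")
    return h.hexdigest()


original = {
    "temp": {"shape": (2, 2), "values": [1.0, 2.0, 3.0, 4.0],
             "attrs": {"units": "degC"}},
}

# Same data after a hypothetical conversion that only appended a history note.
converted = {
    "temp": {"shape": (2, 2), "values": [1.0, 2.0, 3.0, 4.0],
             "attrs": {"units": "degC", "history": "converted to netCDF4"}},
}

# Same layout, but one value was altered.
modified = {
    "temp": {"shape": (2, 2), "values": [1.0, 2.0, 3.0, 4.5],
             "attrs": {"units": "degC"}},
}

same_content = content_digest(original) == content_digest(converted)
detects_change = content_digest(original) != content_digest(modified)
print(same_content, detects_change)
```

Sorting names and attributes keeps the digest independent of the order in which a tool happened to write them, which is one of the container details a whole-file hash would pick up.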

Questions

jswhit commented 7 years ago

For netCDF4/HDF5 files, there is the Fletcher32 checksum feature of the HDF5 library, which is exposed through the fletcher32 kwarg of createVariable. I'm not sure whether this achieves your goal, though.

willirath commented 7 years ago

This topic seems to come up from time to time. I've tried to collect all existing similar attempts and added a demo of what I understand as content-based hashing here: https://github.com/willirath/netcdf-hash

I think we should close this issue here.