LLNL / H5Z-ZFP

A registered ZFP compression plugin for HDF5
Other
50 stars 22 forks source link

fix reading of byte-swapped input files (#95) #101

Closed helmutg closed 1 year ago

helmutg commented 1 year ago

When reading a byte-swapped file, the input is grouped to 4-byte words and each of them is swapped individually. When we try to read such a file, we first validate its header using zfp_read_header with the ZFP_HEADER_MAGIC flag. This flag causes it to only validate the first word to be "zfp\x05". If it is not exactly that, it gives up. Unfortunately, this magic word can already be swapped. The actual byte swapping code would only be tried once the full header would fail to read, so automatic byte swapping never worked.

Instead, when encountering a header with bad magic, try swapping it already and only try reading the full header once the magic (normal or swapped) has been read successfully.

Thanks to Mark C. Miller, Peter Lindstrom and Enrico Zini for doing most of the debugging to get here.

markcmiller86 commented 1 year ago

Ok, sorry for long delay in assessing. I needed to fully understand why orig. implementation was failing and that new approach is best way to go.

What the filter is basically doing is using HDF5's cd_values stuff (an array of user-defined length of unsigned ints) to store a faux (or dummy) ZFP stream header. That is because that header captures everything the ZFP library needs to know about the compressed data. We use ZFP library to write its header into a buffer that is later treated by HDF5 as the dataset's cd_values array and we do the reverse when reading.

But, ZFP is expecting to be able to do that in an endian-agnostic way. In otherwords, you get the same sequence of bytes in the ZFP stream regardless of whether its done on a big-endian or little-endian machine. Thats fine but the cd_values array of unsigned ints is NOT endian-agnostic. It is designed to store little endian to the file but nonetheless handle byte swapping of the cd_values data (on both write and read) when interacting with a big endian caller.

So, ZFP's header will wind up experiencing byte swapping from HDF5 on big endian systems. The implementation was to trigger off a failure of ZFP to read its header as a hint that maybe the data got byte-swapped and then UNswap the data and try reading the header a second time. That worked fine until separate logic to read just ZFP's MAGIC was introduced (cd1de1c) but did not also handle the possible byte-swapping scenario.

Thanks so much to @helmutg for finding and fixing.