Closed by spanezz 1 year ago
FYI... I am working on this. I don't have easy access to a big endian system, so I've found a Docker s390x container that is working. I am trying to duplicate the bad behavior that triggered this code change and so far haven't been able to, in spite of now having a big endian system.
For a minute, I was worried the two test files `test_zfp_be.h5` and `test_zfp_le.h5` were mistakenly both little endian. But they do not appear to be. I am trying to prove that to myself by using `od -j <skip> -e` to print the `original` dataset values and confirm that doing so on both files returns byte-swapped results. I did indeed do that, but finding the correct `skip` value to get to the file offset that represents the beginning of the `original` dataset's first chunk was a tad difficult. `h5ls` emits a dataset location as a pair of ints (e.g. `1:800`), and that is certainly not an offset into the file where the dataset starts.
Ok, well, I guess I should have read the initial comment here. First, thanks for adding this test! That is exactly what I was looking for. Ok, let me just confirm it's behaving as I expect for me too.
@brtnfld I just want to check... how does the HDF5 library handle endianness of the `cd_values` array stored to the dataset header? These values are consumed (read and written) by the HDF5 caller but are also stored to the file. I mean, does the lib choose an endianness for the `cd_values` data stored in the file and automatically byte-swap when `cd_values` are being read or written in a cross-endian context?
From Jordan:

Looks like the library does a UINT32ENCODE on the cd_values when writing out the filter pipeline message:

```c
/* Filter parameters */
for (j = 0; j < filter->cd_nelmts; j++)
    UINT32ENCODE(p, filter->cd_values[j]);
```

So I think that means the values will get stored as whatever the machine's endianness is that the file is written from, which may not be ideal.
Well, in looking at the definition of that macro in `H5Fprivate.h`...

```c
#define UINT32ENCODE(p, i) {                         \
    *(p) = (uint8_t)( (i)        & 0xff); (p)++;     \
    *(p) = (uint8_t)(((i) >>  8) & 0xff); (p)++;     \
    *(p) = (uint8_t)(((i) >> 16) & 0xff); (p)++;     \
    *(p) = (uint8_t)(((i) >> 24) & 0xff); (p)++;     \
}
```
That is writing the least significant byte to the lowest memory location and the most significant byte to the highest memory location... little endian. So, it's storing little endian to the file.
But, it basically reverses this process when returning `cd_values` to a reader here... using this logic...

```c
#define UINT32DECODE(p, i) {                           \
    (i)  =  (uint32_t)(*(p) & 0xff);        (p)++;     \
    (i) |= ((uint32_t)(*(p) & 0xff) <<  8); (p)++;     \
    (i) |= ((uint32_t)(*(p) & 0xff) << 16); (p)++;     \
    (i) |= ((uint32_t)(*(p) & 0xff) << 24); (p)++;     \
}
```
So, my read is that it's always storing little endian to the file but also effectively byte-swapping properly when the consumer or producer is NOT little endian.
@markcmiller86 Ah yes, thanks for the correction. Doing too many things at once. I agree, it appears we're writing the filter's `cd_values` to the file as little endian and then byte-swapping on read.
This is probably of interest, although a bit late: I did try to verify that `cd_values` are preserved correctly across machines of different endianness, by creating a mock filter plugin that sets `cd_values` and leaves data unchanged. Here it is: h5zcat.zip, together with an HDF5 file repacked with h5repack on a little endian machine and one repacked on a big endian machine.
Thanks so much @spanezz for the added test. Just what we needed. Cool way to generate the big-endian data too. Much easier than what I was doing: I actually installed development tools and HDF5 on an s390x Docker image and then built and ran the filter and its tests there.
I added to tests/ a `bigendian.h5` file generated on an s390x big endian machine, and then added a `test-endian1` test that checks whether `h5dump` on that file contains the `12345` values. This test fails on little-endian machines without #101 and succeeds with the patch in #101.