Closed by spanezz 1 year ago
FYI... I am working on this. I don't have easy access to a big endian system, so I've found a Docker s390x container that is working. I am trying to duplicate the bad behavior that triggered this code change and so far haven't been able to, in spite of now having a big endian system.
For a minute, I was worried the two test files `test_zfp_be.h5` and `test_zfp_le.h5` were mistakenly both little endian. But they do not appear to be. I am trying to prove that to myself by using `od -j <skip> -e` to print the `original` dataset values and confirm that doing so on both files returns byte-swapped results. I did indeed do that, but finding the correct `skip` value to get to the file offset that represents the beginning of the `original` dataset's first chunk was a tad difficult. `h5ls` emits a dataset location as a pair of ints (e.g. `1:800`), and that is certainly not an offset into the file where the dataset starts.
Ok, well, I guess I should have read the initial comment here. First, thanks for adding this test! That is exactly what I was looking for. Ok, let me just confirm it's behaving as I expect for me too.
@brtnfld I just want to check... how does the HDF5 library handle endianness of the `cd_values` array stored to the dataset header? These values are consumed (read and written) by the HDF5 caller but are also stored to the file. I mean, does the lib choose an endianness for the `cd_values` data stored in the file and automatically byte-swap when `cd_values` are being read or written in a cross-endian context?
From Jordan:

Looks like the library does a UINT32ENCODE on the cd_values when writing out the filter pipeline message:

```c
/* Filter parameters */
for (j = 0; j < filter->cd_nelmts; j++)
    UINT32ENCODE(p, filter->cd_values[j]);
```

So I think that means the values will get stored as whatever the machine's endianness is that the file is written from, which may not be ideal.
Well, in looking at the definition of that macro in `H5Fprivate.h`...

```c
#define UINT32ENCODE(p, i) {                         \
    *(p) = (uint8_t)( (i)        & 0xff); (p)++;     \
    *(p) = (uint8_t)(((i) >>  8) & 0xff); (p)++;     \
    *(p) = (uint8_t)(((i) >> 16) & 0xff); (p)++;     \
    *(p) = (uint8_t)(((i) >> 24) & 0xff); (p)++;     \
}
```
That is writing the least significant byte to the lowest memory location and the most significant byte to the highest memory location... little endian. So, it's storing little endian to the file.
But, it basically reverses this process when returning `cd_values` to a reader here... using this logic...

```c
#define UINT32DECODE(p, i) {                           \
    (i)  =  (uint32_t)(*(p) & 0xff);        (p)++;     \
    (i) |= ((uint32_t)(*(p) & 0xff) <<  8); (p)++;     \
    (i) |= ((uint32_t)(*(p) & 0xff) << 16); (p)++;     \
    (i) |= ((uint32_t)(*(p) & 0xff) << 24); (p)++;     \
}
```
So, my read is that it's always storing little endian to the file but also effectively byte-swapping properly when the consumer or producer is NOT little endian.
@markcmiller86 Ah yes, thanks for the correction. Doing too many things at once. I agree, it appears we're writing the filter's `cd_values` to the file as little endian and then byte-swapping on read.
This is probably of interest, although a bit late: I did try to verify that `cd_values` are preserved correctly across machines of different endianness, by creating a mock filter plugin that sets `cd_values` and leaves data unchanged. Here it is: h5zcat.zip, together with an HDF5 file repacked with h5repack on a little endian machine and one repacked on a big endian machine.
Thanks so much @spanezz for the added test. Just what we needed. Cool way to generate the big-endian data too. Much easier than what I was doing: I actually installed development tools and HDF5 on an s390x Docker image and then built and ran the filter and its tests there.
I added to tests/ a `bigendian.h5` file generated on an s390x big endian machine, and then added a `test-endian1` test that checks whether `h5dump` on that file contains the `12345` values. This test fails on little-endian machines without #101 and succeeds with the patch in #101.