Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License

Shuffle filter breaks roundtripping (seen in 3d array of variable-length lists) #73

Closed marklam closed 2 months ago

marklam commented 2 months ago

I've updated my repro-repo to demonstrate:

https://github.com/marklam/Roundtrip3DArrayOfStructList/tree/01847ceae628323832465e8a94841c1b4cab4286

It seems that if the dataset is created with the shuffle filter enabled (it doesn't need to be compressed), then the data read back is misaligned.

Apollo3zehn commented 2 months ago

The error was not detected earlier because I only had tests for the hardware-accelerated version of the shuffle algorithm for type sizes of 1, 2, 4 and 8 bytes. When using variable-length data, the actual data is stored in the global heap (without compression!). What gets compressed are the references (= pointers) to the objects in the global heap. These references have a size of 16 bytes, and I did not have tests for this type size. It is important to test this case because a different shuffle algorithm is used for 16 bytes. The tests were missing because I previously did not know how to create proper test data using the original HDF5 library. I found a solution to that problem and have added some tests.
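For readers unfamiliar with the shuffle filter, here is a minimal scalar sketch in Python of the byte transposition it performs, applied to 16-byte elements like the global heap references described above. This is not PureHDF's code; the function names `shuffle` and `unshuffle` are hypothetical, and the handling of trailing bytes (copied unchanged) is an assumption of this sketch.

```python
def shuffle(data: bytes, element_size: int) -> bytes:
    """Group byte 0 of every element, then byte 1 of every element, and so on."""
    count = len(data) // element_size      # number of complete elements
    body = count * element_size            # bytes covered by complete elements
    out = bytearray(len(data))
    for j in range(element_size):
        # Byte j of each element lands in the j-th contiguous run of the output.
        out[j * count:(j + 1) * count] = data[j:body:element_size]
    out[body:] = data[body:]               # assumption: leftover bytes are copied as-is
    return bytes(out)


def unshuffle(data: bytes, element_size: int) -> bytes:
    """Inverse of shuffle: scatter each run back to its position in every element."""
    count = len(data) // element_size
    body = count * element_size
    out = bytearray(len(data))
    for j in range(element_size):
        out[j:body:element_size] = data[j * count:(j + 1) * count]
    out[body:] = data[body:]
    return bytes(out)


# Three fake 16-byte "references" with recognizable byte values.
references = bytes(range(48))
shuffled = shuffle(references, 16)
restored = unshuffle(shuffled, 16)
```

A roundtrip check such as `restored == references` for an element size of 16 is the kind of test that was missing; a slow scalar version like this can serve as the reference when testing an accelerated implementation.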

The actual bug was caused by the fact that the hardware-accelerated shuffle implementation is auto-translated from C to C#, with the C code taken from the Blosc2 repository. The C equivalent of Avx2.Shuffle requires a reversed shuffle mask (as per a comment in the file in the Blosc2 repo), whereas the C# function Avx2.Shuffle requires the shuffle mask not to be reversed. The auto-translation procedure did not reverse the shuffle mask array, so the shuffle function produced garbage.
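To illustrate why the order of the mask matters, the following sketch emulates a byte-shuffle instruction on a single 16-byte lane in plain Python. The helper `byte_shuffle_lane` and the mask values are hypothetical and chosen only for illustration; they are not the masks used in Blosc2 or PureHDF.

```python
from typing import Sequence


def byte_shuffle_lane(src: bytes, mask: Sequence[int]) -> bytes:
    """Emulate a 16-byte lane shuffle: output byte i is src[mask[i] & 0x0F],
    or zero when the high bit of mask[i] is set."""
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("src and mask must both hold 16 bytes")
    return bytes(0 if m & 0x80 else src[m & 0x0F] for m in mask)


# Hypothetical mask that gathers every 4th byte (illustration only).
mask = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
src = bytes(range(16))

correct = byte_shuffle_lane(src, mask)        # mask consumed in the intended order
wrong = byte_shuffle_lane(src, mask[::-1])    # same values, opposite order
```

The same mask values read in the opposite order give a different permutation, so a translation step that copies the array literally between two conventions silently corrupts the output instead of raising an error.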

Here is the double-reversed shuffle mask for 16-byte types, which should work fine now:

https://github.com/Apollo3zehn/PureHDF/blob/2fd5ca9b252af75d20860416ef6b2e9b517c9159/src/PureHDF/Filters/ShuffleAvx2.cs#L235-L243

I fear that the error persists for data types > 16 bytes because yet another algorithm with a different shuffle mask is used for this kind of data. But again I have the problem that I do not know how to create proper test data. I created issue #75 to cover this.

PureHDF 1.0.0-beta.11 should solve the error for you :-) I will try to have a look into the other issue tomorrow.