Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License

Unable to open file written with PureHDF #88

Closed · Blackclaws closed this issue 1 month ago

Blackclaws commented 1 month ago

I've run into a weird issue where a file written out by PureHDF cannot be opened properly by other tools.

H5Web gets stuck in an infinite loading loop, and h5dump in its default mode just hangs as well.

h5dump -n correctly dumps all the constituent entries.

When trying to dump individual datasets, h5dump stalls:

DATASET "/SystemState/ReferenceTxResult/ReferenceTxMeasurement/DCAMeasurement/EyeMeasurement/AveragePower" {
   DATATYPE  H5T_IEEE_F64LE
   DATASPACE  SCALAR
   DATA {
   (0): 0.77
   }
   ATTRIBUTE "Interpretation" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR

Reading the dataset back with PureHDF works fine.

Blackclaws commented 1 month ago

Printing just the header works fine:

HDF5 "SOEA2405C2002-result-2024-05-17T10_55_36.h5" {
DATASET "/SystemState/ReferenceTxResult/ReferenceTxMeasurement/DCAMeasurement/EyeMeasurement/AveragePower" {
   DATATYPE  H5T_IEEE_F64LE
   DATASPACE  SCALAR
   ATTRIBUTE "Interpretation" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "Unit" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
}
}
Blackclaws commented 1 month ago

So apparently the problem arises when reading string-type attributes:

h5dump -a /SystemState/ReferenceTxResult/ReferenceTxMeasurement/DCAMeasurement/EyeMeasurement/AveragePower/Unit SOEA2405C2002-result-2024-05-17T10_55_36.h5
HDF5 "SOEA2405C2002-result-2024-05-17T10_55_36.h5" {
ATTRIBUTE "Unit" {
   DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLPAD;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SCALAR

Reading the dataset data works fine when attributes are suppressed.

Blackclaws commented 1 month ago

One thing to note here is that I use the Unit attribute and Interpretation attribute liberally throughout the file. However, I've used them before without running into these issues. Now, with a larger result file, those attributes appear to be causing problems.

It also seems that datasets containing variable-length strings in this file are generally unreadable by h5dump, while they read fine with PureHDF.
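For reference, the read-back described above looks roughly like this. This is a sketch assuming the v1.0.0-beta native read API (`H5File.OpenRead`, `Dataset`, `Attribute`, `Read<T>`); the method names are from memory and not verified against this exact version:

```csharp
using PureHDF;

// Open the file that h5dump chokes on and read the same scalar dataset
// plus its variable-length string attribute. PureHDF handles both.
using var file = H5File.OpenRead("SOEA2405C2002-result-2024-05-17T10_55_36.h5");

var dataset = file.Dataset(
    "/SystemState/ReferenceTxResult/ReferenceTxMeasurement/DCAMeasurement/EyeMeasurement/AveragePower");

var value = dataset.Read<double>();                   // scalar H5T_IEEE_F64LE
var unit = dataset.Attribute("Unit").Read<string>();  // variable-length string

Console.WriteLine($"{value} {unit}");
```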

Apollo3zehn commented 1 month ago

Thanks for the bug report! Do you have the possibility to send an example file to purehdf-issue-88@m1.apollo3zehn.net? This would make investigation much easier. Or, alternatively, if you have some code snippet so I can reproduce the problem.

Thanks :-)

Blackclaws commented 1 month ago

> Thanks for the bug report! Do you have the possibility to send an example file to purehdf-issue-88@m1.apollo3zehn.net? This would make investigation much easier. Or, alternatively, if you have some code snippet so I can reproduce the problem.
>
> Thanks :-)

Here is a minimal example that reproduces this problem:

using PureHDF;

var reproducibleProblem = new H5File()
{
    Attributes =
    {
        ["NX_class"] = "Nxcollection"
    },
};
for (uint i = 0; i < 120; i++)
{
    var result = new H5Group()
    {
        Attributes = { ["NX_class"] = "Nxcollection" },
        ["AMeasurementGroup"] = new H5Group()
        {
            Attributes = { ["NX_class"] = "Nxentry" },
            ["BMeasurementGroup"] = new H5Group()
            {
                ["CDataset"] = new H5Dataset(0d) { Attributes = { ["Interpretation"] = "a long string" } },
                ["DMeasurementGroup"] = new H5Group()
                {
                    ["E"] = new H5Dataset(0d) { Attributes = { ["Interpretation"] = "a different string" } }
                }
            }
        }
    };
    reproducibleProblem[$"channel_{i}"] = result;
}

reproducibleProblem.Write("reproducible.h5");

Removing any of the datasets or groups, making the strings identical, or using much shorter attribute strings makes h5dump work again. This seems to be a sort of minimal configuration.

Fewer iterations also make this work again.

The same effect can be reproduced with 8 iterations of the loop and much more content within the groups/datasets. I'm guessing we're overrunning some buffer here.

If you remove the NX_class attributes, it still breaks in h5dump but no longer breaks in H5Web.

Related issue: https://github.com/silx-kit/h5web/issues/1645
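The buffer guess is consistent with HDF5's global heap layout: variable-length attribute values are stored as global heap objects, and (assuming the spec's default 4096-byte collections, a 16-byte header per heap object, and object data padded to a multiple of 8 bytes) 120 iterations produce more string objects than one collection can hold, forcing a second collection to be created. A rough back-of-the-envelope check:

```csharp
// Assumptions: 4096-byte global heap collections (HDF5 default),
// 16-byte collection header, 16-byte header per heap object,
// object data padded to a multiple of 8 bytes, 1 byte per ASCII char.
const int collectionSize = 4096;
const int collectionHeader = 16;  // "GCOL" + version + reserved + size
const int objectHeader = 16;      // index + refcount + reserved + size

static int HeapObjectSize(string s) =>
    objectHeader + (s.Length + 7) / 8 * 8;

var perObject = HeapObjectSize("a long string");                  // 16 + 16 = 32 bytes
var capacity = (collectionSize - collectionHeader) / perObject;   // 127 objects
var needed = 120 * 2;  // two "Interpretation" strings per iteration = 240

Console.WriteLine($"{perObject} bytes/object, {capacity} fit, {needed} needed");
// needed > capacity, so a second collection must be created --
// consistent with the bug only appearing above a certain iteration count.
```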

Blackclaws commented 1 month ago

Here are two minimal repro files, one without the Nxentry attribute (see the related issue): repro.zip

Both fail in h5dump; the one without the Nxentry attribute opens in H5Web but not in h5dump.

Apollo3zehn commented 1 month ago

There are some places where PureHDF casts from uint to int and vice versa, which might be a source of this kind of error. I will investigate the problem tomorrow or on Monday. Thanks for the minimal reproduction example!

Apollo3zehn commented 1 month ago

It was a stupid error in the serialization of global heap collections (see 396bbe3), triggered when a global heap collection was full and another one was about to be created. The collections hold all variable-length data, mostly strings.

v1.0.0-beta.18 should solve your problem :-)
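For context, a sketch of the structure involved, based on the HDF5 file format spec's global heap section (sizes are assumptions from that spec, not from the PureHDF source): a collection with free space ends in a "free space" object whose heap object index is 0 and whose size field covers all remaining bytes, including its own 16-byte header. When exactly the header remains, that field must be 16 (0x10, the value the repair script below writes). The bug left it at 0, which plausibly explains why readers scanning the collection never advance past it and hang:

```csharp
// Layout of the 16-byte "object 0" (free space) header at the end of a
// global heap collection (little-endian, 8-byte lengths assumed):
//
//   bytes 0-1   heap object index (0 = free space)
//   bytes 2-3   reference count   (unused for object 0)
//   bytes 4-7   reserved
//   bytes 8-15  object size = ALL remaining free space, incl. these 16 bytes

static byte[] EncodeFreeSpaceObject(ulong freeBytes)
{
    var buffer = new byte[16];
    // index, reference count and reserved bytes stay zero
    BitConverter.GetBytes(freeBytes).CopyTo(buffer, 8);  // size field
    return buffer;
}

var correct = EncodeFreeSpaceObject(16);  // only the header left: size = 0x10
var broken  = EncodeFreeSpaceObject(0);   // what the buggy writer emitted

Console.WriteLine(BitConverter.ToUInt64(correct, 8)); // 16
Console.WriteLine(BitConverter.ToUInt64(broken, 8));  // 0
```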

Apollo3zehn commented 1 month ago

In case it matters: already-created files can be repaired (some zeros need to be replaced with the value 0xF at certain positions within the file).

Blackclaws commented 1 month ago

> In case it matters: already-created files can be repaired (some zeros need to be replaced with the value 0xF at certain positions within the file).

Could you give me a hint as to which zeros? :D And thanks a lot for the quick turnaround.

Apollo3zehn commented 1 month ago

With the following code I was able to repair the reproducible.h5 file created by your script above. Please only apply this script to copies of your data because it is not well tested.

using System.Text;

var brokenFile = "<path to file>";
var fileStream = File.Open(brokenFile, FileMode.Open, FileAccess.ReadWrite);

using var binaryReader = new BinaryReader(fileStream);

for (int i = 0; i < binaryReader.BaseStream.Length; i++)
{
    var currentByte = binaryReader.ReadByte();

    if (currentByte == 'G')
    {
        var magicString = Encoding.ASCII.GetString(binaryReader.ReadBytes(3));

        // we found a global heap collection
        if (magicString == "COL")
        {
            Console.WriteLine("Found global heap collection at offset 0x" + i.ToString("X"));

            var collectionSize = RepairGlobalHeapCollection(binaryReader);
            i += (int)collectionSize;
            binaryReader.BaseStream.Position = i;
        }

        // false alarm
        else
        {
            binaryReader.BaseStream.Position -= 3;
        }
    }
}

ulong RepairGlobalHeapCollection(BinaryReader driver)
{
    // version
    var version = driver.ReadByte();

    // reserved
    driver.ReadBytes(3);

    // collection size
    var collectionSize = driver.ReadUInt64();
    var headerSize = 8UL + 8UL; // GCOL header: signature (4) + version (1) + reserved (3) + collection size (8)
    var remaining = collectionSize;

    while (remaining > headerSize)
    {
        var before = driver.BaseStream.Position;
        var globalHeapObject = GlobalHeapObject.Decode(driver);

        // Global Heap Object 0 (free space) can appear at the end of the collection.
        if (globalHeapObject.ObjectIndex == 0)
        {
            // fix missing object 0 size
            if (remaining == 32)
            {
                using var binaryWriter = new BinaryWriter(fileStream, Encoding.Default, leaveOpen: true);
                binaryWriter.BaseStream.Position += 6; // skip reference count (2) + reserved (4) to land on the size field
                binaryWriter.Write((ulong)0x10);
                Console.WriteLine("Repaired global heap collection");
            }
            else
            {
                Console.WriteLine("The global heap collection does not need to be repaired");
            }

            break;
        }

        var after = driver.BaseStream.Position;
        var consumed = (ulong)(after - before);

        remaining -= consumed;
    }

    return collectionSize;
}

internal readonly record struct GlobalHeapObject(
    ushort ObjectIndex,
    ushort ReferenceCount,
    byte[] ObjectData
)
{
    public static GlobalHeapObject Decode(BinaryReader driver)
    {
        // heap object index
        var heapObjectIndex = driver.ReadUInt16();

        if (heapObjectIndex == 0 /* free space object */)
        {
            return new GlobalHeapObject(
                ObjectIndex: default,
                ReferenceCount: default,
                ObjectData: Array.Empty<byte>()
            );
        }

        // reference count
        var referenceCount = driver.ReadUInt16();

        // reserved
        driver.ReadBytes(4);

        // object size
        var objectSize = driver.ReadUInt64();

        // object data
        var objectData = driver.ReadBytes((int)objectSize);

        var paddedSize = (int)(Math.Ceiling(objectSize / 8.0) * 8);
        var remainingSize = paddedSize - (int)objectSize;
        driver.ReadBytes(remainingSize);

        return new GlobalHeapObject(
            ObjectIndex: heapObjectIndex,
            ReferenceCount: referenceCount,
            ObjectData: objectData
        );
    }
}