Closed: Blackclaws closed this issue 1 month ago.

I've run into a weird issue where a file written out by PureHDF cannot be opened properly by other tools. H5Web runs into an infinite loading loop, and h5dump in its default state just hangs as well. h5dump -n correctly dumps all the constituent entries, but when trying to dump individual datasets, h5dump stalls. Reading the dataset with PureHDF works.
Printing just the header works fine:
HDF5 "SOEA2405C2002-result-2024-05-17T10_55_36.h5" {
DATASET "/SystemState/ReferenceTxResult/ReferenceTxMeasurement/DCAMeasurement/EyeMeasurement/AveragePower" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SCALAR
ATTRIBUTE "Interpretation" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "Unit" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
}
}
So apparently the problem arises when reading string-type attributes:
h5dump -a /SystemState/ReferenceTxResult/ReferenceTxMeasurement/DCAMeasurement/EyeMeasurement/AveragePower/Unit SOEA2405C2002-result-2024-05-17T10_55_36.h5
HDF5 "SOEA2405C2002-result-2024-05-17T10_55_36.h5" {
ATTRIBUTE "Unit" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
The output stops there because h5dump hangs. Reading the dataset data works fine when suppressing attributes.
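For reference, if I recall the h5dump options correctly, attributes can be suppressed with -A 0:
h5dump -A 0 -d /SystemState/ReferenceTxResult/ReferenceTxMeasurement/DCAMeasurement/EyeMeasurement/AveragePower SOEA2405C2002-result-2024-05-17T10_55_36.h5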
One thing to note here is that I use the Unit and Interpretation attributes liberally throughout the file. However, I've used them before without running into these issues; now, with a larger result file, those attributes appear to be causing problems.
It also seems that datasets containing variable-length strings in this file are in general unreadable by h5dump, while they read fine with PureHDF.
Thanks for the bug report! Would it be possible for you to send an example file to purehdf-issue-88@m1.apollo3zehn.net? That would make the investigation much easier. Alternatively, a code snippet that reproduces the problem would also help.
Thanks :-)
Here is a minimal example that reproduces this problem:
using PureHDF;

var reproducibleProblem = new H5File()
{
    Attributes =
    {
        ["NX_class"] = "Nxcollection"
    },
};

for (uint i = 0; i < 120; i++)
{
    var result = new H5Group()
    {
        Attributes = { ["NX_class"] = "Nxcollection" },
        ["AMeasurementGroup"] = new H5Group()
        {
            Attributes = { ["NX_class"] = "Nxentry" },
            ["BMeasurementGroup"] = new H5Group()
            {
                ["CDataset"] = new H5Dataset(0d) { Attributes = { ["Interpretation"] = "a long string" } },
                ["DMeasurementGroup"] = new H5Group()
                {
                    ["E"] = new H5Dataset(0d) { Attributes = { ["Interpretation"] = "a different string" } }
                }
            }
        }
    };

    reproducibleProblem[$"channel_{i}"] = result;
}

reproducibleProblem.Write("reproducible.h5");
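Running h5dump on the resulting file then hangs as described above:
h5dump reproducible.h5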
Removing any of the datasets or groups, making the strings identical, or using much shorter attribute strings makes h5dump work again, so this seems to be a sort of minimal configuration. Fewer iterations also make it work again.
The same effect can be reproduced with 8 iterations of the loop and much more content within the groups/datasets, so I'm guessing we're overrunning some buffer here.
If you remove the NX_class attributes, the file still breaks h5dump but no longer breaks h5web.
Related issue: https://github.com/silx-kit/h5web/issues/1645
Here are two minimal repro files, one without the Nxentry attribute (see the related issue): repro.zip
Both fail in h5dump; the one without Nxentry in its attributes opens in h5web but not in h5dump.
There are some places where PureHDF casts from uint to int and vice versa, which might be a source of this kind of error. I will investigate the problem tomorrow or on Monday. Thanks for the minimal reproduction example!
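As a hypothetical illustration of the kind of error meant here (not the actual cause, as it turned out below): an unchecked narrowing cast lets a large unsigned size silently become negative.
using System;

// hypothetical example: a size value larger than int.MaxValue
uint size = 0x9000_0001;

// unchecked narrowing cast: the value silently wraps around
int narrowed = (int)size;

Console.WriteLine(narrowed); // prints -1879048191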
It was a stupid error in the serialization of the global heap collections (see 396bbe3), which was triggered when a global heap collection was full and another one was about to be created. The collections hold all variable-length data, mostly strings.
v1.0.0-beta.18 should solve your problem :-)
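To pick up the fix, the package reference just needs to be updated, e.g.:
dotnet add package PureHDF --version 1.0.0-beta.18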
In case it matters: already-created files can be repaired (some zeros need to be replaced with the value 0x10 at certain positions within the file; see the script below).
Could you give me a hint which zeros? :D And thanks a lot for the quick turnaround
With the following code I was able to repair the reproducible.h5 file created by your script above. Please apply this script only to copies of your data, because it is not well tested.
using System;
using System.IO;
using System.Text;

var brokenFile = "<path to file>";

var fileStream = File.Open(brokenFile, FileMode.Open, FileAccess.ReadWrite);
using var binaryReader = new BinaryReader(fileStream);

// scan the whole file byte-by-byte for the "GCOL" signature of a global heap collection
for (long i = 0; i < binaryReader.BaseStream.Length; i++)
{
    var currentByte = binaryReader.ReadByte();

    if (currentByte == 'G')
    {
        var magicString = Encoding.ASCII.GetString(binaryReader.ReadBytes(3));

        // we found a global heap collection
        if (magicString == "COL")
        {
            Console.WriteLine("Found global heap collection at offset 0x" + i.ToString("X"));
            var collectionSize = RepairGlobalHeapCollection(binaryReader);

            // skip past the collection; the -1 compensates for the loop's i++
            i += (long)collectionSize - 1;
            binaryReader.BaseStream.Position = i + 1;
        }

        // false alarm
        else
        {
            binaryReader.BaseStream.Position -= 3;
        }
    }
}
ulong RepairGlobalHeapCollection(BinaryReader driver)
{
    // version
    var version = driver.ReadByte();

    // reserved
    driver.ReadBytes(3);

    // collection size (includes the 16-byte collection header)
    var collectionSize = driver.ReadUInt64();
    var headerSize = 8UL + 8UL;
    var remaining = collectionSize;

    // walk the heap objects; remaining still includes the header,
    // so object bytes are left as long as remaining > headerSize
    while (remaining > headerSize)
    {
        var before = driver.BaseStream.Position;
        var globalHeapObject = GlobalHeapObject.Decode(driver);

        // Global Heap Object 0 (free space) can appear at the end of the collection.
        if (globalHeapObject.ObjectIndex == 0)
        {
            // fix missing object 0 size: a full collection ends with a bare
            // 16-byte free-space object whose size field must be 0x10, not 0
            if (remaining == 32)
            {
                using var binaryWriter = new BinaryWriter(fileStream, Encoding.Default, leaveOpen: true);

                // the reader stopped right after the 2-byte object index;
                // skip reference count (2) and reserved (4) to reach the size field
                binaryWriter.BaseStream.Position += 6;
                binaryWriter.Write((ulong)0x10);

                Console.WriteLine("Repaired global heap collection");
            }

            else
            {
                Console.WriteLine("The global heap collection does not need to be repaired");
            }

            break;
        }

        var after = driver.BaseStream.Position;
        var consumed = (ulong)(after - before);
        remaining -= consumed;
    }

    return collectionSize;
}
internal readonly record struct GlobalHeapObject(
    ushort ObjectIndex,
    ushort ReferenceCount,
    byte[] ObjectData
)
{
    public static GlobalHeapObject Decode(BinaryReader driver)
    {
        // heap object index
        var heapObjectIndex = driver.ReadUInt16();

        if (heapObjectIndex == 0 /* free space object */)
        {
            return new GlobalHeapObject(
                ObjectIndex: default,
                ReferenceCount: default,
                ObjectData: Array.Empty<byte>()
            );
        }

        // reference count
        var referenceCount = driver.ReadUInt16();

        // reserved
        driver.ReadBytes(4);

        // object size
        var objectSize = driver.ReadUInt64();

        // object data, padded to the next multiple of 8 bytes
        var objectData = driver.ReadBytes((int)objectSize);

        var paddedSize = (int)(Math.Ceiling(objectSize / 8.0) * 8);
        var remainingSize = paddedSize - (int)objectSize;
        driver.ReadBytes(remainingSize);

        return new GlobalHeapObject(
            ObjectIndex: heapObjectIndex,
            ReferenceCount: referenceCount,
            ObjectData: objectData
        );
    }
}
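After running the script on a copy of a broken file, h5dump, which previously hung, makes for a simple check that the repair worked:
h5dump reproducible.h5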