Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License

Support for writing H5 references #95

Closed. W-Sebastian closed this issue 1 month ago

W-Sebastian commented 1 month ago

First off, a very nice library with great potential; I love the lack of third-party dependencies compared with the alternatives.

I can't figure out whether references are supported when writing. I looked into the docs and also grepped for some keywords in the unit tests but didn't find anything meaningful. I also checked the checklist from the super issue and didn't notice this as a task (completed or not).

Example:

new H5File()
{
    ["data_set"] = new H5Dataset(...),
    ["another_dataset"] = new H5dataset(...)
    {
        Attributes = new()
        {
            ["link"] = <references for "data_set" here>
        }
    }
};

In this case, the link attribute should be an object reference to data_set.

I did notice that this case is supported when reading and works fine: the reference is read as a NativeObjectReference1, which can easily be resolved with an appropriate Get(...) call. But being able to write references back as well would be quite an important feature.

Thanks!

Apollo3zehn commented 1 month ago

I am glad this library is useful to you. Writing references has not been a priority yet. I will try to find a solution this evening. It might be a bit difficult because the referenced dataset must be written first so that its address (= the reference) is known. Alternatively, the addresses of the referenced datasets could be collected and written to the file at the very end of the write process.

W-Sebastian commented 1 month ago

Thank you, feel free to prioritize it as you see fit.

Indeed, I think the most elegant solution is to gather the references at the end of the process and resolve them, but that probably comes with a bit of complexity and/or a performance penalty. I suggest the following logic to reduce the impact:

save the address of each data set or group in a map (based on some kind of ID or the Path)
when a reference is encountered:
    if the map contains the referenced object
        write the address
    else
        write a null and save this address as an unresolved reference + the reference ID

if there are unresolved references:
    seek them in the file and update the null to the appropriate address from the map

The impact is minimal (just the extra map to maintain), and in most cases the extra pass will not be needed (if all references point only to previously written objects). Seeking through the file at the end has some performance overhead but only kicks in when needed. Additionally, reference and address resolution is completely encapsulated behind the API and can always be changed in the future.
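
A minimal sketch of the deferred-resolution bookkeeping described above could look roughly like this. This is not PureHDF code; all names (ReferenceWriter, RegisterObject, WriteReference, ResolvePending) are hypothetical:

using System.Collections.Generic;
using System.IO;

class ReferenceWriter
{
    // path -> file address of objects that have already been written
    private readonly Dictionary<string, long> _addressMap = new();

    // file offsets of placeholders that still need patching, plus their target path
    private readonly List<(long FileOffset, string TargetPath)> _pending = new();

    // Record the address of an object right after it has been written.
    public void RegisterObject(string path, long address) => _addressMap[path] = address;

    // Write a reference: emit the address if known, otherwise a null placeholder.
    public void WriteReference(BinaryWriter writer, string targetPath)
    {
        if (_addressMap.TryGetValue(targetPath, out var address))
        {
            writer.Write(address);
        }
        else
        {
            _pending.Add((writer.BaseStream.Position, targetPath));
            writer.Write(0L); // placeholder, patched in the final pass
        }
    }

    // Final pass: seek back and patch all unresolved references.
    public void ResolvePending(BinaryWriter writer)
    {
        foreach (var (offset, targetPath) in _pending)
        {
            var address = _addressMap[targetPath]; // throws if the target was never written
            writer.BaseStream.Seek(offset, SeekOrigin.Begin);
            writer.Write(address);
        }

        _pending.Clear();
    }
}

The final pass only runs when _pending is non-empty, so writes whose references all point to previously written objects pay nothing beyond the dictionary lookups.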

For cases where BeginWrite(...) is used, the process should be identical, but the extra pass will only happen before disposing.

The reference can then be defined by a string path (or any object which can be resolved to a path).

The example from before (with some additions) would become:

new H5File()
{
    ["data_set"] = new H5Dataset(data),
    ["another_dataset"] = new H5Dataset(data)
    {
        Attributes = new()
        {
            ["links"] = new[] { Reference.FromPath("data_set"), Reference.FromPath("group/data_set") }
        }
    },
    ["group"] = new H5Group()
    {
        ["data_set"] = new H5Dataset(data)
    }
};

Another approach, perhaps a bit more straightforward but quite limited, is to return the reference when using the writer API:

using var writer = file.BeginWrite("example.sc_h5");
IReference ref1 = writer.Write(new H5Dataset<double[][]>(), data);

This might be easier to implement, but I think it leaks some internals into the API, enables new bugs/error cases around the returned reference, and forces the user into the BeginWrite approach whenever references are used instead of leaving it an open choice. It would not support cross-references either.

Apollo3zehn commented 1 month ago

The reference-to-address map is already there for another purpose (hard links) and can be reused here. Right now I am experimenting with using the dataset object itself to create the reference instead of a string path:

// Arrange
var dataset1 = new H5Dataset(data: 1);
var dataset2 = new H5Dataset(data: 2);

var file = new H5File
{
    ["data1"] = dataset1,
    ["data2"] = dataset2,
    ["reference"] = new H5ObjectReference[] { dataset1, dataset2 }
};

Would this suit your needs or do you need the string path approach?

Apollo3zehn commented 1 month ago

The implementation was simpler than expected because the reference-to-address map was already in place; if it does not contain the referenced object, the object (group or dataset) is encoded on the fly to obtain the address:

https://github.com/Apollo3zehn/PureHDF/blob/0aaf1bb3343adabce2e286ea55f95c117b176360/src/PureHDF/VOL/Native/FileFormat/Level2/ObjectHeaderMessages/Datatype/DatatypeMessage.Writing.cs#L762-L770

v1.0.0-beta.22 includes this new feature. It would be great if you could test it. Thanks!
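
Not verified against beta.22 here, but a quick round-trip test along these lines should exercise the feature. The read side uses H5File.OpenRead, Dataset(...).Read<...>() and the Get(...) call mentioned earlier in this thread; treat the exact signatures as assumptions:

using PureHDF;

var dataset1 = new H5Dataset(data: 1);
var dataset2 = new H5Dataset(data: 2);

var file = new H5File
{
    ["data1"] = dataset1,
    ["data2"] = dataset2,
    ["reference"] = new H5ObjectReference[] { dataset1, dataset2 }
};

// write the file including the object references
file.Write("references.h5");

// read the references back and resolve the first one
using var reader = H5File.OpenRead("references.h5");
var references = reader.Dataset("reference").Read<NativeObjectReference1[]>();
var resolved = reader.Get(references[0]); // expected to resolve to /data1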

Apollo3zehn commented 1 month ago

Ah, I forgot to mention that this implementation does not protect against circular references.

Apollo3zehn commented 1 month ago

Sorry, I introduced a bug: the addresses returned by EncodeGroup and EncodeDataset were not put into the address map, so objects were potentially encoded multiple times.

v1.0.0-beta.23 includes a fix for this and also a circular reference protection.

W-Sebastian commented 1 month ago

Nice, I tested version beta.23 and it works perfectly. I do have a follow-up question: is the issue https://github.com/Apollo3zehn/PureHDF/issues/49 about writing fixed-size arrays as data (since right now arrays are always written as variable-length data), or is it about something else?

Apollo3zehn commented 1 month ago

Yes, with normal C# features there is no native way to express that a generic array type has a fixed length. With this new feature (https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-12#inline-arrays) it might become possible.

There is another (older and unsafe) way to specify array lengths using the fixed keyword: https://stackoverflow.com/questions/2881422/why-must-fixed-size-buffers-arrays-be-declared-unsafe
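
For illustration only (not PureHDF API), the two language features look roughly like this; Double3 and LegacyFixed are made-up names:

using System.Runtime.CompilerServices;

// C# 12 inline array: a safe, fixed-length buffer of three doubles (.NET 8+).
[InlineArray(3)]
public struct Double3
{
    private double _element0; // the runtime expands this to three consecutive elements
}

// Older alternative: a fixed-size buffer, which requires unsafe code and only works inside structs.
public unsafe struct LegacyFixed
{
    public fixed double Values[3];
}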

I want to investigate both to see how PureHDF can support the HDF5 array data type.