Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License

Merging content from multiple files #69

Open Blackclaws opened 3 months ago

Blackclaws commented 3 months ago

I've tried naively to merge content from multiple h5 files using PureHDF.

var f1 = H5File.OpenRead("test-results-rx.h5");
var f2 = H5File.OpenRead("test-results-tx.h5");

var merged = new H5File();
merged["rx"] = f1.Group("rx");
merged["tx"] = f2.Group("rx");

merged.Write("merged.h5");

This, however, results in a file that just contains two scalars named after the groups. Is there any support for merging the content of two files in any way?

Apollo3zehn commented 3 months ago

That is because the read group is of type NativeGroup and the write group must be of type H5Group. They are different because PureHDF does not allow arbitrary modifications of existing HDF5 files, hence NativeGroup is readonly and H5Group is writable.

Your use case is interesting and I want to support your approach natively but I cannot implement it quickly because it requires reading arbitrary data into memory without having a predefined type on the C# side available. PureHDF allows reading compound data into C# dictionaries but there are more cases to cover (strings, numeric data, arrays) and that requires a little bit of time to work out properly.

So my advice for now is to iterate recursively over your groups by calling group.Children(), check the element type (group or dataset) and build your merged H5 file based on that. For datasets you need to read the data first; then you can append it to the H5 file to write.

Groups and datasets can have attributes, so they need to be accounted for as well (similar to datasets: read the data and add it to the file to write).
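Sketched out roughly (untested and written from memory, so the interface names and the exact form of the Read<T>() call are assumptions on my side and may differ between PureHDF versions), the manual merge could look like this:

using PureHDF;

// Hypothetical helper that copies a read-only group into a writable H5Group.
// H5File.OpenRead, Group(), Children() and the indexer assignment are the calls
// from this thread; IH5Group/IH5Dataset, child.Name and Read<T>() are assumed
// from the reading API.
static H5Group CopyGroup(IH5Group source)
{
    var target = new H5Group();

    foreach (var child in source.Children())
    {
        switch (child)
        {
            case IH5Group childGroup:
                target[child.Name] = CopyGroup(childGroup);
                break;

            case IH5Dataset dataset:
                // Assumption: plain 1-D double data. A general solution must
                // inspect the dataset's data type and shape first.
                target[child.Name] = dataset.Read<double[]>();
                break;
        }
    }

    // Attributes of the group (and of each dataset) would have to be read and
    // assigned to the target in the same way; omitted here.
    return target;
}

var f1 = H5File.OpenRead("test-results-rx.h5");
var f2 = H5File.OpenRead("test-results-tx.h5");

var merged = new H5File();
merged["rx"] = CopyGroup(f1.Group("rx"));
merged["tx"] = CopyGroup(f2.Group("tx"));

merged.Write("merged.h5");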

I have no access to a PC right now, but if you have any questions I will try to answer them this evening.

Blackclaws commented 3 months ago

For now I've managed to work around the problem by using the HDF5-CSharp package and calling functions directly from the C library that allow copying objects. I'll paste it when I get back home later.

Is arbitrary modification of files on the list of future things or completely out of scope? I'm asking because in our use case we use HDF5 files as a standard interchange format to transport output from various tests performed on a sample item by different software (written in Python/C#/etc.), and we then wish to merge them into a single result file. I know that there is a way to create external links to other files, which might be another way to merge them, but people do sometimes wish to download a single file and work with it.

I really enjoy the high level functionality that PureHDF offers and working with the low level functions from the C library via PInvoke is really difficult in comparison, so if that functionality does land in PureHDF at some point I'd be glad.

Is there a reason why you're treating reading files and writing files so differently from an architectural standpoint? As I understand it the writing functionality was implemented later, but couldn't an H5 file be read into memory the same way as is being done while writing, just with a raw datatype? That would alleviate the need for two parallel structures for reading and writing.

Apollo3zehn commented 3 months ago

Is arbitrary modification of files on the list of future things or completely out of scope?

Yes, modification of existing files is out of scope because it would significantly increase the complexity of an already quite complex project. It requires management of the free space that is created in the file when existing structures are modified (e.g. when the modified structure is larger than the existing one, it needs to be written somewhere else in the file). It is then also necessary to find and update all references which point to the modified structure. A group of structures which would be affected by this are the many different types of chunk indices.

so if that functionality does land in PureHDF at some point I'd be glad

I will look for a solution to merge files into a new one, just as you tried to do, but modification of existing files is out of scope as explained above.

Is there a reason why you're treating reading files and writing files so differently from an architectural standpoint

The design principle of the reading API is that every access may actually touch the file, so the I/O should stay explicit and also be available in an asynchronous form.

That is why the reading API does not use standard properties like group.Children but methods instead (group.Children() or group.ChildrenAsync(), respectively).

The writing API has different needs: the file structure is first assembled as an object graph in memory and only written to disk at the end, so elements must be freely assignable and readable back while the graph is being built.

This requires the use of readable and writable properties as well as C# indexers (for the dictionary-like assignment) instead of (async) methods.

So (unfortunately) there are fundamentally different requirements for both APIs.
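A minimal sketch of that contrast (the file names are placeholders, and child.Name is an assumption on my side):

using System;
using PureHDF;

// Reading API: explicit method calls, because every call may actually touch
// the file and therefore also needs an async counterpart (ChildrenAsync()).
var source = H5File.OpenRead("test-results-rx.h5");

foreach (var child in source.Group("rx").Children())
    Console.WriteLine(child.Name);

// Writing API: an object graph assembled in memory via properties and
// indexers and written to disk in one go at the end.
var output = new H5File();
output["rx"] = new H5Group();
output.Write("output.h5");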

but couldn't an H5 file be read into memory the same way as is being done while writing, just with a raw datatype

I am not sure what you mean by raw data type. If there were only plain sequences of bytes (byte[], int[], float[], double[], etc.) in the source HDF5 file, it would be trivial to treat the data as a byte array and copy it over to the target file. But as soon as strings or other variable-length data are involved, there are still byte arrays in the source file, but these are just pointers into the global heap, which consists of one or more regions in the H5 file where the referenced data is stored. The concept is similar to the distinction C#/.NET makes between value types (int, double, ...) and reference types (arrays, strings).

So how would we represent this (arbitrarily complex) data in C# memory? The logical approach would be to define a matching C# type and let PureHDF handle the deserialization into memory; from there we can write it to the target file. The problem is that for a general implementation, the matching C# type may not exist (for instance when we deal with (nested) compound data). PureHDF has a solution for this by reading everything into a dictionary of type Dictionary<string, object>. But not all kinds of data fit into a dictionary, e.g. scalars. So that is why I need some time to figure out how to cover all cases (primitive scalars, complex (= compound) scalars, arrays). This is possible but requires maybe 2 days of work. Unfortunately I am quite time-constrained right now so I cannot promise when I will find the time for it :-(
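For reference, the dictionary-based reading mentioned above looks roughly like this (treat it as a sketch; the exact form of the read call for compound data depends on the PureHDF version):

using System;
using System.Collections.Generic;
using PureHDF;

// Assumption: "compound.h5" contains a one-dimensional compound dataset named
// "my-compound". Each record comes back as a map of field name to boxed value.
var file = H5File.OpenRead("compound.h5");
var dataset = file.Dataset("my-compound");

var records = dataset.Read<Dictionary<string, object>[]>();

foreach (var record in records)
    foreach (var (field, value) in record)
        Console.WriteLine($"{field}: {value}");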

Blackclaws commented 2 months ago

So as promised here is the code snippet I currently use to combine two trees:

// Open both source files read-only and create the target file
long fileId = Hdf5.OpenFile("test-results-rx.h5", true);
long fileIdTwo = Hdf5.OpenFile("test-results-tx.h5", true);

long mergedFile = Hdf5.CreateFile("merged.h5");

// H5Ocopy copies an object (here a whole group) including all of its
// members and attributes into the target file
HDF.PInvoke.H5O.copy(fileId, "rx", mergedFile, "rx");
HDF.PInvoke.H5O.copy(fileIdTwo, "tx", mergedFile, "tx");

Hdf5.CloseFile(fileId);
Hdf5.CloseFile(fileIdTwo);

// Flush pending writes, then close the merged file
Hdf5.Flush(mergedFile, H5F.scope_t.GLOBAL);
Hdf5.CloseFile(mergedFile);

This uses the HDF5-CSharp and HDF.PInvoke packages.

This is possible but requires maybe 2 days of work. Unfortunately I am quite time-constrained right now so I cannot promise when I will find the time for it :-(

Please don't stress over this. Other features are much more relevant and this project is already great. I have a working solution that, while not perfect, deals well with my use case.

I'm also not too familiar with all the features the HDF5 standard supports, especially with regard to older versions, as I'm only now starting to introduce it as a default interchange format for lab results in our group (I've found your project to be very helpful here).

I think my naive view was just that you could read the entire HDF5 file, build an equivalent model to the write model and then allow modification on that, meaning that on write the whole file would be written to disk at once (basically iterating the original file and pushing its data into a new H5File object). I think this might be doable with some helper methods, though I'm not sure how easy it is to just dump the original dataset into the write model without specifying a specific type, as that is not really one of the intended use cases.

This of course also means that you would need enough memory to hold the complete content of the file being read, which might be an issue for use cases that deal with really large datasets.

Anyway as I said, please don't stress about this and put it at the bottom of the backlog if at all :)