Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License
47 stars 16 forks

Too slow reading multiple datasets/groups #55

Closed XTRY1337 closed 5 months ago

XTRY1337 commented 5 months ago

Hello, I'm trying to read a nested group hierarchy in my HDF5 file, e.g. Group1/Group2/Group3/Dataset, with sizes such as: Group1 = 8, Group2 = 6, Group3 = 400, Dataset = 2 double values. Since there is no function that reads everything from a group directly, I have to write code that walks through all the groups, and this usually takes about 20-30 seconds before everything is loaded, because I read 2 structs of that type, which makes a total of 76,800 values. Is there any way/function in the library I can use to reduce the time this reading takes? My code: (attached as screenshot)
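For reference, the 76,800 figure follows directly from the sizes above; a quick sanity check (the variable names here are just illustrative):

```csharp
// Data volume implied by the sizes given above:
// 8 x 6 x 400 datasets, 2 doubles each, read for 2 structs.
int datasets = 8 * 6 * 400;          // 19,200 dataset opens per struct
int valuesPerStruct = datasets * 2;  // 38,400 doubles per struct
int total = valuesPerStruct * 2;     // two structs
Console.WriteLine(total);            // prints 76800
```

Note that this is only 76,800 doubles (~600 KB of raw data) spread over 19,200 tiny datasets per struct, so the time is dominated by per-dataset metadata lookups rather than by I/O volume.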

Apollo3zehn commented 5 months ago

Which version of PureHDF are you using? The newer ones no longer allow Read<double>().ToList(); Read<double[]>().ToList() must be used instead to get the previous behavior.
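As a minimal sketch of that API difference (the file name, dataset name, and values here are made up for illustration, not taken from your code):

```csharp
using System.Linq;
using PureHDF;

// Create a tiny file so the snippet is self-contained
// (declarative write API of the newer versions; names are illustrative).
var newFile = new H5File { ["data"] = new double[] { 1.0, 2.0 } };
newFile.Write("data.h5");

using var readFile = H5File.OpenRead("data.h5");
var dataset = readFile.Dataset("data");

// alpha versions:  dataset.Read<double>().ToList()  // no longer compiles
// newer versions:  the generic argument is the full result type:
double[] values = dataset.Read<double[]>();
var list = values.ToList();
```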

PureHDF is not yet optimized for high performance, especially when it comes to many small groups/datasets, since there is no metadata cache yet. Reading big datasets (with a reasonable chunk size) should be quite fast, though.

There is still some room for improvement in your code. We can dramatically reduce the number of file structure lookups by using the following code:

List<List<List<List<double>>>> ReadStruct(string groupPath, NativeFile readFile)
{
    List<List<List<List<double>>>> Group1 = new();

    // Resolve the top-level group once, then walk the tree through the
    // returned IH5Group handles instead of looking each child up by path.
    foreach (var group1 in readFile.Group(groupPath).Children().OfType<IH5Group>())
    {
        List<List<List<double>>> Group2 = new();
        foreach (var group2 in group1.Children().OfType<IH5Group>())
        {
            List<List<double>> Group3 = new();

            foreach (var dataset in group2.Children().OfType<IH5Dataset>())
            {
                // Read<double[]> returns the whole dataset in one call.
                Group3.Add(dataset.Read<double[]>().ToList());
            }

            Group2.Add(Group3);
        }

        Group1.Add(Group2);
    }

    return Group1;
}

Maybe this speeds things up a bit. If it does not help, an example file would be useful to see where the bottleneck is.

XTRY1337 commented 5 months ago

I'm using 1.0.0-alpha.25, but using .OfType<> improved the performance a lot, thank you very much for that. Keep up the good work.

Apollo3zehn commented 5 months ago

I am glad it works better now :-)

Note: When upgrading to one of the newer beta versions there will be a few breaking changes, such as the one mentioned above (double vs. double[]).

More info here: https://github.com/Apollo3zehn/PureHDF/releases/tag/v1.0.0-beta.1