Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License

Loading time in HDFView and in my C# code #62

Open XTRY1337 opened 7 months ago

XTRY1337 commented 7 months ago

Hi, sorry to create this topic as an issue, but I don't know a better way to contact you! I have some HDF5 files of about 70 MB each. HDFView takes only +/- 2 seconds to open a file, but in my C# code with PureHDF it takes +/- 6 seconds! Is this time difference normal? Or should the times usually be similar, meaning it's something I might be doing wrong in my code?

Apollo3zehn commented 7 months ago

Hi, this might be caused by bad chunk size. Could you please send me some info about the dataset (you can find it in HDF View) and show me your code?

If you are willing, you could also send me your file to purehdf_issue_61@m1.apollo3zehn.net, or a link to the file in case it is too big.
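
For reference, a quick sketch like the following should dump the relevant info. I am assuming the `H5File.OpenRead` entry point and the `Space.Dimensions` property here, and `yourfile.h5` is a placeholder path:

```csharp
using System;
using System.Linq;
using PureHDF;

// Sketch: list every top-level dataset with its dimensions so we can
// reason about layout and chunking. "yourfile.h5" is a placeholder.
using var file = H5File.OpenRead("yourfile.h5");

foreach (var dataset in file.Children().OfType<IH5Dataset>())
{
    var dims = string.Join(" x ", dataset.Space.Dimensions);
    Console.WriteLine($"{dataset.Name}: {dims}");
}
```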

Apollo3zehn commented 7 months ago

In general, PureHDF is not yet performance optimized (I always had performance in mind during development but there are still areas for improvement, e.g. more caching).

XTRY1337 commented 7 months ago

Dear friend, I sent the files to that email: purehdf_issue_61@m1.apollo3zehn.net. Let me know if you have any news. Thanks for your time!

Apollo3zehn commented 7 months ago

Thank you, I received it :-)

XTRY1337 commented 7 months ago

Let me know if you need anything!

Apollo3zehn commented 7 months ago

I had a look into the file and have some questions:

In the mail you said it takes ~ 5 seconds for PureHDF to load the data and only 2 seconds in HDF View. To reproduce that I would need to know which dataset exactly you mean, what the layout of that dataset is (chunked, contiguous, compact), its dimensionality, as well as its chunk layout (if applicable) and compression.

Or do you mean that HDF View takes 2 seconds to open the file and show the groups / datasets overview? That may be the case (it is on my computer) but HDF View has not loaded any data yet at that point. It only loads data when you double click a dataset. Simply showing the groups and datasets is way faster than also loading their contents. In that case PureHDF does a good job I'd say :-) Performance will probably improve when metadata cache is implemented.
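
To see the difference yourself, a rough sketch like this times the two phases separately (assuming `H5File.OpenRead` and a flat group/dataset structure; adapt the nesting to your file):

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using PureHDF;

// Sketch: time metadata enumeration and actual data reading separately.
// This is roughly what HDF View does at open time vs. on double click.
var sw = Stopwatch.StartNew();

using var file = H5File.OpenRead("exemplo.h5");

var datasets = file
    .Children()
    .OfType<IH5Group>()
    .SelectMany(group => group.Children().OfType<IH5Dataset>())
    .ToList();

Console.WriteLine($"Enumerate metadata: {sw.ElapsedMilliseconds} ms");
sw.Restart();

foreach (var dataset in datasets)
    _ = dataset.Read<double[]>();

Console.WriteLine($"Read all data: {sw.ElapsedMilliseconds} ms");
```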

XTRY1337 commented 7 months ago

When I talk about time, I mean from when the file starts loading until I have everything loaded in my instances. I'm not an expert on HDF5 files, but from what I can see I think every dataset is contiguous. I just created the file in Python in the simplest way possible.

So it makes sense that it would be much faster in HDF View because of this! Normally the structures that take the longest are Tde and Tdr, because they contain so many datasets spread across different groups.

And in general, for these simple files, do you think that's the best way to read the values?

When you refer to this metadata cache, is it something that will work automatically inside the package's functions, or something the user will have to enable in their code?

Apollo3zehn commented 7 months ago

You might get a little speed-up by parallelizing the reading:

private static List<List<List<List<double>>>> ReadTdStruct(string groupPath, NativeFile readFile)
{
    // Original sequential version for comparison:
    // foreach (IH5Group tree1 in readFile.Group(groupPath).Children().OfType<IH5Group>())
    // {
    //     List<List<List<double>>> TdSubList = new();
    //     foreach (IH5Group tree2 in tree1.Children().OfType<IH5Group>())
    //     {
    //         List<List<double>> TdSubSubList = new();
    //         foreach (IH5Dataset dataset in tree2.Children().OfType<IH5Dataset>())
    //             TdSubSubList.Add(dataset.Read<double[]>().ToList());

    //         TdSubList.Add(TdSubSubList);
    //     }
    //     TdList.Add(TdSubList);
    // }

    var level1Children = readFile
        .Group(groupPath)
        .Children()
        .OfType<IH5Group>()
        .ToList();

    // Each task writes into its own preallocated slot: List<T>.Add is not
    // thread-safe, and indexing into an array also preserves the group order.
    var results = new List<List<List<double>>>[level1Children.Count];

    Parallel.For(0, level1Children.Count, index =>
    {
        var level1Child = level1Children[index];

        List<List<List<double>>> TdSubList = new();

        foreach (IH5Group tree2 in level1Child.Children().OfType<IH5Group>())
        {
            List<List<double>> TdSubSubList = new();
            foreach (IH5Dataset dataset in tree2.Children().OfType<IH5Dataset>())
                TdSubSubList.Add(dataset.Read<double[]>().ToList());

            TdSubList.Add(TdSubSubList);
        }

        results[index] = TdSubList;
    });

    return results.ToList();
}

The metadata cache would be implemented behind the scenes and is nothing for the user to worry about. However, I need to first figure out how that would work in combination with multithreading and if the speed gain is worth it.

XTRY1337 commented 7 months ago

Using Parallel.For doubles the time for me. I debugged it, and inside the loop the code repeats the same instructions many times unnecessarily. But thanks for the try, I will wait for a new performance update.

Apollo3zehn commented 7 months ago

Interesting - for me it reduces the execution time from 1.4 s to 0.90 s with dotnet run -c Release and the following main method:

public static void Main()
{
    var time = Stopwatch.StartNew();
    LoadHdfFile("/home/vincent/Downloads/francisco/Example/exemplo.h5");
    Console.WriteLine(time.ElapsedMilliseconds);
}
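
If Parallel.For is slower on your machine, one possible cause is contention on the single shared file stream. A rough sketch of an alternative that gives each task its own handle (assuming `H5File.OpenRead`; `filePath` and `level1Paths`, the paths of the level-1 groups, are hypothetical inputs):

```csharp
// Sketch: open one file handle per parallel task so tasks do not
// contend for a single shared stream.
var results = new List<List<List<double>>>[level1Paths.Count];

Parallel.For(0, level1Paths.Count, index =>
{
    using var localFile = H5File.OpenRead(filePath);

    List<List<List<double>>> TdSubList = new();

    foreach (IH5Group tree2 in localFile.Group(level1Paths[index]).Children().OfType<IH5Group>())
    {
        List<List<double>> TdSubSubList = new();

        foreach (IH5Dataset dataset in tree2.Children().OfType<IH5Dataset>())
            TdSubSubList.Add(dataset.Read<double[]>().ToList());

        TdSubList.Add(TdSubSubList);
    }

    results[index] = TdSubList;
});
```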

XTRY1337 commented 7 months ago

Very strange, my code goes from about 9000 ms to 34000 ms when using Parallel.