Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License

hyperslab and threads: what could go wrong? #19

Closed — fmarionpoll closed this issue 3 years ago

fmarionpoll commented 3 years ago

Hi all,

I am trying to read electrophysiology data from an H5 file, where data are stored as compressed chunks of 200 (data points along time) x 1028 (channels) ushort values. There is an enormous number of these chunks (about 3,000,000) (the chunk size is certainly not optimal, but this is what I get), and reading one channel of such data takes ages (i.e. 28-30 s from a regular disk).

While I can read data stored as compressed chunks either with a direct approach or by reading each chunk in turn (which then takes 50 s), I thought I could use multithreading to reduce the time needed to decompress the data, given that the computer I work with has 12 cores.

However, when doing so, I get errors such as groups not found, after a variable number of loops (typically 5 to 20). Any guess why this might occur?

Thank you for any help or cue, Fred

Here is the code that fails (it works, however, if I replace the "Parallel.For" loop with a regular "for" loop):

    public ushort[] ReadAll_OneElectrodeAsIntParallel(ElectrodeProperties electrodeProperties)
    {
        H5Group group = Root.Group("/");
        H5Dataset dataset = group.Dataset("sig");
        var nbdatapoints = dataset.Space.Dimensions[1]; // any size*
        const ulong chunkSizePerChannel = 200;
        var result = new ushort[nbdatapoints];
        var nchunks = (long)(nbdatapoints / chunkSizePerChannel);

        int ndimensions = dataset.Space.Rank;
        if (ndimensions != 2)
            return null;

        Parallel.For (0, nchunks, i =>
        {
            var istart = (ulong) i * chunkSizePerChannel;
            var iend = istart + chunkSizePerChannel - 1;
            if (iend > nbdatapoints)
                iend = nbdatapoints - 1;
            var chunkresult = Read_OneElectrodeDataAsInt(group, dataset, electrodeProperties.Channel, istart, iend);
            Array.Copy(chunkresult, 0, result, (int) istart, (int) (iend - istart + 1));
        });

        return result;
    }

Here is the code that works:

    public ushort[] ReadAll_OneElectrodeAsInt(ElectrodeProperties electrodeProperties)
    {
        H5Group group = Root.Group("/");
        H5Dataset dataset = group.Dataset("sig");
        int ndimensions = dataset.Space.Rank;
        if (ndimensions != 2)
            return null;
        var nbdatapoints = dataset.Space.Dimensions[1]; // any size*
        return Read_OneElectrodeDataAsInt(group, dataset, electrodeProperties.Channel, 0, nbdatapoints - 1);
    }

Here is the function called by both routines:

    public ushort[] Read_OneElectrodeDataAsInt(H5Group group, H5Dataset dataset, int channel, ulong startsAt, ulong endsAt)
    {
        var nbPointsRequested = endsAt - startsAt + 1;

        //Trace.WriteLine($"startsAt: {startsAt} endsAt: {endsAt} nbPointsRequested={nbPointsRequested}");

        var datasetSelection = new HyperslabSelection(
            rank: 2,
            starts: new[] { (ulong)channel, startsAt },         // start at row ElectrodeNumber, column 0
            strides: new ulong[] { 1, 1 },                      // don't skip anything
            counts: new ulong[] { 1, nbPointsRequested },       // read 1 row, ndatapoints columns
            blocks: new ulong[] { 1, 1 }                        // blocks are single elements
        );

        var memorySelection = new HyperslabSelection(
            rank: 1,
            starts: new ulong[] { 0 },
            strides: new ulong[] { 1 },
            counts: new[] { nbPointsRequested },
            blocks: new ulong[] { 1 }
        );

        var memoryDims = new[] { nbPointsRequested };
        var result = dataset
            .Read<ushort>(
                fileSelection: datasetSelection,
                memorySelection: memorySelection,
                memoryDims: memoryDims
            );

        return result;
    }
Apollo3zehn commented 3 years ago

HDF5.NET is not yet thread-safe. The main reason is that there is a single BinaryReader per instance. When you read in parallel, the base stream's position is changed in an uncoordinated way. A solution could be to replace the binary reader with a MemoryMappedFile (example: https://github.com/Apollo3zehn/UDBF.NET/blob/master/src/UDBF.NET/UDBFFile.cs#L210). This MMF reader should be integrated into the H5BinaryReader https://github.com/Apollo3zehn/HDF5.NET/blob/master/src/HDF5.NET/Core/H5BinaryReader.cs. MAYBE then it will work. Another workaround could be to open the file once per thread, so that each thread uses its own BinaryReader. Of course this has some overhead; with your small chunk size I don't think it will be a solution.

Apollo3zehn commented 3 years ago

It is just that I did not have the time yet to make the lib thread safe.

Apollo3zehn commented 3 years ago

If you open the file once per thread and then read multiple chunks per thread using a well-chosen hyperslab, maybe it would improve the performance.
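For illustration, a minimal sketch of this per-thread idea (the method name, partitioning and filePath parameter are assumptions, not existing code from this thread; it reuses the Read_OneElectrodeDataAsInt helper posted above and would sit in the same class):

    // Split the channel into one contiguous range per core; each worker opens
    // its own H5File instance, so no BinaryReader is shared between threads.
    public ushort[] ReadChannelPartitioned(string filePath, ElectrodeProperties electrodeProperties, ulong nbdatapoints)
    {
        var result = new ushort[nbdatapoints];
        var workerCount = (ulong)Environment.ProcessorCount;
        var pointsPerWorker = (nbdatapoints + workerCount - 1) / workerCount;

        Parallel.For(0, (int)workerCount, w =>
        {
            var istart = (ulong)w * pointsPerWorker;
            if (istart >= nbdatapoints)
                return;
            var iend = Math.Min(istart + pointsPerWorker, nbdatapoints) - 1;

            // one file handle (and thus one reader) per worker thread
            using var root = H5File.OpenRead(filePath);
            var group = root.Group("/");
            var dataset = group.Dataset("sig");

            var part = Read_OneElectrodeDataAsInt(group, dataset, electrodeProperties.Channel, istart, iend);
            Array.Copy(part, 0, result, (long)istart, part.Length);
        });

        return result;
    }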

fmarionpoll commented 3 years ago

This looks like an interesting workaround. I will look into how MemoryMappedFile works and how to adapt it to H5BinaryReader. I'll post if I find a way. Thank you for your suggestions.

Apollo3zehn commented 3 years ago

Maybe the multithreading approach alone will not give a significant performance boost because, in the end, the read speed depends on the disk speed. However, replacing the BinaryReader with a MMF (single threaded) would probably give a good performance enhancement because the number of system calls would be reduced significantly. In another project this gave a 10x performance boost, but that depends mostly on the file structure. Since HDF5 is read non-linearly and there are many jumps in the file, this could work here. In fact, I have already planned to implement it later.

fmarionpoll commented 3 years ago

Thank you Vincent. I have not yet found how reading hyperslabs is related to H5BinaryReader, as I am still not too familiar with C#.

The hyperslab Read routine leads to H5Dataset. In H5Dataset, chunk data are read through "H5D_Base bufferProvider", via "LayoutClass.Chunked => H5D_Chunk.Create(this, datasetAccess)". I guess that data are read through a call to H5D_Chunk.ReadChunk (or H5D_Chunk.Stream?). ReadChunk in turn calls:

    this.Dataset.Context.Reader.Read(buffer.Span);

or

    using var filterBufferOwner = MemoryPool<byte>.Shared.Rent((int)rawChunkSize);
    var filterBuffer = filterBufferOwner.Memory[0..(int)rawChunkSize];
    this.Dataset.Context.Reader.Read(filterBuffer.Span);
    H5Filter.ExecutePipeline(this.Dataset.InternalFilterPipeline.FilterDescriptions, filterMask, H5FilterFlags.Decompress, filterBuffer, buffer);

I have not yet found where and how Dataset.Context.Reader is created and how it relates to H5BinaryReader.

Apollo3zehn commented 3 years ago

The context encapsulates the binary reader and the superblock, as both are needed very frequently. The reader is created when the file is opened ( https://github.com/Apollo3zehn/HDF5.NET/blob/master/src/HDF5.NET/Core/H5File.cs#L52 ) and the context is created a few lines later. The reader is used to read all the metadata needed to locate the chunk that is about to be read. This process requires multiple read operations at different positions in the file, and each read operation potentially causes a slow system call. If you change the implementation of H5BinaryReader - which currently derives from BinaryReader - to the memory-mapped-file approach, there will hopefully be far fewer system calls and your chunks will be found quicker. I am not sure how much work such an implementation requires, but it should be doable. The right place would be the H5BinaryReader. To mimic the current behaviour, you would need to reimplement some BinaryReader functions like ReadUInt32 and similar methods. If you remove the base class from the H5BinaryReader and then try to compile, the error list will give you a hint of what to reimplement. Maybe I will find the time to implement a prototype in the next days; unfortunately I am quite busy right now.
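As a purely illustrative sketch of what such a reimplemented reader method could look like (class and member names are made up here, not the library's actual H5BinaryReader API; a real replacement would need all reader methods plus the existing Seek semantics):

    using System.IO.MemoryMappedFiles;

    // Hypothetical MMF-backed reader: it keeps its own position instead of
    // relying on a FileStream, so each read is a plain memory access.
    public sealed class MmfReaderSketch
    {
        private readonly MemoryMappedViewAccessor _accessor;

        public MmfReaderSketch(MemoryMappedViewAccessor accessor) => _accessor = accessor;

        public long Position { get; private set; }

        public void Seek(long offset) => Position = offset;

        public uint ReadUInt32()
        {
            // read directly from the mapped view at the current position
            var value = _accessor.ReadUInt32(Position);
            Position += sizeof(uint);
            return value;
        }
    }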

Apollo3zehn commented 3 years ago

OK, I found a VERY simple solution (branch "feature/simple-mmf"): https://github.com/Apollo3zehn/HDF5.NET/commit/d6e9193d5bc2f6d2bc8a08f34c45aed01737a1a7

I guess this implementation is still not thread-safe, but at least the number of system calls should be reduced. Could you please test whether there are any performance improvements? Additionally, it could be that the memory mapped file must be disposed properly to release file access after usage, but that would be easy to implement.

Apollo3zehn commented 3 years ago

You actually do not need to modify the source code: just create the memory mapped file stream first (https://github.com/Apollo3zehn/HDF5.NET/blob/d6e9193d5bc2f6d2bc8a08f34c45aed01737a1a7/src/HDF5.NET/Core/H5File.cs#L43-L46) and then pass it to the file open method (https://github.com/Apollo3zehn/HDF5.NET/blob/master/src/HDF5.NET/API/H5File.cs#L44).

fmarionpoll commented 3 years ago

Yes, I have just compiled it on my local copy/fork and ran it. The time necessary to read a whole channel as before did not decrease substantially - previously, reading 130 s of one channel sampled at 20 kHz took between 25 and 29 s; with the modification, it takes between 24 and 29 s, so there seems to be a slight improvement. I will check tomorrow whether this version works better with threads. One remark though: while reading data, the garbage collector (GC - the yellow band on the process memory graph in Visual Studio) is working constantly (in this and in the previous version). I will continue working on it tomorrow.

PS my code to read:

    public ushort[] ReadDataForOneElectrode(ElectrodeProperties electrodeProperties)
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        var result1 = fileReader.ReadAll_OneElectrodeAsInt(electrodeProperties);
        stopwatch.Stop();
        Trace.WriteLine("Elapsed time -direct- is " + (stopwatch.ElapsedMilliseconds / 1000).ToString("0.###") + " s");
        return result1;
    }

HDF5.NET fork:

    internal static H5File OpenCore(string filePath, FileMode fileMode, FileAccess fileAccess, FileShare fileShare, bool deleteOnClose = false)
    {
        //var absoluteFilePath = System.IO.Path.GetFullPath(filePath);
        //var stream = System.IO.File.Open(absoluteFilePath, fileMode, fileAccess, fileShare);

        //return H5File.OpenCore(stream, absoluteFilePath, deleteOnClose);
        var absoluteFilePath = System.IO.Path.GetFullPath(filePath);
        var fileStream = System.IO.File.Open(absoluteFilePath, fileMode, fileAccess, fileShare);

        var mmf = MemoryMappedFile.CreateFromFile(fileStream, null, 0, MemoryMappedFileAccess.Read, HandleInheritability.None, leaveOpen: true);
        var mmfStream = mmf.CreateViewStream(0, 0, MemoryMappedFileAccess.Read);

        return H5File.OpenCore(mmfStream, absoluteFilePath, deleteOnClose);
    }
Apollo3zehn commented 3 years ago

Do you have a sample file I could test against? If MMF is not helping, reducing the number of GCs would be the next step, either by better caching or by using more stack variables. I will have some time on Monday to test it.

fmarionpoll commented 3 years ago

Currently, my test data file is here (2.1 Gb). Data are under the directory "sig": https://filesender.renater.fr/?s=download&token=0c373b20-a256-430c-b9e8-a74b0281cbbd

In the meantime, I'll do further tests to make sure the change I made to HDF5.NET is really used, etc.

PS just a thought: the data here are chunks of 200 words x 1028 channels, compressed. Maybe it would also help to rewrite such data with a different chunk layout.

fmarionpoll commented 3 years ago

Hi, I have integrated reading data from the file mentioned in the earlier message into my HDF5.NET fork as a test, and looked at how much time is spent reading a chunk and deflating it, using dotTrace (new to me). Actually, it seems deflating takes the most time in HDF5.NET (DeflateFilterFunc = 24 s, 7.1%), followed by ReadChunk (6 s, 1.7%), while 91% of the time (316 s) is spent in "stack traces without user methods". Reading one channel (n° 863) takes 32 s.

Here is the test:

using System;
using System.Threading.Tasks;
using Xunit;
using Xunit.Abstractions;

namespace HDF5.NET.Tests
{
public class ReadH5MaxwellFileTests
{
    private readonly ITestOutputHelper _testOutputHelper;
    private static H5File Root { get; set; }

    public record ElectrodeProperties(
        int Electrode,
        int Channel,
        double XuM,
        double YuM);

    public ReadH5MaxwellFileTests(ITestOutputHelper testOutputHelper)
    {
        _testOutputHelper = testOutputHelper;
    }

    [Fact]
    public void OpenAndReadH5MaxwellFileTest()
    {
        var localFilePath = "E:\\2021 MaxWell\\Trace_20210715_16_54_48_1mM(+++).raw.h5";
        if (!OpenReadMaxWellFile(localFilePath))
            throw new Exception();

        var electrodeProperties = new ElectrodeProperties(0, 683, 0, 0);
        var result1 = ReadAll_OneElectrodeAsInt(electrodeProperties);
    }

    public bool OpenReadMaxWellFile(string fileName)
    {
        Root = H5File.OpenRead(fileName);
        return Root != null;
    }

    public ushort[] ReadAll_OneElectrodeAsInt(ElectrodeProperties electrodeProperties)
    {
        H5Group group = Root.Group("/");
        H5Dataset dataset = group.Dataset("sig");
        int ndimensions = dataset.Space.Rank;
        if (ndimensions != 2)
            return null;
        var nbdatapoints = dataset.Space.Dimensions[1];
        return Read_OneElectrodeDataAsInt(dataset, electrodeProperties.Channel, 0, nbdatapoints - 1);
    }

    public ushort[] Read_OneElectrodeDataAsInt(H5Dataset dataset, int channel, ulong startsAt, ulong endsAt)
    {
        var nbPointsRequested = endsAt - startsAt + 1;
        var datasetSelection = new HyperslabSelection(
            rank: 2,
            starts: new[] { (ulong)channel, startsAt },         // start at row ElectrodeNumber, column 0
            strides: new ulong[] { 1, 1 },                      // don't skip anything
            counts: new ulong[] { 1, nbPointsRequested },       // read 1 row, ndatapoints columns
            blocks: new ulong[] { 1, 1 }                        // blocks are single elements
        );

        var memorySelection = new HyperslabSelection(
            rank: 1,
            starts: new ulong[] { 0 },
            strides: new ulong[] { 1 },
            counts: new[] { nbPointsRequested },
            blocks: new ulong[] { 1 }
        );

        var memoryDims = new[] { nbPointsRequested };
        var result = dataset
            .Read<ushort>(
                fileSelection: datasetSelection,
                memorySelection: memorySelection,
                memoryDims: memoryDims
            );

        return result;
    }

}

}

fmarionpoll commented 3 years ago

Here is the analysis by dotTrace of the call reading the channel via hyperslabs: [dotTrace screenshot]. I understand from this figure that reading one channel takes 31 s, most of which is taken by Read operations. Within Read, most of the time is taken by copy operations (31.78 s), which in turn spend quite some time deflating the data: [dotTrace screenshot]

If this interpretation is correct, then a good portion of the time reading such data is taken by decompression, meaning that using threads may help to reduce the read time.

fmarionpoll commented 3 years ago

Garbage collection is a minor part of the time spent. See below. Most of the time is spent decompressing within the function DeflateFilterFunction.

[dotTrace screenshot]

I am not sure, however, whether the functions called within DeflateFilter are included in "File I/O": [dotTrace screenshot]

If not, then most of the time is spent accessing the disk, and software improvements might only bring marginal gains.

Apollo3zehn commented 3 years ago

That's an interesting analysis. So the most time-consuming line is this one: https://github.com/Apollo3zehn/HDF5.NET/blob/master/src/HDF5.NET/Filters/H5Filter.cs#L165

You could replace the current implementation, which is based on Microsoft's deflate algorithm, with the one from "SharpZipLib".

It works similarly to what is shown here for BZip2 (https://github.com/Apollo3zehn/HDF5.NET/blob/master/tests/HDF5.NET.Tests/Utils/BZip2Helper.cs). Maybe their implementation is faster, i.e. hopefully they use vectorized methods (hardware intrinsics, https://devblogs.microsoft.com/dotnet/hardware-intrinsics-in-net-core/).
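For illustration only, a rough sketch of what a SharpZipLib-based inflate helper could look like (modelled loosely on the BZip2Helper pattern; the class and method names are made up here, it is not the filter-function signature HDF5.NET expects, and the zlib-header assumption should be verified against the actual chunk data):

    using System.IO;
    using ICSharpCode.SharpZipLib.Zip.Compression;
    using ICSharpCode.SharpZipLib.Zip.Compression.Streams;

    public static class SharpZipLibHelper
    {
        // Decompresses a zlib-wrapped deflate buffer into the provided target
        // buffer and returns the number of bytes written.
        public static int Inflate(byte[] compressed, byte[] target)
        {
            using var source = new MemoryStream(compressed);

            // Inflater() without arguments expects a zlib header in front of
            // the raw deflate data (assumed here for HDF5 deflate chunks).
            using var inflaterStream = new InflaterInputStream(source, new Inflater());

            var total = 0;
            int read;
            while ((read = inflaterStream.Read(target, total, target.Length - total)) > 0)
                total += read;

            return total;
        }
    }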

I agree that multithreading would be useful here to further improve performance. Unfortunately that cannot be implemented easily right now, except by opening the file multiple times and dividing the read process into e.g. 12 sub-reads to use each core, or by using memory mapped files inside the H5BinaryReader as suggested above. I tried to do such an implementation yesterday, but it required a bit more work and I had no more time left. To be clear: you can use a memory mapped file either as a stream (the solution shown above) or via a ViewAccessor, which allows random access, i.e. there is no file position variable that tracks the current read progress. This random access without a state variable would probably allow multithreading (because it is stateless). At the same time this requires some changes in the HDF5 lib to make use of that advantage, i.e. there are some "Seek" calls to the H5BinaryReader that don't make sense in combination with the ViewAccessor.
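To illustrate the difference, a small sketch of the stateless access a ViewAccessor provides (file name, offsets and thread count are placeholders, not values from this thread):

    using System.IO;
    using System.IO.MemoryMappedFiles;
    using System.Threading.Tasks;

    // Every read passes an absolute file offset, so there is no shared stream
    // position and concurrent reads cannot interfere with each other.
    using var mmf = MemoryMappedFile.CreateFromFile(
        "Trace.raw.h5", FileMode.Open, mapName: null, capacity: 0, MemoryMappedFileAccess.Read);
    using var accessor = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);

    Parallel.For(0, 4, i =>
    {
        var buffer = new byte[4096];
        accessor.ReadArray(i * 4096L, buffer, 0, buffer.Length); // placeholder offsets
    });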

I will try some of these solutions on Monday. Thanks for your trace analysis :-)

fmarionpoll commented 3 years ago

While this is not related to threading, looking at the code in H5Dataset.cs shows that all arrays are defined and used as ulong. I did not check how data are written there, but since the data stream from this particular acquisition is ushort, might we be introducing further delays by copying data from ushort to ulong arrays and then back to ushort arrays?

Apollo3zehn commented 3 years ago

I am not sure which line of code you refer to, but all data arrays are defined as byte[], Memory or Span, or as T[] in the case of generic methods. The ulong arrays are used to store the dataset dimensions.

I have had a look at SharpZipLib and DotNetZip; neither uses hardware intrinsics, but both promise much better performance than Microsoft's deflate implementation.

However I found a paper from Intel (https://ieeexplore.ieee.org/document/8712745/authors) and the corresponding source code (https://github.com/intel/isa-l/blob/94ec6026ce5ec9d163b3552190cdc3d26ffb09ab/igzip/igzip.c#L1513).

This looks promising to get high performance on Intel processors. I will also test that on Monday.

fmarionpoll commented 3 years ago

Oh, thanks for the reference to the deflate filter! This is indeed exciting.

Sorry if I was misled about the use of ulong. Apologies if I made mistakes! Please don't feel obliged to explain if I am wrong; I will search further.

I looked at the file H5Dataset.cs (https://github.com/Apollo3zehn/HDF5.NET/blob/master/src/HDF5.NET/Core/H5Dataset.cs) and at this:

1) memoryDims is ulong[] - line 93:

        internal T[]? Read<T>(
            Memory<T> buffer,
            Selection? fileSelection = default,
            Selection? memorySelection = default,
            ulong[]? memoryDims = default,
            H5DatasetAccess datasetAccess = default,
            bool skipTypeCheck = false,
            bool skipShuffle = false) where T : unmanaged

2) arrays defined as ulong[] - line 183 (I am not familiar with these notations and I might be wrong; I need to check that):

        Func<ulong[], Memory<byte>>? getSourceBuffer = bufferProvider.SupportsBuffer
            ? chunkIndices => bufferProvider.GetBuffer(chunkIndices)
            : null;

        Func<ulong[], Stream>? getSourceStream = bufferProvider.SupportsStream
            ? chunkIndices => bufferProvider.GetStream(chunkIndices)
            : null;

3) lines 230+ - it looked to me that arrays needed to be ulong, namely datasetDims, datasetChunkDims, memoryDims:

        /* copy info */
        var copyInfo = new CopyInfo(
            datasetDims,
            datasetChunkDims,
            memoryDims,
            memoryDims,
            fileHyperslabSelection,
            memoryHyperslabSelection,
            GetSourceBuffer: getSourceBuffer,
            GetSourceStream: getSourceStream,
            GetTargetBuffer: indices => buffer.Cast<T, byte>(),
            TypeSize: (int)this.InternalDataType.Size
        );

        HyperslabUtils.Copy(fileHyperslabSelection.Rank, memoryHyperslabSelection.Rank, copyInfo);

Apollo3zehn commented 3 years ago

You wrote:

> it looked to me that arrays needed to be ulong, namely datasetDims, datasetChunkDims, memoryDims

And they are indeed of type ulong[]:

[screenshots of the declarations of datasetDims, datasetChunkDims and memoryDims]

All three arrays only carry the dataset dimensions, not the data itself. They are not really performance relevant, and they need to be ulong because a dataset might be that large (if I remember the HDF5 spec correctly).

I tried to always use byte arrays or generic T arrays to explicitly avoid the conversion costs for the data itself.

The following syntax:

Func<ulong[], Memory<byte>>?

means that the value is a function that takes a ulong[] as input parameter (in this case the indices of the chunk to read) and returns a Memory<byte> (which is similar to a byte[]).

The type Func<ulong[], Stream>? does the same, except that the returned function yields a Stream, which operates on byte[] only. Streams do not support other types.

I hope this clears this up a bit :-)

Apollo3zehn commented 3 years ago

So Func<ulong[], Memory<byte>>? translates to

Memory<byte>? MyFunctionName(ulong[] chunkIndices)
{
   ...
}

The ? means that the return value might be null.
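For example, a hypothetical delegate of that type could be written and called like this (the 4096-byte buffer is just a placeholder, not a value from the library):

    using System;

    // The delegate may be null, e.g. when no buffer provider is available.
    Func<ulong[], Memory<byte>>? getSourceBuffer = chunkIndices =>
    {
        // look up or read the raw chunk addressed by chunkIndices;
        // here we simply return an empty placeholder buffer
        return new Memory<byte>(new byte[4096]);
    };

    var chunk = getSourceBuffer?.Invoke(new ulong[] { 0, 0 });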

fmarionpoll commented 3 years ago

thank you! (&& sorry)

fmarionpoll commented 3 years ago

Your suggestion to reopen the file was very helpful. With the code below, one channel is read in 6 to 8 s INSTEAD OF 28 to 30 s on my computer! It also helps on my laptop (7 to 10 s). This is a whopping improvement!!

Here is the source:

    public ushort[] ReadAll_OneElectrodeAsIntParallel(ElectrodeProperties electrodeProperties)
    {
        var h5Group = Root.Group("/");
        var h5Dataset = h5Group.Dataset("sig");
        var nbdatapoints = h5Dataset.Space.Dimensions[1];
        const ulong chunkSizePerChannel = 200; // where can I get/read this parameter?
        var result = new ushort[nbdatapoints];
        var nchunks = (long)(nbdatapoints / chunkSizePerChannel);

        int ndimensions = h5Dataset.Space.Rank;
        if (ndimensions != 2)
            return null;

        Parallel.For(0, nchunks, i =>
        {
            var fileName = FileName;
            var lRoot = H5File.OpenRead(fileName);
            var lgroup = lRoot.Group("/");
            var ldataset = lgroup.Dataset("sig");

            var istart = (ulong)i * chunkSizePerChannel;
            var iend = istart + chunkSizePerChannel - 1;
            if (iend > nbdatapoints)
                iend = nbdatapoints - 1;
            var chunkresult = Read_OneElectrodeDataAsInt(ldataset, electrodeProperties.Channel, istart, iend);
            Array.Copy(chunkresult, 0, result, (int)istart, (int)(iend - istart + 1));
            lRoot.Dispose();
        });

        return result;
    }

    public ushort[] Read_OneElectrodeDataAsInt(H5Dataset dataset, int channel, ulong startsAt, ulong endsAt)
    {
        var nbPointsRequested = endsAt - startsAt + 1;

        //Trace.WriteLine($"startsAt: {startsAt} endsAt: {endsAt} nbPointsRequested={nbPointsRequested}");

        var datasetSelection = new HyperslabSelection(
            rank: 2,
            starts: new[] { (ulong)channel, startsAt },         // start at row ElectrodeNumber, column 0
            strides: new ulong[] { 1, 1 },                      // don't skip anything
            counts: new ulong[] { 1, nbPointsRequested },       // read 1 row, ndatapoints columns
            blocks: new ulong[] { 1, 1 }                        // blocks are single elements
        );

        var memorySelection = new HyperslabSelection(
            rank: 1,
            starts: new ulong[] { 0 },
            strides: new ulong[] { 1 },
            counts: new[] { nbPointsRequested },
            blocks: new ulong[] { 1 }
        );

        var memoryDims = new[] { nbPointsRequested };
        var result = dataset
            .Read<ushort>(
                fileSelection: datasetSelection,
                memorySelection: memorySelection,
                memoryDims: memoryDims
            );

        return result;
    }
Apollo3zehn commented 3 years ago

That's good news!

Today I managed to compile the Intel ISA-L library and do a simple inflate (= reverse of deflate = decompression). Here is the code that works (https://github.com/Apollo3zehn/Intel.ISA-L.PInvoke/blob/main/tests/Intel.ISA-L.PInvoke.Tests/PInvokeTests.cs#L43-L93).

Tomorrow I will extend this and write a benchmark comparing Microsoft's implementation, SharpZipLib and Intel's solution.

Apollo3zehn commented 3 years ago

I have prepared the Nuget package Intrinsics.ISA-L.PInvoke, which contains P/Invoke signatures to use the vectorized deflate algorithm from Intel. To understand the performance improvement, I created a benchmark. The results can be found here. Depending on the buffer size, Intel's implementation is 2-15 times faster than the one from Microsoft. Surprisingly, the one from SharpZipLib is always the slowest.

To use Intel's deflate implementation, just follow the short guide here.

But make sure you are using the newest version of HDF5.NET, because I fixed a bug that prevented re-registering a filter function. With that bug present you would be unable to replace Microsoft's deflate algorithm. Either use the current Git release or the newest Nuget package.

Note: the helper function I have prepared (https://github.com/Apollo3zehn/HDF5.NET/blob/master/tests/HDF5.NET.Tests/Utils/DeflateHelper_Intel_ISA_L.cs) is not thread-safe because there is a single static inflate_state struct. To make it thread-safe, make sure there is one struct per thread. However, the struct is VERY large (~80 kB), so keep the number of threads low. Best would be to use one thread per CPU core instead of one thread per chunk! Otherwise there will be no performance improvement.

I have applied the new algorithm to your file (single threaded). Here are the results:

Microsoft DeflateStream - total time: 34.5 s

[dotTrace screenshot: Microsoft]

Intel ISA-L - total time: 16.4 s

[dotTrace screenshot: Intel]

As you can see, some time is spent in "System.Buffer._Memmove". To further optimize performance, we must reduce the number of copy operations (example 1, example 2). But I have no idea how, currently. The main problem is that the deflate stream does not store its uncompressed size, so we cannot preallocate an array of the correct size and must do some copy operations. One option would be to keep all created arrays and copy them only once to the final array (whose size is known). That could be done using the recently introduced ReadOnlySequence<T>. But that's nothing I will implement soon; rather later, when I make more in-depth performance analyses.
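Purely as an illustration of that last option (not the planned ReadOnlySequence<T> implementation), a sketch of the "collect the pieces first, copy each one only once into the final array" idea:

    using System;
    using System.Collections.Generic;

    internal static class ChunkAssembler
    {
        // Hypothetical helper: decompress every chunk into its own buffer first,
        // then copy each buffer exactly once into the preallocated result array.
        public static byte[] AssembleOnce(IReadOnlyList<byte[]> decompressedChunks, int totalSize)
        {
            var result = new byte[totalSize];
            var offset = 0;

            foreach (var chunk in decompressedChunks)
            {
                Buffer.BlockCopy(chunk, 0, result, offset, chunk.Length);
                offset += chunk.Length;
            }

            return result;
        }
    }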

fmarionpoll commented 3 years ago

Quite impressive! very well done!

fmarionpoll commented 3 years ago

On my test machine, reading one data channel (about 3 × 10^6 points) takes 27-28 s without the Intel filter and 17-18 s with it. I tried the memory-mapped approach, with no significant change whether the file is on a regular hard disk or on an SSD.

However, all this is still much slower than using one thread per chunk (i.e. not optimized), which reads the same data in 6-7 s.

PS: I don't know (yet) how to create one inflate_state struct per thread. PS: how can I un-register the Intel filter?

Apollo3zehn commented 3 years ago

I pushed an update for the DeflateHelper to the dev branch. The code is completely untested, therefore it's not yet part of the master branch. Hopefully the Intel filter + multithreading gives you the best read performance.

Apollo3zehn commented 3 years ago

The idea is to use the ThreadLocal<T> class to ensure one state struct per thread.
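A minimal sketch of that idea (InflateState is only a stand-in here for the real Intel ISA-L inflate_state struct, and the Inflate body is elided):

    using System.Threading;

    // Stand-in for the large native inflate_state struct (~80 kB).
    public struct InflateState
    {
        public int Placeholder;
    }

    public static class DeflateHelperThreadLocal
    {
        // The factory runs once per thread on first access, so every worker
        // thread gets its own state and nothing is shared during decompression.
        // The single-element array lets us hand the struct around by reference.
        private static readonly ThreadLocal<InflateState[]> _statePerThread =
            new ThreadLocal<InflateState[]>(() => new InflateState[1]);

        public static void Inflate(byte[] compressed, byte[] target)
        {
            ref InflateState state = ref _statePerThread.Value[0];
            // ... pass 'state' to the native ISA-L inflate call here ...
        }
    }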

Apollo3zehn commented 3 years ago

I'll create an update to allow unregistering filters. That's a use case I did not think about.

fmarionpoll commented 3 years ago

> I pushed an update for the DeflateHelper to the dev branch. The code is completely untested, therefore it's not yet part of the master branch. Hopefully the Intel filter + multithreading gives you the best read performance.

The DeflateHelper works very well. Now the figures are 27 s for a full single-threaded read (17 s with the Intel filter), and down to 4 s with threads.

The number of chunks handled per thread has a moderate influence on the time necessary to read the file. With chunks of 200 x 1028 channels, I get:

1 thread / 1 chunk = 5.1 s
1 thread / 10 chunks = 4.3 s
1 thread / 50 chunks = 4.1 s
1 thread / 100 chunks = 4.0 s
1 thread / 500 chunks = 4.5 s
1 thread / 1000 chunks = 4.7 s
1 thread / 10000 chunks = 13.9 s

Apollo3zehn commented 3 years ago

I think this is a great overall performance improvement. To get even more out of the system, we would need to do more in-depth investigations, I think. At least for me this is a task for the future :-) Thank you for testing my suggested changes. This also gave some more insight into multithreading requirements and possibilities.

fmarionpoll commented 3 years ago

It was fun, actually. And I am the one who is thankful!!! I learned a lot thanks to your suggestions. I would never have thought that so much could be gained with an optimized decompressor.

Apollo3zehn commented 1 year ago

This is a message I post to all recent issues: I have just renamed the project from HDF5.NET to PureHDF in preparation for an upcoming beta release. Please note that the Nuget package name has also changed; it can now be found here: https://www.nuget.org/packages/PureHDF.