Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License
47 stars, 16 forks

System.OverflowException #29

Closed ReikanYsora closed 1 year ago

ReikanYsora commented 1 year ago

Hi,

I am getting this error:
System.OverflowException
  at (wrapper managed-to-native) System.Object.__icall_wrapper_ves_icall_array_new_specific(intptr,int)
  at PureHDF.VFD.H5StreamDriver.ReadBytes (System.Int32 count) [0x00000] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VFD/H5StreamDriver.cs:87
  at PureHDF.VOL.Native.HeaderMessage..ctor (PureHDF.NativeContext context, System.Byte version, PureHDF.VOL.Native.ObjectHeader objectHeader, System.Boolean withCreationOrder) [0x003a2] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/HeaderMessage.cs:74
  at PureHDF.VOL.Native.ObjectHeader.ReadHeaderMessages (PureHDF.NativeContext context, System.UInt64 objectHeaderSize, System.Byte version, System.Boolean withCreationOrder) [0x0003d] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/ObjectHeader.cs:116
  at PureHDF.VOL.Native.ObjectHeader1..ctor (PureHDF.NativeContext context, System.Byte version) [0x00068] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/ObjectHeader1.cs:36
  at PureHDF.VOL.Native.ObjectHeader.Construct (PureHDF.NativeContext context) [0x00055] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/ObjectHeader.cs:81
  at PureHDF.NativeNamedReference.Dereference () [0x00058] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/Core/NativeNamedReference.cs:57
  at PureHDF.VOL.Native.NativeGroup.Get (System.String path, PureHDF.VOL.Native.H5LinkAccess linkAccess) [0x00000] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/Core/NativeGroup.cs:90
  at PureHDF.VOL.Native.NativeGroup.Get (System.String path) [0x00000] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/VOL/Native/Core/NativeGroup.cs:80
  at PureHDF.IH5GroupExtensions.Group (PureHDF.IH5Group group, System.String path) [0x00000] in /home/runner/work/PureHDF/PureHDF/src/PureHDF/API/IH5GroupExtensions.cs:82
  at HDF5Reader.Group (System.String groupPath) [0x00001] in C:\Users\CRJE160\Documents\Git\DragonflyPlayer-ADS\Assets\Scripts\HDF5.NET\HDF5Reader.cs:82

The H5 group I'm reading has a size that exceeds the maximum value of an Int32 (10936 x 5174 x 64 is roughly 3.62 billion, versus Int32.MaxValue = 2 147 483 647). I think the problem is in the stream driver's ReadBytes method (H5StreamDriver.ReadBytes in the stack trace): it takes an Int32, not a long / Int64.

Apollo3zehn commented 1 year ago

From the stack trace, it looks more like a negative value is being passed to the array constructor. Line 72 below shows that it is possible to get a negative value, but I am not 100% sure.

https://github.com/Apollo3zehn/PureHDF/blob/997ab2c930f7e7bb23a2e266dec91ef90a8b38e6/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/HeaderMessage.cs#L71-L74

Do you have a sample file with which I can reproduce the problem?

Apollo3zehn commented 1 year ago

OK, maybe you are right: the cast (int)paddingBytes may go wrong if the value of paddingBytes is greater than Int32.MaxValue.
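To illustrate the concern about the cast, here is a minimal sketch of what a narrowing cast does in C#. The class and method names are hypothetical and are not part of PureHDF; the sketch only assumes that the value being cast is an unsigned 64-bit integer, which the thread does not confirm.

```csharp
using System;

// Hypothetical helpers for illustration only; not PureHDF code.
public static class CastDemo
{
    // In an unchecked context the cast keeps the low 32 bits and
    // reinterprets them as a signed value, so large inputs can
    // silently become negative.
    public static int ToInt32Unchecked(ulong value)
    {
        return unchecked((int)value);
    }

    // In a checked context the same cast throws OverflowException
    // instead of wrapping around.
    public static int ToInt32Checked(ulong value)
    {
        return checked((int)value);
    }
}
```

For example, CastDemo.ToInt32Unchecked(10936UL * 5174UL * 64UL) returns -673664000, while CastDemo.ToInt32Checked with the same argument throws. This shows that a wrapped cast is one possible source of a negative count; it does not prove that this is what happens in HeaderMessage.cs.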

ReikanYsora commented 1 year ago

Unfortunately, I can't send you an example file, as I'm working on data that can't be shared. I will try to request the generation of an H5 file with fake data that has this problem. I'll also try to make some changes to check whether replacing the Int32 with an Int64 works.

Apollo3zehn commented 1 year ago

That would be great, thanks :-)

ReikanYsora commented 1 year ago

Okay... I'm going crazy. I'm working with a file that contains a one-dimensional dataset of size 10,943 with a 64-bit floating-point datatype. I have EXACTLY the same dataset in 2 files with EXACTLY the same data. The first one works, but the second one fails with exactly the same OverflowException...

In this case, the data volume is 1 x 10943 x 64 bits... which is an extremely small value, and yet I get an overflow. I don't understand...

Apollo3zehn commented 1 year ago

If you can debug to these lines https://github.com/Apollo3zehn/PureHDF/blob/997ab2c930f7e7bb23a2e266dec91ef90a8b38e6/src/PureHDF/VOL/Native/FileFormat/Level2/Level2A1/HeaderMessage.cs#L71-L74

then I think you will quickly find out the difference between both files.

ReikanYsora commented 1 year ago

A first problem is indeed solved when the index parameter is a long rather than an Int32 in all abstract and inherited classes. However, there is still a problem: from time to time, line 72 actually returns a negative value.

I'm still working to send you a file in which the problem occurs.

Can you send me an email address to which I can send you this file?

ReikanYsora commented 1 year ago

(I have created a fork containing only this project for testing and debugging, without .NET 6.0 support; I can compile with the 6.0 framework on my workstation.)

ReikanYsora commented 1 year ago

I just implemented a unit test with a problematic file. I can confirm that, when reading, I end up with a negative value. That is a problem because the code then tries to create a buffer of that size, which in my case is -4.

The problematic value is used here : H5FileHandleDriver line 67

System.OverflowException : 'Arithmetic operation resulted in an overflow.'

I'll keep investigating.
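The behaviour described above can be reproduced in isolation: in .NET, allocating an array with a negative length held in a variable throws System.OverflowException. The sketch below shows this, together with a hypothetical guard that reports the bad size explicitly. The names BufferDemo, ThrowsOverflow and AllocateChecked are illustrative assumptions, not PureHDF APIs.

```csharp
using System;

// Hypothetical helpers for illustration only; not PureHDF code.
public static class BufferDemo
{
    // Returns true if allocating a byte array of the given length
    // throws OverflowException, which is what the runtime does for
    // negative lengths.
    public static bool ThrowsOverflow(int count)
    {
        try
        {
            _ = new byte[count];
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }

    // Validates the size before allocating so that a corrupt or
    // misparsed length produces a descriptive error.
    public static byte[] AllocateChecked(long count)
    {
        if (count < 0 || count > int.MaxValue)
        {
            throw new ArgumentOutOfRangeException(
                nameof(count), count,
                "Buffer size must be between 0 and Int32.MaxValue.");
        }

        return new byte[count];
    }
}
```

BufferDemo.ThrowsOverflow(-4) returns true, which matches the exception type in the stack trace. A guard like AllocateChecked would not fix the parsing error, but it would make the failing value visible in the exception message.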

ReikanYsora commented 1 year ago

Just a question. I need to find the min and max values of a two-dimensional dataset. It takes me 10 minutes to read the dataset, whereas tools like HDFView take a few milliseconds to read the same volume of data. What's the best way, in .NET 2.1, to read this dataset as quickly as possible?

My dataset is a two-dimensional array of doubles (60000 x 500).

Thanks a lot for your help

Apollo3zehn commented 1 year ago

The normal way would be to simply call dataset.Read<T>(). It should only be slow if your chunk dimensions are strange or if you run it in Debug mode with one or more conditional breakpoints. In any other case it should read quite fast. If it is still slow, I would be interested in a sample file to investigate.

ReikanYsora commented 1 year ago

[screenshot of the dataset]

This is my dataset: 8261 x 361 (x 64 bits) => 190 862 144 bits

This is my code. I loop over the 6 datasets in the same group. For each dataset, I need to know the min and max values. I store each result in a dictionary, with the dataset name as the key and a Vector2 holding the min and max as the value.

It takes 13 minutes just for the first iteration, and more than an hour to build the full dictionary.

Dictionary<string, Vector2> results = new Dictionary<string, Vector2>();
INativeFile root = H5File.Open( Path, FileMode.Open, FileAccess.Read, FileShare.Read, false);
IH5Group group = root.Group("XXXXX"); //XXXXX is not the real value, of course :)
List<IH5Dataset> datasets = group.Children().OfType<IH5Dataset>().ToList();

foreach (IH5Dataset tempDataset in datasets)
{
    double[] tempReadDatas = tempDataset.Read<double>();
    float min = (float)tempReadDatas.Min();
    float max = (float)tempReadDatas.Max();

    results.Add(tempDataset.Name, new Vector2(min, max));
}

return results;
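As a side note on the computation itself, calling Min() and then Max() walks the array twice. A minimal single-pass sketch is shown below; the class and method names are hypothetical, NaN handling is not addressed, and this is unlikely to be the main cost compared with the file reads.

```csharp
using System;

// Hypothetical helper for illustration only; not PureHDF code.
public static class MinMax
{
    // Computes the minimum and maximum of a non-empty array in one pass.
    public static void Of(double[] values, out double min, out double max)
    {
        if (values == null || values.Length == 0)
        {
            throw new ArgumentException("Array must not be null or empty.", nameof(values));
        }

        min = values[0];
        max = values[0];

        for (int i = 1; i < values.Length; i++)
        {
            double value = values[i];

            if (value < min)
            {
                min = value;
            }
            else if (value > max)
            {
                max = value;
            }
        }
    }
}
```

Usage: MinMax.Of(tempReadDatas, out double min, out double max); the two results can then be cast to float as in the code above.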

When the app is built, this code needs more than 30 minutes to read 6 datasets... I'm going crazy!

I tested many different solutions:

My work uses sensitive data. I can't share my file :/

Apollo3zehn commented 1 year ago

Thanks for the info, I do not see anything wrong here. I will have a look into it tomorrow morning.

I cannot find a version for .NET Framework 2.1 (https://de.wikipedia.org/wiki/.Net-Framework#.NET_Framework_2.0) or do you mean .NET Core 2.1?

Could you test your code on a newer .NET version and see if it is still slow there?

Apollo3zehn commented 1 year ago

There are 361 read operations per dataset because of the chunk size ... maybe your system does not cache the file access? Or maybe you need to increase the chunk cache size of PureHDF from 1 MB to something higher like this:

root.ChunkCacheFactory = new SimpleChunkCache(byteCount: x * 1024 * 1024);

ReikanYsora commented 1 year ago

I work with ".NET Standard 2.1".

Thanks for your answer. Tomorrow I will test this code with different versions of the .NET Framework on another machine.

Apollo3zehn commented 1 year ago

Would you be able to share a sample file if I add a small console application to the project which takes a copy of an HDF5 file and replaces all data (datasets and attributes) with zeros and randomizes all link names? So instead of my-link the file would contain e.g. qU/J8UA.

If yes, would it be sufficient if I just add it to the master branch on git and you compile it yourself?

I plan to enable this functionality using a build variable, so you would need to enter something like dotnet run src/Anonymize/Anonymize.csproj /p:ENABLE_ANONYMIZE=true to run the console application. I hope this is fine with you :-)

ReikanYsora commented 1 year ago

This is a great idea.

Thank you for your help and for the quality of your support in any case. I'll continue my tests this morning!

ReikanYsora commented 1 year ago

I confirm that I have the same issue on another laptop, in a .NET Framework 4.7.2 console project with the same file.

This syntax doesn't work:

root.ChunkCacheFactory = new SimpleChunkCache(byteCount: 8 * 1024 * 1024);

Apollo3zehn commented 1 year ago

Sorry, my fault. It is root.ChunkCacheFactory = () => new SimpleChunkCache(byteCount: 8 * 1024 * 1024); Hopefully I will have the "Anonymize" application ready this evening.

ReikanYsora commented 1 year ago

Thanks! With 128 * 1024 * 1024, loading takes a few seconds.
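For context on the numbers, the arithmetic below shows how large one of these datasets is in memory compared with the default 1 MB cache mentioned earlier. It is only a sketch: the helper names are hypothetical, and sizing the cache so it can hold a whole dataset is an assumption made here as a rough heuristic, not a documented PureHDF recommendation.

```csharp
using System;

// Hypothetical helpers for illustration only; not PureHDF code.
public static class CacheSizing
{
    private const long BytesPerMebibyte = 1024L * 1024L;

    // Size in bytes of a two-dimensional dataset with fixed-size elements.
    public static long DatasetBytes(long rows, long columns, int bytesPerElement)
    {
        return checked(rows * columns * bytesPerElement);
    }

    // Number of whole mebibytes needed to hold byteCount bytes (rounded up).
    public static long MebibytesNeeded(long byteCount)
    {
        return (byteCount + BytesPerMebibyte - 1) / BytesPerMebibyte;
    }
}
```

For the 8261 x 361 dataset of doubles, CacheSizing.DatasetBytes(8261, 361, 8) gives 23857768 bytes, and CacheSizing.MebibytesNeeded of that value gives 23. That is well above a 1 MB cache and comfortably below the 128 MB setting that made loading fast.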

Apollo3zehn commented 1 year ago

Good that this helped.

Sorry, I won't be able to provide the data clearing application today :-( I will try again tomorrow.

ReikanYsora commented 1 year ago

No problem at all. Thank you so much for your help. This solution has already saved me a bit of time. I'll keep looking at it when I have a bit of time too.

Apollo3zehn commented 1 year ago

I have worked now on the anonymizer on this branch: https://github.com/Apollo3zehn/PureHDF/tree/feature/anonymizer

It can be run using the command dotnet run /src/PureHDF.Anonymizer/PureHDF.Anonymizer.csproj /property:Anonymize=true. A console window will open where you can enter the source file path. The process might be slow due to the same reasons as before (the chunk cache is too small). You might modify the line where the file is opened to replace the chunk cache factory (as described earlier) here:

https://github.com/Apollo3zehn/PureHDF/blob/57a7461d7faad8ac31bc0e0841300c8c664edecd/src/PureHDF.Anonymizer/Program.cs#L45

I have tested it on a small and a large file with multiple groups and attributes but I cannot promise that this program is already error free. You need .NET 7 to run the application or change this line

https://github.com/Apollo3zehn/PureHDF/blob/57a7461d7faad8ac31bc0e0841300c8c664edecd/src/PureHDF.Anonymizer/PureHDF.Anonymizer.csproj#L4

to the framework version you are using.

The anonymizer will create a copy, then scan your file and then replace all data with random data or zeros.

HDF View should still be able to open the file, however opening datasets will not be possible if they were compressed as the decompressing function gets confused by all the zeros.

It would be great if you could give it a try. It is not a problem if it is not working as your main problem seems to be solved. It would just be useful for me to understand the OverflowException and to find ways to further improve read performance.

ReikanYsora commented 1 year ago

Thank you very much, I will test this and try to generate an anonymous file.

In any case, I still have the overflow problem, and I even have a very small file that triggers it. I'll try to see if I can share it with you without "anonymization".

ReikanYsora commented 1 year ago

I have exactly the same issue when I load my file in the "Anonymizer":

[two screenshots]

Sorry, I'm French, so my IDE is in French too :)

Apollo3zehn commented 1 year ago

You are right, I forgot that you will still suffer from that overflow problem. It would be great if you can share a file. I am sure I will be able to quickly fix that problem.

ReikanYsora commented 1 year ago

Thank you so much

Apollo3zehn commented 1 year ago

Downloaded :-)

ReikanYsora commented 1 year ago

;)

Apollo3zehn commented 1 year ago

That was helpful. I guess this file was produced by a system that uses a rather old version of the HDF5 library, because internally the file uses version 2 of the "data layout message", whereas I am only able to produce version 3 of that message for testing. I.e. I wrote code to parse version 2 of that message but could never test it.

However, I now have a test for it, so this should not happen again. The error is resolved in the dev and feature/anonymizer branches.

I hope you are now able to read your file properly :-)

ReikanYsora commented 1 year ago

Thank you so much ! It works !!!

You're awesome :)