F2I-Consulting / fesapi

DevKit for ENERGISTICS™ data standards (mainly RESQML™), multi-languages (C++, Java, C#, Python)
Apache License 2.0

Enable an HDF chunk setter in HDF Proxy #279

Closed: philippeVerney closed 3 years ago

philippeVerney commented 3 years ago

Is your feature request related to a problem? Please describe.
My reader wants to read a compressed HDF dataset part by part, 1 MB by 1 MB for example. But FESAPI compresses an HDF dataset as a single giant chunk (up to 4 GB), which forces the reader to decompress the whole dataset over and over (on each 1 MB read). The performance penalty is critical.

Describe the solution you'd like
To be able to set the chunk size of a dataset at writing time.

Describe alternatives you've considered
Do not compress the dataset and read it in contiguous order.

Additional context
See https://discourse.f2i-consulting.com/t/hdf-maximum-chunk-size/173
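
For context, here is a minimal sketch at the raw HDF5 level (not FESAPI code) of why this matters: the chunk layout is fixed when a compressed dataset is created, and a reader can only decompress whole chunks. The dataset path, sizes, and function name below are illustrative assumptions only.

```cpp
#include <hdf5.h>
#include <algorithm>
#include <vector>

// Write a compressed 1D dataset whose chunks are capped at roughly 1 MiB,
// so a reader going "part by part" only pays for decompressing one chunk per read.
void writeChunkedCompressed(hid_t file, const std::vector<float>& values)
{
    const hsize_t dims[1] = { static_cast<hsize_t>(values.size()) };
    // Illustrative cap: 262144 floats * 4 bytes = 1 MiB per chunk.
    const hsize_t chunkDims[1] = { std::min<hsize_t>(dims[0], 262144) };

    hid_t space = H5Screate_simple(1, dims, nullptr);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunkDims); // chunk layout is decided at writing time
    H5Pset_deflate(dcpl, 5);          // gzip compression is applied per chunk

    hid_t dset = H5Dcreate2(file, "/values", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, values.data());

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
}
```

With a single multi-gigabyte chunk in place of chunkDims, every partial read would decompress the entire dataset, which is exactly the penalty described above.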

kmgoss commented 3 years ago

I would like to see this (as the issue submitter).

This link has some good points about the sensitivity of reader performance to chunk size: https://www.star.nesdis.noaa.gov/jpss/documents/HDF5_Tutorial_201509/2-2-Mastering%20Powerful%20Features.pptx.pdf

kmgoss commented 3 years ago

Before we upgraded to fesapi v0.16 a while back, we used to call a method setMaxChunkSize() in the HdfProxy class. With version 0.16 that method disappeared. Restoring that method would be the easiest way for us to set the max chunk size in a fesapi version 1.3.

philippeVerney commented 3 years ago

I am trying to find this method again but without success. Could you help me by pointing me to what has disappeared, please?

FYI, here is the official fesapi v0.16: https://github.com/F2I-Consulting/fesapi/tree/v0.16.0.0. I cannot find this method in https://github.com/F2I-Consulting/fesapi/blob/v0.16.0.0/src/common/HdfProxy.h nor in https://github.com/F2I-Consulting/fesapi/blob/v0.16.0.0/src/common/AbstractHdfProxy.h.

It is not very important, but I cannot remember this method, and out of curiosity (and a bit for traceability) I would be happy to see when, and perhaps why, it disappeared.

kmgoss commented 3 years ago

The method in question had disappeared in or before v0.16.

Our previous version of fesapi before v0.16 was v0.8. In v0.8 it is present in resqml2::HDF5Proxy.h.

The method is:

void setMaxChunkSize(const unsigned int & newMaxChunkSize) { maxChunkSize = newMaxChunkSize; }

I can see maxChunkSize accessed in void resqml2::HdfProxy::writeArrayNd().

I too am curious why it disappeared.

As you know, in v1.2.1 you now have a constant MAX_CHUNK_SIZE instead of a variable - perhaps a variable maxChunkSize could be used in place of that again?
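
A hypothetical reconstruction of what that swap could look like, based only on the v0.8 setter quoted above (this is not the actual fesapi source; the default value and the getter are assumptions):

```cpp
// Hypothetical sketch: expose a per-proxy setter again and let it drive the
// chunking in writeArrayNd() instead of the compile-time constant MAX_CHUNK_SIZE.
class HdfProxy {
public:
    // Same shape as the v0.8 setter quoted above.
    void setMaxChunkSize(unsigned int newMaxChunkSize) { maxChunkSize = newMaxChunkSize; }

    unsigned int getMaxChunkSize() const { return maxChunkSize; }

private:
    // Consulted when computing chunk dimensions at writing time; the 1 MB
    // default here is only an assumption for the sketch.
    unsigned int maxChunkSize = 1024 * 1024; // bytes
};
```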

philippeVerney commented 3 years ago

Thanks @kmgoss

My previous forge (before GitHub) has shut down and disappeared, and I don't even know if I have a backup of the SVN anywhere... It would be too time-consuming for me to go back before v0.11, which does not have the chunk method... Pity.

Anyhow, yes, the idea is clearly to replace MAX_CHUNK_SIZE with a variable maxChunkSize.

kmgoss commented 3 years ago

We need to be able to set the max chunk size by the end of this quarter. We won't be upgrading to fesapi v2.0 in that time.

So we can either wait for a fesapi v1.3 with this feature in it, or modify the source code of our current version (v1.2.1.1) in the manner suggested above. The former is naturally preferable.

Sorry to press, but I need an idea for our planning: do you think we will have a v1.3 with the max chunk setting in that time frame?

philippeVerney commented 3 years ago

Hi @kmgoss

I cannot guarantee anything, but if you ask my opinion: I think it is going to be done before the end of this quarter. The only way you can really influence the FESAPI priorities is for your company to join the FESAPI initiative and vote for a particular task. The FESAPI initiative should work on FESAPI in February or March at the latest. Your vote there would carry strong weight on priorities. Without the FESAPI initiative, I can just try to fix issues when I find some "free" time.

On the other hand, feel free to push a PR on v2.0 (the master branch) and the corresponding backport on v1.2 (the v1.2.1.0 branch, now the nextv1 branch) if you work on it.

Sorry not to be able to answer "Yes of course, I'll do it right now", even if a part of me would like to.

philippeVerney commented 3 years ago

Hi @kmgoss,

I tried to solve your issue in this branch: https://github.com/F2I-Consulting/fesapi/tree/nextv1. I did a few tests on the C++ side and it looks like it is working as expected.

Could you take a look at this branch and let me know if I have understood your need correctly?

FYI, please note that FESAPI will now write chunks with a maximum size of 1 MB by default. So your code should remain exactly the same, but the output should be different. The 1 MB limit is indeed recommended by your consuming developers, probably because it is the default chunk cache size in the HDF5 library, which I consequently want to honor.
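
A minimal sketch of how a 1 MB byte budget might translate into chunk dimensions (assumed logic for illustration only, not necessarily what the nextv1 branch does):

```cpp
#include <cstddef>
#include <vector>

// Shrink the slowest-varying (first) dimension until the chunk fits under
// maxChunkSizeInBytes, which defaults to HDF5's 1 MB chunk cache size.
std::vector<unsigned long long> computeChunkDims(
    std::vector<unsigned long long> chunkDims, // starts as the full dataset dimensions
    std::size_t elementByteSize,
    std::size_t maxChunkSizeInBytes = 1024 * 1024)
{
    auto chunkBytes = [&]() {
        std::size_t bytes = elementByteSize;
        for (auto d : chunkDims) bytes *= d;
        return bytes;
    };
    while (!chunkDims.empty() && chunkDims[0] > 1 && chunkBytes() > maxChunkSizeInBytes) {
        chunkDims[0] = (chunkDims[0] + 1) / 2; // halve the slowest dimension
    }
    return chunkDims;
}
```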

kmgoss commented 3 years ago

Hello @philippeVerney

Thank you for that. We have inspected the code and it certainly seems to do what we need. We are currently on fesapi v1.2.1.0 (with no plans to upgrade in the foreseeable future), so we would need the same fix in a v1.3 release before we would be able to utilise it.

Thanks again

philippeVerney commented 3 years ago

Fixed in v1.2.2.0