Vindaar / nimhdf5

Wrapper and some simple high-level bindings for the HDF5 library for the Nim language
MIT License

Change default opening / closing behavior of wrapper #14

Open Vindaar opened 6 years ago

Vindaar commented 6 years ago

We need a nicer way to close each H5 object individually, without calling the native H5 functions.

In addition, groups and datasets should be closed by default after a read / write etc. procedure. We can introduce a locking flag that keeps an object open if the user desires. That is useful when one knows that several successive writes / reads of the same dataset will happen.
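A rough sketch of what "close by default, unless locked" could look like. Everything here is hypothetical and not the nimhdf5 API: `MockDataset`, `keepOpen`, `openHandle`, `closeHandle` and `writeData` are made-up names, and a boolean stands in for the real HDF5 handle.

```nim
# Hypothetical sketch, not the nimhdf5 API: a bool stands in for the HDF5 handle.
type
  MockDataset = ref object
    name: string
    opened: bool     # stands in for a live HDF5 dataset handle
    keepOpen: bool   # the "locking flag": skip the automatic close
    openCount: int   # how often the handle had to be (re)opened

proc openHandle(d: MockDataset) =
  ## Opens the handle only if it is not open already.
  if not d.opened:
    d.opened = true
    inc d.openCount

proc closeHandle(d: MockDataset) =
  d.opened = false

proc writeData(d: MockDataset, data: seq[int]) =
  ## Opens on demand and closes again afterwards unless `keepOpen` is set.
  d.openHandle()
  try:
    discard data.len   # a real wrapper would call into HDF5 here
  finally:
    if not d.keepOpen:
      d.closeHandle()

# Default behavior: every write opens and closes the dataset.
let a = MockDataset(name: "/run/energy")
a.writeData(@[1, 2, 3])
a.writeData(@[4, 5, 6])

# Locked: the handle survives between writes and is closed by hand.
let b = MockDataset(name: "/run/energy", keepOpen: true)
b.writeData(@[1, 2, 3])
b.writeData(@[4, 5, 6])
let stillOpen = b.opened
b.closeHandle()
```

With the flag unset, the two writes cost two opens and leave nothing open. With it set, the same two writes cost one open, at the price of the user having to close the object explicitly.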

Vindaar commented 6 years ago

Very much related: it's probably not a good idea to open every group, and especially every dataset, that we encounter when visiting the whole file. And getting the dataspace immediately is useless, too (I think...).

In principle we only need to open a dataset when we actually access the data in it. For everything else, we just need to keep track of the dataset's information in the file, e.g. datatype, shape etc. That's why we have an abstract interface in the first place...
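To make the idea concrete, here is a minimal sketch of lazy opening. It assumes the metadata can be recorded while visiting without keeping a handle around. All names (`MockFile`, `DsetInfo`, `addDataset`, `readData`) are invented for illustration and are not the nimhdf5 API; the "read" just returns zeros of the right size.

```nim
import std/tables

# Hypothetical sketch, not the nimhdf5 API: a bool stands in for the HDF5 handle.
type
  DsetInfo = ref object
    dtype: string     # metadata recorded while visiting the file
    shape: seq[int]
    opened: bool      # stands in for a live HDF5 dataset handle

  MockFile = object
    datasets: Table[string, DsetInfo]
    openHandles: int  # number of dataset handles currently open

proc addDataset(f: var MockFile, name, dtype: string, shape: seq[int]) =
  ## Called once per dataset while visiting: stores metadata, opens nothing.
  f.datasets[name] = DsetInfo(dtype: dtype, shape: shape)

proc readData(f: var MockFile, name: string): seq[int] =
  ## The first data access is the point where the handle gets opened.
  let d = f.datasets[name]
  if not d.opened:
    d.opened = true
    inc f.openHandles
  # A real wrapper would read from HDF5; here we return zeros of the right size.
  var n = 1
  for dim in d.shape:
    n *= dim
  result = newSeq[int](n)

var f = MockFile(datasets: initTable[string, DsetInfo]())
f.addDataset("/run/energy", "float64", @[100, 2])
f.addDataset("/run/time", "int64", @[100])

# Shape and datatype are available without any open handle.
let shape = f.datasets["/run/energy"].shape
let handlesAfterVisit = f.openHandles

# Only the dataset whose data is actually requested gets opened.
let data = f.readData("/run/time")
```

In this model, visiting a file with thousands of datasets leaves zero handles open, and the count only grows with the datasets that are actually read.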

Vindaar commented 6 years ago

The file https://github.com/Vindaar/nimhdf5/blob/master/tests/tDebugRamUsage.nim just visits a large file (which currently opens each dataset, group and attribute in it) and waits a few seconds. Run on a 30 GB H5 file, it outputs:

Visiting file...
    objects open:
     files open: 1
         dsets open: 10472
         groups open: 1236
         types open: 0
         attrs open: 4312

and uses ~300MB purely by visiting the whole file.

Vindaar commented 6 years ago

In addition to that, and more importantly, the current way of opening a file "in its entirety" leads to very bad performance for large files!

First of all, we should replace H5Ovisit by H5Lvisit and see if that improves performance when visiting the file. Then we should rework our getters for groups and datasets so that they stop opening each object as we add it to the tables.

Vindaar commented 5 years ago

https://github.com/Vindaar/nimhdf5/commit/7dccc71117e3808c44dd0415e39fb07d0fad6347 changes the default behavior for attributes. This was the largest cause of slowdown on very large H5 files (> 15 GB, if they contain many attributes).