Open Vindaar opened 6 years ago
Very much related: it's probably not a good idea to open every group, and especially every dataset, that we encounter while visiting the whole file. Getting the dataspace immediately is useless, too (I think...).
In principle we only need to open a dataset when we actually access the data in it. For everything else we just need to keep track of the dataset's information in the file, e.g. datatype, shape etc. That's why we have an abstract interface in the first place...
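A minimal sketch of that idea, not nimhdf5's actual code: the table entry stores only metadata, and the handle is opened on first data access. `DsetInfo`, `fakeOpen`, `ensureOpen` and `readData` are hypothetical names, and the integer handle merely stands in for a real `hid_t`.

```nim
import std/tables

type
  DsetInfo = object
    # Metadata we can track without holding an open HDF5 handle.
    name: string
    shape: seq[int]
    dtype: string
    handle: int      # hypothetical stand-in for `hid_t`; -1 means "not open"

var openCount = 0    # counts simulated opens, for illustration only

proc fakeOpen(name: string): int =
  # Hypothetical placeholder for the real dataset open call.
  inc openCount
  result = openCount

proc initDsetInfo(name: string, shape: seq[int], dtype: string): DsetInfo =
  DsetInfo(name: name, shape: shape, dtype: dtype, handle: -1)

proc ensureOpen(d: var DsetInfo) =
  # Open the dataset only if it has not been opened yet.
  if d.handle < 0:
    d.handle = fakeOpen(d.name)

proc readData(d: var DsetInfo): seq[float] =
  d.ensureOpen()
  # A real implementation would read from the file here; this sketch
  # just returns a zero-filled buffer of the right size.
  var n = 1
  for s in d.shape:
    n *= s
  result = newSeq[float](n)

# "Visiting" the file only registers metadata in the table.
var dsets = initTable[string, DsetInfo]()
dsets["/run/energy"] = initDsetInfo("/run/energy", @[10, 2], "float64")
dsets["/run/time"] = initDsetInfo("/run/time", @[10], "int64")
doAssert openCount == 0

# Only the dataset we actually read from gets opened.
let data = readData(dsets["/run/energy"])
doAssert data.len == 20
```

With this layout the number of open handles scales with the datasets actually read, not with the number of objects in the file.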
The file https://github.com/Vindaar/nimhdf5/blob/master/tests/tDebugRamUsage.nim just visits a large file (which currently opens each dataset, group and attribute in it) and waits a few seconds. Running it on a 30GB h5 file it outputs:
Visiting file...
objects open:
files open: 1
dsets open: 10472
groups open: 1236
types open: 0
attrs open: 4312
and uses ~300MB purely by visiting the whole file.
In addition to that, and more importantly, the current way of opening a file "in its entirety" leads to very bad performance for large files!
First of all we should replace H5Ovisit by H5Lvisit and see if that improves performance when visiting the file. Then we should rework our getters for groups and datasets so that they stop opening each object as we add it to the tables.
https://github.com/Vindaar/nimhdf5/commit/7dccc71117e3808c44dd0415e39fb07d0fad6347 changes the default behavior for attributes. This was the largest cause of slowdown on very large H5 files (> 15 GB, if they contain many attributes).
We need a nicer way to close each H5 object individually, without calling the native H5 functions.
In addition, we should close groups and datasets by default after a read / write etc. procedure. We can introduce some locking flag that keeps objects open if the user desires. In some cases that might be useful, e.g. if one knows that several successive writes / reads of the same dataset will happen.
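One way this could look, purely as a sketch under assumptions: a template wraps each read / write, closes the object afterwards, and skips the close when a `keepOpen` flag is set. `Dset`, `openDset`, `closeDset`, `withDset` and `keepOpen` are hypothetical names and not part of nimhdf5; the integer handle stands in for a real `hid_t`.

```nim
type
  Dset = object
    name: string
    handle: int       # hypothetical stand-in for `hid_t`; -1 means closed
    keepOpen: bool    # the "locking flag": if true, skip the automatic close

var opens = 0         # counts simulated opens
var closes = 0        # counts simulated closes

proc openDset(d: var Dset) =
  # Open only if currently closed.
  if d.handle < 0:
    inc opens
    d.handle = opens

proc closeDset(d: var Dset) =
  # Close only if currently open.
  if d.handle >= 0:
    inc closes
    d.handle = -1

template withDset(d: var Dset, body: untyped) =
  # Run `body` with the dataset open, then close it unless `keepOpen` is set.
  openDset(d)
  try:
    body
  finally:
    if not d.keepOpen:
      closeDset(d)

var d = Dset(name: "/run/energy", handle: -1, keepOpen: false)

# Default: the object is closed again right after the operation.
withDset(d):
  discard  # a single read or write would go here
doAssert d.handle == -1

# With the flag set, successive operations reuse the same open handle.
d.keepOpen = true
for i in 0 ..< 3:
  withDset(d):
    discard
doAssert d.handle >= 0

# The user releases the object explicitly when done.
closeDset(d)
```

The `try` / `finally` makes sure the close also happens when the read or write raises, so a failed operation does not leak an open handle.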