Hi! This looks really interesting. Does this library just provide syntactic sugar, or does it actually do something in the background for efficiency or other reasons? For instance, if mydataset is the same in both v1 and v2, are there two copies actually present on disk? Or is there just one copy, with two pointers going to the same data? (In my head I'm imagining something along the lines of how git works.) Thanks!
@NickCrews It does reuse data, using a design that is very similar to git's. This post goes over the details: https://labs.quansight.org/blog/2020/09/design-of-the-versioned-hdf5-library/ Basically, if two versions of the same dataset have exactly the same data in a given HDF5 chunk, that chunk is stored in the file only once.
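To make the idea concrete, here is a toy sketch of chunk-level deduplication in plain Python. It is not the library's code or API: the `ChunkStore` class, its method names, and the use of a SHA-256 digest to recognize identical chunks are all assumptions made for illustration, and it works on byte strings rather than HDF5 datasets.

```python
import hashlib


class ChunkStore:
    """Hypothetical toy store: each distinct chunk is kept once, and
    every version is just a list of indices into the shared chunk list."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.raw_chunks = []      # unique chunk contents, append-only
        self.index_by_hash = {}   # chunk digest -> position in raw_chunks
        self.versions = {}        # version name -> list of chunk positions

    def commit(self, version, data):
        """Split data into fixed-size chunks and store only the new ones."""
        refs = []
        for start in range(0, len(data), self.chunk_size):
            chunk = bytes(data[start:start + self.chunk_size])
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index_by_hash:
                # First time this content is seen: store it.
                self.index_by_hash[digest] = len(self.raw_chunks)
                self.raw_chunks.append(chunk)
            # Either way, the version only records a pointer to the chunk.
            refs.append(self.index_by_hash[digest])
        self.versions[version] = refs

    def read(self, version):
        """Reassemble a version's data from the shared chunks."""
        return b"".join(self.raw_chunks[i] for i in self.versions[version])


store = ChunkStore(chunk_size=4)
store.commit("v1", b"aaaabbbbcccc")
store.commit("v2", b"aaaabbbbdddd")  # only the last chunk differs from v1
```

In this sketch, v1 and v2 share their first two chunks, so four chunks are stored in total instead of six, while `store.read("v1")` and `store.read("v2")` still return each version's full data.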
Introducing Versioned HDF5 | Quansight Labs
https://labs.quansight.org/blog/2020/08/introducing-versioned-hdf5/