Quansight-Labs / quansight-labs-site

💻 Development site and blog for Quansight Labs
https://labs.quansight.org

Introducing Versioned HDF5 | Quansight Labs #305

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Introducing Versioned HDF5 | Quansight Labs

https://labs.quansight.org/blog/2020/08/introducing-versioned-hdf5/

NickCrews commented 2 years ago

Hi! This looks really interesting. Does this library just provide syntactic sugar, or does it actually do something in the background for efficiency or other reasons? For instance, if mydataset is the same in both v1 and v2, are there two copies actually present on disk, or is there just one copy, with two pointers to the same data? (I'm imagining something along the lines of how git works.) Thanks!

asmeurer commented 2 years ago

@NickCrews It does reuse data, using a design that is very similar to git's. This post goes over the details: https://labs.quansight.org/blog/2020/09/design-of-the-versioned-hdf5-library/ Basically, if two versions of the same dataset have exactly the same data in a given HDF5 chunk, that chunk is stored in the file only once.
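
To make the chunk-reuse idea concrete, here is a minimal pure-Python sketch of content-addressed chunk storage. It is a toy model, not the Versioned HDF5 implementation: the `ChunkStore` class, its `commit` and `read` methods, and the `CHUNK_SIZE` value are all hypothetical names chosen for illustration, and hashing each chunk with SHA-256 is simply one reasonable way to detect identical chunks.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical chunk size, in bytes, for this toy example


class ChunkStore:
    """Toy content-addressed store: each distinct chunk is kept only once."""

    def __init__(self):
        self.raw_chunks = []  # chunk payloads, appended only when unseen
        self.index = {}       # SHA-256 hex digest -> position in raw_chunks
        self.versions = {}    # version name -> list of positions in raw_chunks

    def commit(self, name, data):
        """Record a version as a list of references to stored chunks."""
        refs = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = bytes(data[start:start + CHUNK_SIZE])
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:
                # First time this content is seen: store it.
                self.index[digest] = len(self.raw_chunks)
                self.raw_chunks.append(chunk)
            # Seen before or not, the version only holds a reference.
            refs.append(self.index[digest])
        self.versions[name] = refs

    def read(self, name):
        """Reassemble a version's data from its chunk references."""
        return b"".join(self.raw_chunks[i] for i in self.versions[name])


store = ChunkStore()
v1 = bytes(range(12))   # three chunks of four bytes each
v2 = bytearray(v1)
v2[5] = 255             # change one byte, which falls in the second chunk
v2 = bytes(v2)

store.commit("v1", v1)
store.commit("v2", v2)
```

In this sketch the two versions together reference six chunks, but only four are stored: the first and third chunks are shared, and only the modified second chunk is written again. Both versions can still be read back in full with `store.read("v1")` and `store.read("v2")`.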