bundle: optimize storage with differential commits

erichanson commented 4 years ago

Bundle storage could be a lot more efficient. Blobs that haven't changed between commits are reused, so that part is fairly efficient. One could imagine some kind of line-by-line diffing like git does, but that's not even the lowest-hanging fruit: each commit generates a new uuid for every field of every row in the database. Even if a commit only changes a few rows, the entire rowset is regenerated. It turns out that for some very large bundles with a large number of commits (like the IDE), these rowset_row_field records take up more space than the contents of the bundle:

eric@vultr:~/aquameta/bundles-available/org.aquameta.core.ide$ du -sh *
3.5M    blob.csv
4.0K    bundle.csv
12K     commit.csv
4.0K    rowset.csv
672K    rowset_row.csv
7.2M    rowset_row_field.csv
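
To make the growth concrete, here is a rough back-of-the-envelope sketch in Python. The row, field, and commit counts are made-up illustration values, not measurements from the IDE bundle, and the function names are hypothetical. It only compares how many rowset_row_field records each scheme would create.

```python
def snapshot_field_records(num_rows, fields_per_row, num_commits):
    """Full-snapshot scheme: every commit re-creates one record per field of every row."""
    return num_rows * fields_per_row * num_commits


def differential_field_records(num_rows, fields_per_row, num_commits,
                               changed_fields_per_commit):
    """Differential scheme: the first commit stores every field,
    each later commit stores only the fields it changed."""
    initial = num_rows * fields_per_row
    return initial + (num_commits - 1) * changed_fields_per_commit


# Hypothetical bundle: 500 rows, 6 fields each, 100 commits,
# with about 10 changed fields per commit after the first.
full = snapshot_field_records(500, 6, 100)
diff = differential_field_records(500, 6, 100, 10)
print(full, diff)
```

Under these assumed numbers the snapshot scheme grows with rows × fields × commits, while the differential scheme grows with the amount of actual change.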

This could potentially be addressed with differential commits: commits that only generate new rowset_row and rowset_row_field records for the values the commit actually changed. That would considerably increase the complexity (and potentially the fragility) of the commit structure. It could also incur a performance hit, since the real contents of a commit would have to be reconstructed by recursively traversing the commit ancestry. It's a lot less clean and a lot more complex, but it would result in a big increase in storage efficiency. Discuss.
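
A minimal sketch of the reconstruction step, in Python rather than SQL for brevity. Everything here is an assumption for illustration: the `Commit` shape, the `changed` and `deleted_rows` fields, and the `reconstruct` helper are hypothetical and do not reflect the actual bundle schema. The ancestry walk is written as a loop instead of recursion, but the cost is the same: one pass over every ancestor.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

FieldKey = Tuple[str, str]  # (row_id, field_name), a hypothetical key shape


@dataclass
class Commit:
    id: str
    parent_id: Optional[str]
    # Only the field values this commit actually changed or added.
    changed: Dict[FieldKey, str] = field(default_factory=dict)
    # Rows this commit removed entirely.
    deleted_rows: Set[str] = field(default_factory=set)


def reconstruct(commits: Dict[str, Commit], commit_id: str) -> Dict[FieldKey, str]:
    """Rebuild the full contents of a commit by replaying its ancestry."""
    # Walk from the requested commit back to the root.
    chain = []
    current: Optional[str] = commit_id
    while current is not None:
        commit = commits[current]
        chain.append(commit)
        current = commit.parent_id

    # Replay oldest first so newer values overwrite older ones.
    state: Dict[FieldKey, str] = {}
    for commit in reversed(chain):
        if commit.deleted_rows:
            state = {k: v for k, v in state.items()
                     if k[0] not in commit.deleted_rows}
        state.update(commit.changed)
    return state


commits = {
    "c1": Commit("c1", None, {("r1", "name"): "a",
                              ("r1", "body"): "x",
                              ("r2", "name"): "b"}),
    "c2": Commit("c2", "c1", {("r1", "body"): "y"}),
    "c3": Commit("c3", "c2", deleted_rows={"r2"}),
}
head = reconstruct(commits, "c3")
```

Here `c2` stores a single field and `c3` stores only a deletion, yet `reconstruct` still yields the complete row contents at each commit. The trade-off is visible in the loop: read cost grows with the length of the ancestry chain, which is where the performance hit would come from unless full snapshots are stored periodically.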

themightychris commented 4 years ago

Is this actually a git repository under the hood?

erichanson commented 4 years ago

No, it is a reimplementation of git, but for rows in the database instead of files. It is written in SQL and PL/pgSQL.

aquametalabs / aquameta

bundle: optimize storage with differential commits #205