datacrypt-project / hitchhiker-tree

Functional, persistent, off-heap, high performance data structure
Eclipse Public License 1.0
1.19k stars 64 forks source link

Tracing GC for data segments #8

Open dgrnbrg opened 8 years ago

dgrnbrg commented 8 years ago

We need a tracing GC for data segments, so that we can know which blocks are no longer active and can be garbage collected. We'll build this as a storage middleware.

This will ensure the DB doesn't grow to use unbounded backend storage space.

dgrnbrg commented 8 years ago

The general algorithm will probably be to enumerate all the keys in the retained trees, and then scan the old roots and delete all keys which are not in any retained tree.

One thing we could try is to store a bloom filter of each tree's root. Then, when we want to GC starting from an old root, we could check each of its nodes against all the newer bloom filters, and only delete the node if it's not in a newer bloom filter. We could quickly keep these filters up-to-date by augmenting the HH tree with them, but they may waste a lot of storage space.

It would be better if we had an algorithm that traded off storage space & # of keys we need to scan from the live trees. Perhaps we could augment the HH tree, but skip some of the levels?