consensus-shipyard / ipc

🌳 Spawn multi-level trees of customized, scalable, EVM-compatible networks with IPC. L2++ powered by FVM, Wasm, libp2p, IPFS/IPLD, and CometBFT.
https://ipc.space

State tree / app state garbage collection #154

Open aakoshh opened 10 months ago

aakoshh commented 10 months ago

I was thinking about implementing garbage collection for the ledger.

See the Forest and Lotus implementations.

Forest uses a "semi-space" strategy: it copies the reachable blocks in the store to a new database and deletes the old one, always keeping the last two databases, writing into the 'current' one and reading from both 'current' and 'old'. They use ParityDB with hashed keys, so they cannot enumerate keys in their original format, which would be required for filtering if they wanted to implement "mark and sweep" GC. Their strategy works because everything in their store is IPLD data, so they can analyse reachability from the tipset roots, and everything relevant gets copied.
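As a rough illustration of that read/write pattern (hypothetical trait and struct names, nothing from Forest's actual code), a wrapper over two stores could look like this:

```rust
// Hypothetical "semi-space" wrapper: write to the current database, read from
// both, and drop the old one once the copy of reachable blocks completes.
trait KeyValueStore {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn put(&mut self, key: &[u8], value: &[u8]);
}

struct SemiSpace<S> {
    current: S, // new database: receives all writes plus the copied reachable blocks
    old: S,     // previous database: read-only, deleted when the copy is done
}

impl<S: KeyValueStore> KeyValueStore for SemiSpace<S> {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        // Prefer the current space; fall back to the old one until it is deleted.
        self.current.get(key).or_else(|| self.old.get(key))
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.current.put(key, value);
    }
}
```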

That's not how we use RocksDB: we maintain separate column families for different types of data:

  1. Application state history is not based on CIDs, just using RocksDB as a KVStore
  2. Actor state is an IPLD Blockstore in its own namespace (column family)
  3. Bitswap Store is an IPLD Blockstore reading its own namespace as well as the actor namespace, but only writing to its own

The latter separation was introduced so that data arriving over Bitswap has no chance of affecting actor calculations, since reachability analysis is not performed by the FVM. This shouldn't be an issue with EVM actors, but if we had builtin actors, it could be. NB: now that we have snapshots, and potentially garbage collection itself, the contents of the actor store can differ between nodes, so random CID lookups cannot be considered deterministic.
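For orientation, here is a minimal sketch of how the three namespaces above could be opened as RocksDB column families, assuming the `rocksdb` crate; the names are illustrative, not the ones used in the repository:

```rust
use rocksdb::{ColumnFamilyDescriptor, Options, DB};

// Illustrative layout of the three namespaces described above as column families.
fn open_store(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    let cfs = vec![
        // 1. application state history: plain KV, keys are not CIDs
        ColumnFamilyDescriptor::new("app_state", Options::default()),
        // 2. actor state: IPLD blockstore keyed by CID
        ColumnFamilyDescriptor::new("actor_state", Options::default()),
        // 3. bitswap: IPLD blockstore; reads fall through to actor_state, writes stay here
        ColumnFamilyDescriptor::new("bitswap", Options::default()),
    ];

    DB::open_cf_descriptors(&opts, path, cfs)
}
```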

I was thinking about implementing a sort of mark-and-sweep with Bloom filters on the actor state:

  1. when we start collecting, we start entering new keys written to the store into a Bloom filter, which prevents data created by contract executions from being deleted soon after it is written
  2. we traverse the reachable blocks from the application history and enter their CIDs into another Bloom filter
  3. once done, we send the history Bloom filter to the store and merge it into the other one
  4. then we iterate the keys and send them to the store for deletion iff the key is not in the Bloom filter (if it's in the Bloom filter, it's very likely because it is reachable)

This way we can collect e.g. 95% of the unused keys (depending on how big the Bloom filter is) using limited memory and no blocking.
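To make the mark and sweep phases concrete, here is a rough sketch; the `Bloom` trait and the traversal closure are placeholder interfaces, not existing ipc types:

```rust
/// Hypothetical mergeable Bloom filter interface; keys are raw CID bytes.
trait Bloom {
    fn insert(&mut self, key: &[u8]);
    fn contains(&self, key: &[u8]) -> bool;
    /// Bitwise OR with another filter built with identical parameters.
    fn merge(&mut self, other: &Self);
}

/// Mark phase (step 2 above): walk the blocks reachable from the application
/// history roots and enter every visited key into the Bloom filter.
/// (A real traversal would also track visited keys to avoid revisiting blocks.)
fn mark<B: Bloom>(
    roots: Vec<Vec<u8>>,
    children: impl Fn(&[u8]) -> Vec<Vec<u8>>,
    filter: &mut B,
) {
    let mut stack = roots;
    while let Some(key) = stack.pop() {
        filter.insert(&key);
        stack.extend(children(&key));
    }
}

/// Sweep decision (step 4 above): delete a key only if it is definitely not in
/// the merged filter; false positives keep some garbage but never drop live data.
fn should_delete<B: Bloom>(filter: &B, key: &[u8]) -> bool {
    !filter.contains(key)
}
```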

The downside is that the database has to be wrapped in something that receives read and write requests through a channel (reads should go through the same channel if we want them to be consistent with writes) and processes them in a thread. It can potentially perform deletions in batches.
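Such a wrapper could take roughly the following shape, with a request enum processed by a single thread; `Kv` and `Bloom` are the same kind of placeholder interfaces as in the sketch above, repeated here so this one stands on its own:

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Minimal store and filter interfaces assumed for this sketch (not ipc types).
trait Kv {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn put(&mut self, key: &[u8], value: &[u8]);
    fn delete(&mut self, key: &[u8]);
}
trait Bloom {
    fn insert(&mut self, key: &[u8]);
    fn contains(&self, key: &[u8]) -> bool;
    fn merge(&mut self, other: &Self);
}

/// Requests handled by a single thread, so reads, writes and GC deletions are
/// serialized and stay consistent with each other.
enum StoreRequest<B> {
    Get(Vec<u8>, Sender<Option<Vec<u8>>>),
    Put(Vec<u8>, Vec<u8>),
    /// Bloom filter produced by traversing the reachable history; merged into
    /// the filter that has been recording live writes since GC started.
    MergeMarked(B),
    /// A batch of candidate keys; each one is deleted only if not in the filter.
    SweepBatch(Vec<Vec<u8>>),
}

fn run_store<B: Bloom, S: Kv>(
    mut store: S,
    requests: Receiver<StoreRequest<B>>,
    mut gc_filter: Option<B>,
) {
    for request in requests {
        match request {
            StoreRequest::Get(key, reply) => {
                let _ = reply.send(store.get(&key));
            }
            StoreRequest::Put(key, value) => {
                // While a GC round is running, remember the key so the sweep spares it.
                if let Some(filter) = gc_filter.as_mut() {
                    filter.insert(&key);
                }
                store.put(&key, &value);
            }
            StoreRequest::MergeMarked(marked) => {
                if let Some(filter) = gc_filter.as_mut() {
                    filter.merge(&marked);
                }
            }
            StoreRequest::SweepBatch(keys) => {
                for key in keys {
                    // Delete only if definitely not reachable and not recently written.
                    if gc_filter.as_ref().map_or(false, |f| !f.contains(&key)) {
                        store.delete(&key);
                    }
                }
            }
        }
    }
}
```

Keeping the sweep batches small would let the store thread stay responsive to reads and writes between deletion rounds.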

RocksDB should eventually compact the data to reclaim the freed space and heal the fragmentation. This is a problem the "semi-space" strategy doesn't have, but in return this approach doesn't require 100% extra disk space.

A different question is how to deal with the Bitswap store, where there are no obvious reachability guidelines. For example, we wanted to use Bitswap for resolving cross-messages, but there is no obvious way to tie these to any root: unlike Forest, which has the tipset, in our case the only custodian of blocks and transactions is CometBFT. If we want roots for cross-messages (checkpoints), we have to add them to the application state explicitly; then we could run the same garbage collector on the Bitswap store as well. For now the code paths using Bitswap haven't been enabled yet.

Maybe worth noting that it is also possible to drop an entire column family, implementing "semi-space" at the per-namespace level, but it's probably not a good idea: we don't know when RocksDB would reclaim the space, so it could duplicate the storage requirement for an unknown amount of time, whereas the multi-database approach simply deletes the files that are no longer in use.

We could potentially also change the application state history to be maintained as an AMT, so that we have a single blockstore. The KVStore abstraction is different in that it offers transactionality.

cryptoAtwill commented 3 months ago

Without "stopping the world", not sure if this will introduce errors. Say if a key is not used when the sweep happens, so it's not recorded in the final bloom filter. Then when deleting unused keys, but there are new request and written to one of the unused key. But the bloom filter does not have that key recorded (unfortunate), then the key will be deleted.

aakoshh commented 2 months ago

In my mind there was no "final" Bloom filter: the database wrapper would enter a state where it uses a Bloom filter to record writes, which are not to be deleted; at some point it will receive a historical Bloom filter, the result of traversing all the reachable data from the state roots we want to keep; it merges the two but keeps adding new writes to the now-combined Bloom filter; meanwhile it will also receive batches of old keys to delete iff they are not in the current Bloom filter. At some point the process ends and we can remove the Bloom filter and go back to normal mode.
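A compact way to picture those modes (illustrative names only, `B` being any mergeable Bloom filter):

```rust
/// The modes the database wrapper cycles through during one collection round.
enum GcPhase<B> {
    /// No collection in progress: reads and writes pass straight through.
    Normal,
    /// Collection started: every key written from now on is entered into the
    /// filter, so data created while marking is still running cannot be swept.
    RecordingWrites { live: B },
    /// The historical filter (reachable state) has been merged in; writes keep
    /// being recorded, and deletion batches are retested against this filter.
    Sweeping { combined: B },
}
```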

I'm not sure what you mean exactly by a "key is not used when the sweep happens"; I understand it as: it's not being written to, and it's also not reachable in the history we want to keep. Then we start the deletion and this key is sent to the wrapper to be deleted. Two things can happen: 1) it gets deleted, but then the write puts it back, or 2) it's written and recorded in the current Bloom filter, and subsequently it's not deleted. Either way we should end up with the same state.

raulk commented 2 months ago

Indeed, stopping the world is not necessary. We can implement concurrent mark and sweep algos, although there may be some critical sections where we do need to stop writes. We can always serve reads, as long as APIs are aware of the garbage collection runs, e.g. when we start pruning height HEAD-10000 and earlier, the APIs should immediately begin rejecting queries for the included heights.

There is some precedent for this in Lotus' splitstore (hot/cold store), but here's your gentle warning that it's complex, as its design goal was to keep an eden and a tenured generation: eden/hot for the critical blockchain validation path, and tenured/cold for user queries.

I like the idea of using bloom filters, tracking live writes into mergeable bloom filters, and deleting in batches where we retest against live bloom filters.

We can tune bloom filter size and number of hashes depending on the entry count at the start of the round (which RocksDB can estimate for us).

| Number of Keys | False Positive Rate | Bloom Filter Size (MiB) | Number of Hash Functions |
|---------------:|--------------------:|------------------------:|-------------------------:|
| 1,000,000      | 0.1%                | 1.71                    | 10                       |
| 1,000,000      | 1.0%                | 1.14                    | 7                        |
| 1,000,000      | 2.0%                | 0.97                    | 6                        |
| 1,000,000      | 5.0%                | 0.74                    | 4                        |
| 10,000,000     | 0.1%                | 17.14                   | 10                       |
| 10,000,000     | 1.0%                | 11.43                   | 7                        |
| 10,000,000     | 2.0%                | 9.71                    | 6                        |
| 10,000,000     | 5.0%                | 7.43                    | 4                        |
| 100,000,000    | 0.1%                | 171.39                  | 10                       |
| 100,000,000    | 1.0%                | 114.26                  | 7                        |
| 100,000,000    | 2.0%                | 97.06                   | 6                        |
| 100,000,000    | 5.0%                | 74.33                   | 4                        |
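For reference, these rows follow from the standard Bloom filter sizing formulas, m = -n·ln(p)/(ln 2)² bits and k = -log₂(p) hash functions; a small self-contained snippet (not tied to any particular crate) reproduces the table:

```rust
/// Standard Bloom filter sizing: given the expected number of keys `n` and a
/// target false positive rate `p`, return the filter size in MiB and the
/// optimal number of hash functions.
fn bloom_params(n: f64, p: f64) -> (f64, f64) {
    let ln2 = std::f64::consts::LN_2;
    // m = -n * ln(p) / (ln 2)^2 bits
    let bits = -n * p.ln() / (ln2 * ln2);
    // k = -log2(p) hash functions
    let hashes = (-p.log2()).round();
    (bits / 8.0 / (1024.0 * 1024.0), hashes)
}

fn main() {
    for &n in &[1e6, 1e7, 1e8] {
        for &p in &[0.001, 0.01, 0.02, 0.05] {
            let (mib, k) = bloom_params(n, p);
            println!("{:>11} keys, p = {:.1}%: {:7.2} MiB, {} hashes", n as u64, p * 100.0, mib, k);
        }
    }
}
```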