Segregate the chain and state stores into two entirely different blockstore domains
For some more context, this is actually desirable/required semantics for the runtime store abstraction presented to the actors. The desired semantics are actually even stricter, requiring a consistent view of state that prevents an actor from Get()ing a block that was neither Put() in that view nor transitively reachable from the state root of the blockchain history/fork that's actually being evaluated.
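A minimal sketch of what such a view could look like, assuming a hypothetical ActorStoreView wrapper and a reduced blockstore interface; the names and the pre-computed reachability set are illustrative and not Lotus's actual runtime adapter.

```go
package runtimestore

import (
	"fmt"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// rawStore is the minimal subset of a blockstore this sketch needs.
type rawStore interface {
	Get(c cid.Cid) (blocks.Block, error)
	Put(b blocks.Block) error
}

// ActorStoreView (hypothetical) gates actor access to the underlying store:
// an actor may only Get() blocks that it Put() during this invocation, or
// that are transitively reachable from the state root of the fork being
// evaluated (pre-computed here as a set, purely for illustration).
type ActorStoreView struct {
	inner     rawStore
	reachable map[cid.Cid]struct{} // reachable from the state root
	written   map[cid.Cid]struct{} // Put() within this view
}

func NewActorStoreView(inner rawStore, reachable map[cid.Cid]struct{}) *ActorStoreView {
	return &ActorStoreView{
		inner:     inner,
		reachable: reachable,
		written:   make(map[cid.Cid]struct{}),
	}
}

// Put writes through to the underlying store and remembers the cid so the
// actor can read its own writes back.
func (v *ActorStoreView) Put(b blocks.Block) error {
	if err := v.inner.Put(b); err != nil {
		return err
	}
	v.written[b.Cid()] = struct{}{}
	return nil
}

// Get refuses to serve blocks outside the view, even if they happen to exist
// in the underlying store.
func (v *ActorStoreView) Get(c cid.Cid) (blocks.Block, error) {
	if _, ok := v.written[c]; !ok {
		if _, ok := v.reachable[c]; !ok {
			return nil, fmt.Errorf("block %s is not visible in this state view", c)
		}
	}
	return v.inner.Get(c)
}
```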
This isn't something you need to immediately worry about because, as the issue notes:
given our control of the built-in actor code, we can ensure that the semantics are indistinguishable from having no views, transactions, or garbage collection
But it's something to keep in mind, and ideally something to make more possible, rather than less possible, for a future implementation alongside end-user contracts.
Now that the splitstore shipped as an experiment in v1.5.1, and the memory watchdog has been active and silently keeping memory utilisation within bounds for a few releases, this epic can finally be closed. There are two offshoot threads that are tracked separately:
Analysis of status quo
Consequences of status quo
The current state store keeps growing at a relatively unpredictable rate, because it is influenced by dynamic factors such as the number of messages, the kinds of messages, state transitions, chain branching, etc.
Users can run out of disk space unexpectedly, because badger allocates files like SST tables and vlogs in large, pre-sized chunks. These allocations are triggered by write accumulation (which leads to flushing memtables onto L0) and by LSM level compactions (see the options sketch at the end of this section).
When disk space runs out, badger corrupts the store, and terrible things happen, including panics: https://discuss.dgraph.io/t/badger-panics-with-index-out-of-range/11303
This situation is unsustainable.
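For reference, a hedged sketch of the badger knobs that govern these chunked allocations, assuming badger v2; the values shown are illustrative, not Lotus defaults.

```go
package store

import (
	badger "github.com/dgraph-io/badger/v2"
)

func openStore(path string) (*badger.DB, error) {
	// Value log files are pre-sized: disk is consumed in segments of this
	// size as writes accumulate, regardless of how full each segment is yet.
	opts := badger.DefaultOptions(path).
		WithValueLogFileSize(1 << 30). // 1 GiB vlog segments (illustrative)
		WithNumLevelZeroTables(5)      // max L0 tables before compaction is triggered
	// Compactions rewrite SST tables, temporarily requiring extra disk space.
	return badger.Open(opts)
}
```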
Proposed solutions
✅ Segregate the chain and state stores into two entirely different blockstore domains, each of which can operate with:
✅ Further divide each blockstore domain in two tiers:
✅ Implement an archival process; for the state store, run it every Finality (900) tipsets:
✅ Implement a tiered blockstore abstraction, such that we query the active tier and then the inactive tier serially (see the sketch after this list).
When archiving into the inactive store, tag each block with the epoch it was last active at, or use some form of record striping. This enables us to create and configure retention policies as outlined in the Analysis section, e.g. "store up to 50000 epochs in the past". We can run periodic GC by iterating over the inactive store and discarding entries/stripes beyond the window (a sketch of this tagging and GC also follows the list).
Implement a fallback Bitswap trapdoor to fetch objects from the network in case something goes wrong, or the user requests an operation that requires access to chain/state beyond the retention window (#4717 might be a start).
✅ Implement the migration, either as an in-place, background process that runs inside Lotus, or as a dedicated external command that runs with exclusive access to the store (i.e. Lotus stopped). The choice/feasibility will depend on the final solution design.
Strike a balance between fsync on every write and no fsync at all.
✅ Memory watchdog. https://github.com/filecoin-project/lotus/issues/5058
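A minimal sketch of the tiered lookup described above, assuming a hypothetical TieredBlockstore type over a reduced blockstore interface; this is not the splitstore implementation that eventually shipped.

```go
package tieredbs

import (
	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// store is the minimal read/write surface this sketch relies on.
type store interface {
	Has(c cid.Cid) (bool, error)
	Get(c cid.Cid) (blocks.Block, error)
	Put(b blocks.Block) error
}

// TieredBlockstore (hypothetical) queries the hot/active tier first and
// falls back to the cold/inactive tier on a miss, serially.
type TieredBlockstore struct {
	active   store
	inactive store
}

func (t *TieredBlockstore) Get(c cid.Cid) (blocks.Block, error) {
	if has, err := t.active.Has(c); err != nil {
		return nil, err
	} else if has {
		return t.active.Get(c)
	}
	return t.inactive.Get(c)
}

// Put always targets the active tier; a separate archival process demotes
// cold blocks into the inactive tier later.
func (t *TieredBlockstore) Put(b blocks.Block) error {
	return t.active.Put(b)
}
```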
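And a sketch of the epoch-tagging and retention-GC idea from the archival item, again with hypothetical types; a real implementation would persist the tags alongside the blocks (or stripe records) rather than hold them in memory.

```go
package tieredbs

import "github.com/ipfs/go-cid"

// ChainEpoch stands in for Lotus's abi.ChainEpoch for this sketch.
type ChainEpoch int64

// epochIndex tags each archived block with the epoch it was last active at,
// so a retention policy such as "store up to 50000 epochs in the past" can
// be enforced over the inactive store.
type epochIndex struct {
	lastActive map[cid.Cid]ChainEpoch
}

// Tag records (or refreshes) the last-active epoch for a block being archived.
func (ix *epochIndex) Tag(c cid.Cid, epoch ChainEpoch) {
	ix.lastActive[c] = epoch
}

// GC returns the cids that fall outside the retention window and can be
// discarded from the inactive store; retention is expressed in epochs.
func (ix *epochIndex) GC(current, retention ChainEpoch) []cid.Cid {
	var stale []cid.Cid
	for c, e := range ix.lastActive {
		if current-e > retention {
			stale = append(stale, c)
			delete(ix.lastActive, c)
		}
	}
	return stale
}
```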
Caveats