chain/state store improvements: REDESIGN (segregation, two-tier stores, archival, etc)

raulk commented 4 years ago

Analysis of status quo

Right now, chain data and state data live in a ginormous monolith store.
This store only grows; it never shrinks. Lotus performs no active management or GC on it.
Lotus doesn't offer configurable retention policies. The user can't specify:
- please only retain the full chain, but only state objects from the last 50000 epochs
- please only retain chain and state data from this and the last finality ranges
- please retain everything
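To make the three example policies above concrete, here is a minimal sketch of how they might be modelled. Every name in it (`RetentionPolicy`, `KeepsState`, `Unlimited`, and so on) is hypothetical and not part of Lotus, and reading "this and the last finality ranges" as a window of 2 x 900 epochs is an assumption.

```go
package main

import "fmt"

// Finality is the number of epochs in one finality range, as used in this issue.
const Finality = 900

// Unlimited marks a window with no cut-off (hypothetical sentinel).
const Unlimited = -1

// RetentionPolicy is a hypothetical model of what a node keeps.
// Each window counts epochs back from the current head.
type RetentionPolicy struct {
	ChainEpochs int64 // how far back chain data is kept
	StateEpochs int64 // how far back state objects are kept
}

var (
	// "retain the full chain, but only state objects from the last 50000 epochs"
	FullChainRecentState = RetentionPolicy{ChainEpochs: Unlimited, StateEpochs: 50000}
	// "retain chain and state data from this and the last finality ranges"
	// (assumed here to mean a window of two finality ranges)
	TwoFinalities = RetentionPolicy{ChainEpochs: 2 * Finality, StateEpochs: 2 * Finality}
	// "retain everything"
	RetainEverything = RetentionPolicy{ChainEpochs: Unlimited, StateEpochs: Unlimited}
)

// inWindow reports whether an object last active at lastActive is still
// inside the given window when the head is at epoch head.
func inWindow(window, lastActive, head int64) bool {
	return window == Unlimited || head-lastActive <= window
}

// KeepsState reports whether a state object should be retained.
func (p RetentionPolicy) KeepsState(lastActive, head int64) bool {
	return inWindow(p.StateEpochs, lastActive, head)
}

// KeepsChain reports whether chain data from the given epoch should be retained.
func (p RetentionPolicy) KeepsChain(epoch, head int64) bool {
	return inWindow(p.ChainEpochs, epoch, head)
}

func main() {
	head := int64(100000)
	fmt.Println(FullChainRecentState.KeepsState(60000, head)) // 40000 epochs old
	fmt.Println(FullChainRecentState.KeepsState(40000, head)) // 60000 epochs old
	fmt.Println(TwoFinalities.KeepsChain(99000, head))        // 1000 epochs old
}
```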
We do not offer integration with snapshot services. The current snapshot facility is rather makeshift and requires very active user intervention (manually download a snapshot, stop the node, back up the store, replace it, start the node, etc.). The user experience is... improvable.
- Snapshots are becoming a necessary element in Filecoin. The experience needs to be integrated end-to-end from a product perspective.
- That said, snapshots are NOT an object of this issue, but the improvements introduced herein should serve as stepping stones towards a more cohesive experience.
Lotus doesn't natively offer different node profiles: "archive", "full node", etc. These are not formally specified, and spontaneously emerge based on how the user operates their node.
- However, the reality is that each of these profiles requires different policies for store maintenance, which we aren't able to offer because we don't model the profiles to begin with.
specs-actors and ADTs do not collaborate with Lotus to advise/hint which objects (e.g. HAMT nodes) have gone out of scope, or been delinked, as a result of state transition.
- We might need to propagate this information out to higher layers to enable refcounting, or other forms of tracking to feed into GC.
Lotus does not prune chain or state objects from abandoned chain branches.

Consequences of status quo

The current state store grows continuously at a relatively unpredictable rate, because its growth is influenced by various dynamic factors such as the number of messages, kinds of messages, state transitions, chain branching, etc.
Users can run out of disk space unexpectedly, because badger allocates files such as SST tables and vlogs in blocks. These allocations are triggered by accumulated writes (which lead to memtables being flushed to L0) and by LSM level compactions.
When disk space runs out, badger corrupts the store, and terrible things happen, including panics: https://discuss.dgraph.io/t/badger-panics-with-index-out-of-range/11303
This situation is unsustainable.

Proposed solutions

✅ Segregate the chain and state stores into two entirely different blockstore domains, each of which can operate with:
- independent storage engines, suitable to its specific access patterns (e.g. B+, hash indices, LSM, etc.)
- specific caching policies.
- optimised garbage collection/archival processes.
- ✅ done in https://github.com/filecoin-project/lotus/pull/4771
✅ Further divide each blockstore domain in two tiers:
- Active tier: contains objects from the current finality range.
- Inactive tier: contains out of scope objects.
- ✅ https://github.com/filecoin-project/lotus/pull/4992
✅ Implement an archival process; for the state store, every Finality tipsets (900):
- asynchronously (background) walk the state tree, tracking all live CIDs in a bloom filter (if we keep a key count, we could size the bloom filter for a specific desired FPP rate). Let's call this BF1.
- while that happens, track all new CIDs that are being written ("delta set"), during that Finality range. Let's call the delta set Δ1.
- when the finality range elapses, start the process for the next finality range; let's call the resulting bloom filter and delta set BF2 and Δ2.
- once 2xFinality have passed, iterate through the active tier store and copy all CIDs that do not match the bloom filter to the inactive tier. Tombstone those entries in the active tier.
- the first time we do this, it'll be quite expensive. Subsequent runs will be much lighter.
- mmapped B+ trees will take this workload much better in the active set. For badger and/or LSM trees, we'll need to Flatten/Compact frequently to actually remove deleted entries.
- ✅ https://github.com/filecoin-project/lotus/pull/4992
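The archival pass described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the linked PR: the `Bloom` type is a toy filter built on `hash/fnv`, the stores are plain maps keyed by CID string, and treating CIDs in the delta set as live is an assumption about how Δ is meant to be used. The sizing in `NewBloom` uses the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2 for a key count n and target false-positive rate p.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// Bloom is a toy bloom filter (hypothetical, for illustration only).
type Bloom struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of hash functions
}

// NewBloom sizes the filter for n keys at false-positive probability fpp.
func NewBloom(n int, fpp float64) *Bloom {
	if n < 1 {
		n = 1
	}
	m := uint64(math.Ceil(-float64(n) * math.Log(fpp) / (math.Ln2 * math.Ln2)))
	if m < 64 {
		m = 64
	}
	k := uint64(math.Round(float64(m) / float64(n) * math.Ln2))
	if k < 1 {
		k = 1
	}
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashPair derives two hashes for double hashing; the second is forced odd.
func hashPair(key string) (uint64, uint64) {
	a := fnv.New64a()
	a.Write([]byte(key))
	b := fnv.New64()
	b.Write([]byte(key))
	return a.Sum64(), b.Sum64() | 1
}

// Add marks key as present.
func (b *Bloom) Add(key string) {
	h1, h2 := hashPair(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= uint64(1) << (pos % 64)
	}
}

// MayContain returns false only if key was definitely never added.
func (b *Bloom) MayContain(key string) bool {
	h1, h2 := hashPair(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(uint64(1)<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

// Archive moves every block that is neither marked live in the bloom filter
// nor present in the delta set from active to inactive. It returns the
// number of blocks moved.
func Archive(active, inactive map[string][]byte, live *Bloom, delta map[string]struct{}) int {
	moved := 0
	for cid, data := range active {
		if live.MayContain(cid) {
			continue // reachable from the walked state tree (or a false positive)
		}
		if _, recent := delta[cid]; recent {
			continue // written during the finality range (assumed to count as live)
		}
		inactive[cid] = data
		delete(active, cid) // stands in for tombstoning the entry
		moved++
	}
	return moved
}

func main() {
	active := map[string][]byte{"live1": {1}, "live2": {2}, "new1": {3}, "dead1": {4}, "dead2": {5}}
	inactive := map[string][]byte{}

	live := NewBloom(1000, 0.01) // BF1: filled while walking the state tree
	live.Add("live1")
	live.Add("live2")
	delta := map[string]struct{}{"new1": {}} // Δ1: CIDs written meanwhile

	moved := Archive(active, inactive, live, delta)
	fmt.Println("moved:", moved, "active:", len(active), "inactive:", len(inactive))
}
```

Because a bloom filter has false positives but no false negatives, a dead block may occasionally stay in the active tier, but a block that was marked live is never moved out.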
✅ Implement a tiered blockstore abstraction, such that we query the active tier and then the inactive tier serially.
- ✅ https://github.com/filecoin-project/lotus/pull/4992
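A minimal sketch of the serial lookup across the two tiers might look like this. The `Store` interface, `MemStore`, and `TieredStore` are hypothetical simplifications keyed by CID string, not the Lotus blockstore interface, and routing all writes to the active tier is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned when a block is missing from a store.
var ErrNotFound = errors.New("block not found")

// Store is a hypothetical minimal blockstore keyed by CID string.
type Store interface {
	Get(cid string) ([]byte, error)
	Put(cid string, data []byte) error
}

// MemStore is an in-memory Store used only for this illustration.
type MemStore map[string][]byte

func (m MemStore) Get(cid string) ([]byte, error) {
	b, ok := m[cid]
	if !ok {
		return nil, ErrNotFound
	}
	return b, nil
}

func (m MemStore) Put(cid string, data []byte) error {
	m[cid] = data
	return nil
}

// TieredStore queries the active tier first and the inactive tier second.
type TieredStore struct {
	Active   Store
	Inactive Store
}

// Get falls through to the inactive tier only on a miss in the active tier;
// any other error from the active tier is returned as-is.
func (t TieredStore) Get(cid string) ([]byte, error) {
	b, err := t.Active.Get(cid)
	if err == nil {
		return b, nil
	}
	if !errors.Is(err, ErrNotFound) {
		return nil, err
	}
	return t.Inactive.Get(cid)
}

// Put writes to the active tier (assumed: new objects are always active).
func (t TieredStore) Put(cid string, data []byte) error {
	return t.Active.Put(cid, data)
}

func main() {
	ts := TieredStore{
		Active:   MemStore{"a": []byte("hot")},
		Inactive: MemStore{"b": []byte("cold")},
	}
	for _, cid := range []string{"a", "b", "missing"} {
		data, err := ts.Get(cid)
		fmt.Printf("%s: %q %v\n", cid, data, err)
	}
}
```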
When archiving into the inactive store, tag each block with the epoch it was last active at, or use some form of record striping. This enables us to create and configure retention policies as outlined in the Analysis section, e.g. "store up to 50000 epochs in the past". We can run periodic GC by iterating over the inactive store and discarding entries/stripes beyond the window.
- ➡️ tracked in https://github.com/filecoin-project/lotus/issues/5056.
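As a rough illustration of epoch tagging and window-based GC, assuming a per-block tag rather than record striping, the inactive store could look something like this. `InactiveStore`, `Archive`, and `GC` are hypothetical names, and the map stands in for a real storage engine.

```go
package main

import "fmt"

// archivedBlock pairs a block with the epoch it was last active at.
type archivedBlock struct {
	data       []byte
	lastActive int64
}

// InactiveStore is a hypothetical in-memory inactive tier keyed by CID string.
type InactiveStore map[string]archivedBlock

// Archive stores a block and tags it with the epoch it was last active at.
func (s InactiveStore) Archive(cid string, data []byte, lastActive int64) {
	s[cid] = archivedBlock{data: data, lastActive: lastActive}
}

// GC discards every entry whose tag is more than window epochs behind head
// and returns the number of entries removed.
func (s InactiveStore) GC(head, window int64) int {
	removed := 0
	for cid, blk := range s {
		if head-blk.lastActive > window {
			delete(s, cid)
			removed++
		}
	}
	return removed
}

func main() {
	s := InactiveStore{}
	s.Archive("old", []byte{1}, 10000)
	s.Archive("recent", []byte{2}, 90000)

	// "store up to 50000 epochs in the past", with the head at epoch 100000
	removed := s.GC(100000, 50000)
	fmt.Println("removed:", removed, "remaining:", len(s))
}
```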
Implement a fallback Bitswap trapdoor to fetch objects from the network in case something goes wrong, or the user requests an operation that requires access to chain/state beyond the retention window (#4717 might be a start).
- ✅ done in https://github.com/filecoin-project/lotus/pull/4717 (optional and needs to be activated with an env variable)
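The trapdoor idea can be sketched as a wrapper that consults the local store first and only then asks the network. This is not the implementation from the linked PR: `FallbackStore` and `FetchFunc` are hypothetical, the fetch function merely stands in for a Bitswap request, and a nil `Fetch` models the fallback being switched off.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned when a block is missing locally and no fallback is set.
var ErrNotFound = errors.New("block not found")

// FetchFunc stands in for a network fetch such as a Bitswap request.
type FetchFunc func(cid string) ([]byte, error)

// FallbackStore serves blocks locally and falls back to the network on a miss.
type FallbackStore struct {
	Local map[string][]byte
	Fetch FetchFunc // nil means the fallback is disabled
}

// Get returns the local copy if present. Otherwise it fetches the block,
// caches it locally, and returns it.
func (f *FallbackStore) Get(cid string) ([]byte, error) {
	if b, ok := f.Local[cid]; ok {
		return b, nil
	}
	if f.Fetch == nil {
		return nil, ErrNotFound
	}
	b, err := f.Fetch(cid)
	if err != nil {
		return nil, fmt.Errorf("fallback fetch of %s: %w", cid, err)
	}
	f.Local[cid] = b
	return b, nil
}

func main() {
	network := map[string][]byte{"beyond-window": []byte("state")}
	fs := &FallbackStore{
		Local: map[string][]byte{},
		Fetch: func(cid string) ([]byte, error) {
			b, ok := network[cid]
			if !ok {
				return nil, ErrNotFound
			}
			return b, nil
		},
	}
	data, err := fs.Get("beyond-window")
	fmt.Printf("%q %v\n", data, err)
}
```

Caching the fetched block locally is a design choice in this sketch; a real node might prefer not to persist objects that fall outside its retention window.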
✅ Implement the migration, either as an in-place, background process that runs inside Lotus, or as a dedicated external command that runs with exclusive access to the store (i.e. Lotus stopped). The choice/feasibility will depend on the final solution design.
Balance between fsync and no fsync at all.
- ➡️ tracked in https://github.com/filecoin-project/lotus/issues/5057.
✅ Memory watchdog. https://github.com/filecoin-project/lotus/issues/5058

Caveats

Some commands allow the user to override the chain, e.g. set head/follow/mark-bad, etc. We need to discuss how those commands would affect what's being laid out here.

anorth commented 4 years ago

Segregate the chain and state stores into two entirely different blockstore domains

For some more context, this is actually desirable/required semantics for the runtime store abstraction presented to the actors. Desired semantics are actually even tougher, requiring a consistent view of state that should prevent an actor Get()ing a block that was not Put() and transitively reachable from the state root in the blockchain history/fork that's actually being evaluated.

This isn't something you need to immediately worry about because, as the issue notes:

given our control of the built-in actor code, we can ensure that the semantics are indistinguishable from having no views, transactions, or garbage collection

But it's something to keep in mind, and ideally make more possible, rather than less possible, for future implementation along with end-user contracts.

raulk commented 3 years ago

Now that the splitstore shipped as an experiment in v1.5.1, and the memory watchdog has been active and silently keeping memory utilisation within bounds for a few releases, this epic can finally be closed. There are two offshoot threads that are tracked separately:

filecoin-project / lotus