filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.85k stars 1.27k forks source link

chain/state store improvements: REDESIGN (segregation, two-tier stores, archival, etc) #4753

Closed raulk closed 3 years ago

raulk commented 4 years ago

Analysis of status quo

Consequences of status quo

Proposed solutions

  1. ✅ Segregate the chain and state stores into two entirely different blockstore domains, each of which can operate with:

  2. ✅ Further divide each blockstore domain in two tiers:

  3. ✅ Implement an archival process; for the state store, every Finality tipsets (900):

    • asynchronously (background) walk the state tree, tracking all live CIDs in a bloom filter (if we keep a key count, we could size the bloom filter for a specific desired FPP rate). Let's call this BF1.
    • while that happens, track all new CIDs that are being written ("delta set"), during that Finality range. Let's call the delta set Δ1.
    • when the finality range elapses, start the process for the next finality range; let's call the resulting bloom filter and delta set BF2 and Δ2.
    • once 2xFinality have passed, iterate through the active tier store and copy all CIDs that do not match the bloom filter from to the inactive tier. Tombstone those entries in the the active tier.
    • the first time we do this, it'll be quite expensive. Next times it'll become way lighter.
    • mmapped B+ trees like will take this workload much better in the active set. For badger and/or LSM trees, we'll need to Flatten/Compact frequently to actually remove deleted entries.
    • https://github.com/filecoin-project/lotus/pull/4992
  4. ✅ Implement a tiered blockstore abstraction, such that we query the active tier and then the inactive tier serially.

  5. When archiving into the inactive store, tag each block with the epoch it was last active at, or use some form of record striping. This enables us to create and configure retention policies as outlined in the Analysis section, e.g. "store up to 50000 epochs in the past". We can run periodic GC by iterating over the inactive store and discarding entries/stripes beyond the window.

  6. Implement a fallback Bitswap trapdoor to fetch objects from the network in case something goes wrong, or the user requests an operation that requires access to chain/state beyond the retention window (#4717 might be a start).

  7. ✅ Implement the migration, either as an in-place, background process that runs inside Lotus, or as a dedicated external command that runs with exclusive access to the store (i.e. Lotus stopped). The choice/feasibility will depend on the final solution design.

  8. Balance between fsync and no fsync at all.

  9. ✅ Memory watchdog. https://github.com/filecoin-project/lotus/issues/5058

Caveats

anorth commented 4 years ago

Segregate the chain and state stores into two entirely different blockstore domains

For some more context, this is actually desirable/required semantics for the runtime store abstraction presented to the actors. Desired semantics are actually even tougher, requiring a consistent view of state that should prevent an actor Get()ing a block that was not Put() and transitively reachable from the state root in the blockchain history/fork that's actually being evaluated.

This isn't something you need to immediately worry about because, as the issue notes:

given our control of the built-in actor code, we can ensure that the semantics are indistinguishable from having no views, transactions, or garbage collection

But it's something to keep in mind, and ideally make more possible, rather than less possible, for future implementation along with end-user contracts.

raulk commented 3 years ago

Now that the splitstore shipped as an experiment in v1.5.1, and the memory watchdog has been active and silently keeping memory utilisation within bounds for a few releases, this epic can finally be closed. There are two offshot threads that are tracked separately: