cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0
6.2k stars 3.59k forks source link

Store plain historical states to reduce the db size of archive node #13317

Closed yihuang closed 1 year ago

yihuang commented 2 years ago

Plain Historical State

Context

The archive node's db size increase very fast, akthough pruning is good, but many users still need to access the complete chain history, to keep it sustainable, we need a way to reduce the disk space requirement of archive nodes.

One observation is that most of the time, we don't need to generate Merkle proofs for historical states, so we can prune the iavl tree aggressively and keep the historical states in a compact format.

Proposal

At the end of the block, before committing the iavl tree, also save the change sets in another archive storage, the storage format is explained below.

Support querying from the archive storage transparently.

Nodes can use much more aggressive pruning settings for iavl tree, and still be able to query full chain history.

The total db size is expected to be much lower than an archived iavl tree.

Write Amplification

The write amplification is increased, but it should be ok if writing speed is not the bottleneck for the workload (need to verify), after all, we only do a batch of writes per several seconds.

Archive Format

Option 1

Store in kv db (inspired by erigon ^1).

Option 2

Store the history index in kv db.

Store the change sets in plain files in an append-only way, together with some index files to record the offsets of blocks.

When querying, first locate the block number through the history index, then locate the index into the change set file, then find the key in the change set of the block with linear search.

If the change set of a block is big, it could be encoded as a mmap-able b-tree, for more efficient querying.

tac0turtle commented 2 years ago

Big fan of changing how we do writing and handling of historical state. Right now we are writing a specification and better testing for the store package to better understand the semantics and then we plan on making changes. There are a ton of changes we can do to bring huge improvements in performance.

yihuang commented 2 years ago

If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:

tac0turtle commented 2 years ago

If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:

  • A hook for third-party to handle the change sets before committing to the iavl tree.

    • I see there's already a WriteListener, may just extend that?
  • A hook to provide the external historical states to grpc query handlers.

this is adr-038, no?

yihuang commented 2 years ago

If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:

  • A hook for third-party to handle the change sets before committing to the iavl tree.

    • I see there's already a WriteListener, may just extend that?
  • A hook to provide the external historical states to grpc query handlers.

this is adr-038, no?

yeah, didn't know that, looks like exactly what we need.

yihuang commented 2 years ago

it'd be great if adr-038 could be backported to as far as v0.44 though, so we can stream the chain since the genesis.

yihuang commented 2 years ago

I guess one way to handle the legacy blocks, is to recompute the changesets by doing a complete comparison between two versions in the iavl tree, I assume that'd be too slow to be practical., let's just backport the streamer and dump the changesets.

yihuang commented 2 years ago

Implementation Plan

// VersionStore is a versioned storage of a flat key-value pairs.
// it don't need to support merkle proof, so could be implemented in a much more efficient way.
// `nil` version means the latest version.
type VersionStore interface {
    GetAtVersion(storeKey string, key []byte, version *int64) ([]byte, error)
    HasAtVersion(storeKey string, key []byte, version *int64) (bool, error)
    IteratorAtVersion(storeKey string, start, end []byte, version *int64) types.Iterator
    ReverseIteratorAtVersion(storeKey string, start, end []byte, version *int64) types.Iterator
    GetLatestVersion() int64

    // Persist the change set of a block,
    // the `changeSet` should be ordered by (storeKey, key),
    // the version should be latest version plus one.
    PutAtVersion(version int64, changeSet []types.StoreKVPair) error
}

Integrate VersionStore into rootmulti.Store

VersionStore can be implemented on top of dbm.DB, we might also experiment with lmdb/mdbx, which may not fit into the current dbm interface.

tac0turtle commented 2 years ago

in this approach are you assuming iavl does not handle versions anymore?

I think if we fix iavl ky format queries will improve dramatically

alexanderbez commented 2 years ago

Sounds like you're describing a dedicated SS layer. I'm definitely in favor of splitting SS + SC responsibilities from the IAVL!

I would really really really love for all this to be consolidated into a single EPIC. Which we have here -> https://github.com/cosmos/cosmos-sdk/issues/12986

yihuang commented 1 year ago

in this approach are you assuming iavl does not handle versions anymore?

yes, only keep the recent ones necessary for the consensus state machine, pruning=everything.

I think if we fix iavl ky format queries will improve dramatically

We are mainly focus on db size right now, in local test, the new db format is 10x smaller than the iavl db. I see great potential in the new iavl key format too, that should improve the db size of iavl too, but won't match the versiondb.

yihuang commented 1 year ago

Sounds like you're describing a dedicated SS layer. I'm definitely in favor of splitting SS + SC responsibilities from the IAVL!

The difference with the SS described in v2 store is it also support versioning, so it can support an archived grpc query service. The problem with changing iavl itself is, it's impractical to migrate the whole state on production network.

I would really really really love for all this to be consolidated into a single EPIC. Which we have here -> #12986

happy to move the discussion there. Currently the implementation can be done and experimented externally(https://github.com/crypto-org-chain/cronos/pull/722), thanks to the adr-038 state streaming interface. Now I need to hook the alternative db to the grpc query service, I'm trying to find a way that don't need to change too much to sdk internals.

tac0turtle commented 1 year ago

personally the design of v2 was poor. the idea that proof support for historical versions means you need to reconstruct the tree is a bad design. Proof support while not widely used today is at the core of blockchain and decentralising data access.

Secondly, after talking to dev from osmosis, there are designs where proof support will want to exist for previous heights. We should allow an alternative for chains that don't care, but I think we should not write it off entirely

yihuang commented 1 year ago

We should allow an alternative for chains that don't care, but I think we should not write it off entirely

Agree, that's why I try to keep the implementation outside of sdk for now. But will need to change the API a little bit to hook the db with the grpc query.

alexanderbez commented 1 year ago

@yihuang do you have a concrete proposal? I do agree that SS should not concern itself with versioning -- it has no need to. Seems like we're on the same page.

yihuang commented 1 year ago

@yihuang do you have a concrete proposal? I do agree that SS should not concern itself with versioning -- it has no need to. Seems like we're on the same page.

We have a prototype implementation already https://github.com/crypto-org-chain/cronos/pull/722 . this is only for reducing archive node db size by dropping historical merkle proofs, not a generic replacement for the iavl tree though.

alexanderbez commented 1 year ago

Amazing. Do you have an ADR, spec or doc detailing the idea?

yihuang commented 1 year ago

Amazing. Do you have an ADR, spec or doc detailing the idea?

alexanderbez commented 1 year ago

I see. We can consider and evaluate that approach as we refactor the store. I'm not sure we'll go exactly with that approach though.

alexanderbez commented 1 year ago

@yihuang so I am going to write up an ADR this week for a formal design proposal of concrete changes to state management based primarily on the work and approach designed here.

Am I correct in understanding that merkle proofs on historical data is not supported in the approach you've outlined? What if that is required or a desired property, e.g. relayers or IBC clients. What is your proposal or ideas around proof support for historical state?

yihuang commented 1 year ago

@yihuang so I am going to write up an ADR this week for a formal design proposal of concrete changes to state management based primarily on the work and approach designed here.

Am I correct in understanding that merkle proofs on historical data is not supported in the approach you've outlined? What if that is required or a desired property, e.g. relayers or IBC clients. What is your proposal or ideas around proof support for historical state?

yes, it don't support proof generation, for that case, we can probably keep an archive iavl tree with more compression for that special case, for example, omitting the value field from leaf nodes. BTW, versiondb and iavl tree are interchangeable (assuming the writings are ordered by key in each version), in the sense that we can rebuild iavl tree from versiondb changesets, also can extract changesets from iavl tree versions, so in theory no information is lost, it's just too slow to rebuild iavl tree on the fly, I guess we can also keep some checkpoints to speedup iavl tree rebuild, which is more space efficient than a full archive.

yihuang commented 1 year ago

@alexanderbez recently I'm also thinking about a design choice regarding what value to store in changeset, there are two options:

alexanderbez commented 1 year ago

On first glance, I think I prefer the current approach. Having the latest state + some subset (or all) of historical state seems like a pretty normal expectation to have.

yihuang commented 1 year ago

Another small issue compared with erigon is, we can't do chunking of the bitmap index, in erigon the chunking is done by append chunk id to the key, but since our keys have variable length, we can't do that, otherwise, it'll interleave with other keys.

alexanderbez commented 1 year ago

So I don't know what any of this is, so when I write the ADR, it would be helpful if you could link me to all relevant specification and documentation on the historicalDB approach (Sorry in advance if you've already linked it :p )

yihuang commented 1 year ago

So I don't know what any of this is, so when I write the ADR, it would be helpful if you could link me to all relevant specification and documentation on the historicalDB approach (Sorry in advance if you've already linked it :p )

The most complete description is still this doc^1, I just made some complements, describing the bitmap chunking issue, and more advantages of the alternative design to store the after value.

yihuang commented 1 year ago

I'm trying the rocksdb user-defined timestamp^1 to implement versiondb, it seems pretty good. I've implemented commands^2 to extract change set from iavl versions, and store the change set in plain file format, and write the change set into sst file, using the version as user-defined timestamp of keys. The one downside is it bind with rocksdb backend, at least for the versiondb part, good-side is it's dead simple to implement and likely more performant than custom indexings.

alexanderbez commented 1 year ago

Something we'll discuss in the storage WG!

tac0turtle commented 1 year ago

closing this because of version db and the work that will commence with store v2