Closed yihuang closed 1 year ago
Big fan of changing how we do writing and handling of historical state. Right now we are writing a specification and better testing for the store package to better understand the semantics and then we plan on making changes. There are a ton of changes we can do to bring huge improvements in performance.
If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:
WriteListener
, may just extend that?If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:
A hook for third-party to handle the change sets before committing to the iavl tree.
- I see there's already a
WriteListener
, may just extend that?- A hook to provide the external historical states to grpc query handlers.
this is adr-038, no?
If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:
A hook for third-party to handle the change sets before committing to the iavl tree.
- I see there's already a
WriteListener
, may just extend that?- A hook to provide the external historical states to grpc query handlers.
this is adr-038, no?
yeah, didn't know that, looks like exactly what we need.
it'd be great if adr-038 could be backported to as far as v0.44 though, so we can stream the chain since the genesis.
I guess one way to handle the legacy blocks, is to recompute the changesets by doing a complete comparison between two versions in the iavl tree, I assume that'd be too slow to be practical., let's just backport the streamer and dump the changesets.
// VersionStore is a versioned storage of a flat key-value pairs.
// it don't need to support merkle proof, so could be implemented in a much more efficient way.
// `nil` version means the latest version.
type VersionStore interface {
GetAtVersion(storeKey string, key []byte, version *int64) ([]byte, error)
HasAtVersion(storeKey string, key []byte, version *int64) (bool, error)
IteratorAtVersion(storeKey string, start, end []byte, version *int64) types.Iterator
ReverseIteratorAtVersion(storeKey string, start, end []byte, version *int64) types.Iterator
GetLatestVersion() int64
// Persist the change set of a block,
// the `changeSet` should be ordered by (storeKey, key),
// the version should be latest version plus one.
PutAtVersion(version int64, changeSet []types.StoreKVPair) error
}
Integrate VersionStore
into rootmulti.Store
VersionStore
to serve reading calls if provided.VPut
method in Commit
VersionStore
can be implemented on top of dbm.DB
, we might also experiment with lmdb/mdbx, which may not fit into the current dbm interface.
in this approach are you assuming iavl does not handle versions anymore?
I think if we fix iavl ky format queries will improve dramatically
Sounds like you're describing a dedicated SS layer. I'm definitely in favor of splitting SS + SC responsibilities from the IAVL!
I would really really really love for all this to be consolidated into a single EPIC. Which we have here -> https://github.com/cosmos/cosmos-sdk/issues/12986
in this approach are you assuming iavl does not handle versions anymore?
yes, only keep the recent ones necessary for the consensus state machine, pruning=everything
.
I think if we fix iavl ky format queries will improve dramatically
We are mainly focus on db size right now, in local test, the new db format is 10x smaller than the iavl db. I see great potential in the new iavl key format too, that should improve the db size of iavl too, but won't match the versiondb.
Sounds like you're describing a dedicated SS layer. I'm definitely in favor of splitting SS + SC responsibilities from the IAVL!
The difference with the SS described in v2 store is it also support versioning, so it can support an archived grpc query service. The problem with changing iavl itself is, it's impractical to migrate the whole state on production network.
I would really really really love for all this to be consolidated into a single EPIC. Which we have here -> #12986
happy to move the discussion there. Currently the implementation can be done and experimented externally(https://github.com/crypto-org-chain/cronos/pull/722), thanks to the adr-038 state streaming interface. Now I need to hook the alternative db to the grpc query service, I'm trying to find a way that don't need to change too much to sdk internals.
personally the design of v2 was poor. the idea that proof support for historical versions means you need to reconstruct the tree is a bad design. Proof support while not widely used today is at the core of blockchain and decentralising data access.
Secondly, after talking to dev from osmosis, there are designs where proof support will want to exist for previous heights. We should allow an alternative for chains that don't care, but I think we should not write it off entirely
We should allow an alternative for chains that don't care, but I think we should not write it off entirely
Agree, that's why I try to keep the implementation outside of sdk for now. But will need to change the API a little bit to hook the db with the grpc query.
@yihuang do you have a concrete proposal? I do agree that SS should not concern itself with versioning -- it has no need to. Seems like we're on the same page.
@yihuang do you have a concrete proposal? I do agree that SS should not concern itself with versioning -- it has no need to. Seems like we're on the same page.
We have a prototype implementation already https://github.com/crypto-org-chain/cronos/pull/722 . this is only for reducing archive node db size by dropping historical merkle proofs, not a generic replacement for the iavl tree though.
Amazing. Do you have an ADR, spec or doc detailing the idea?
Amazing. Do you have an ADR, spec or doc detailing the idea?
MultiStore
with this dbI see. We can consider and evaluate that approach as we refactor the store. I'm not sure we'll go exactly with that approach though.
@yihuang so I am going to write up an ADR this week for a formal design proposal of concrete changes to state management based primarily on the work and approach designed here.
Am I correct in understanding that merkle proofs on historical data is not supported in the approach you've outlined? What if that is required or a desired property, e.g. relayers or IBC clients. What is your proposal or ideas around proof support for historical state?
@yihuang so I am going to write up an ADR this week for a formal design proposal of concrete changes to state management based primarily on the work and approach designed here.
Am I correct in understanding that merkle proofs on historical data is not supported in the approach you've outlined? What if that is required or a desired property, e.g. relayers or IBC clients. What is your proposal or ideas around proof support for historical state?
yes, it don't support proof generation, for that case, we can probably keep an archive iavl tree with more compression for that special case, for example, omitting the value
field from leaf nodes.
BTW, versiondb and iavl tree are interchangeable (assuming the writings are ordered by key in each version), in the sense that we can rebuild iavl tree from versiondb changesets, also can extract changesets from iavl tree versions, so in theory no information is lost, it's just too slow to rebuild iavl tree on the fly, I guess we can also keep some checkpoints to speedup iavl tree rebuild, which is more space efficient than a full archive.
@alexanderbez recently I'm also thinking about a design choice regarding what value to store in changeset, there are two options:
N
, and can already serve queries up to version N
. In the option 1, the changeset is only useful when combined with the latest states.On first glance, I think I prefer the current approach. Having the latest state + some subset (or all) of historical state seems like a pretty normal expectation to have.
Another small issue compared with erigon is, we can't do chunking of the bitmap index, in erigon the chunking is done by append chunk id to the key, but since our keys have variable length, we can't do that, otherwise, it'll interleave with other keys.
So I don't know what any of this is, so when I write the ADR, it would be helpful if you could link me to all relevant specification and documentation on the historicalDB approach (Sorry in advance if you've already linked it :p )
So I don't know what any of this is, so when I write the ADR, it would be helpful if you could link me to all relevant specification and documentation on the historicalDB approach (Sorry in advance if you've already linked it :p )
The most complete description is still this doc^1, I just made some complements, describing the bitmap chunking issue, and more advantages of the alternative design to store the after value.
I'm trying the rocksdb user-defined timestamp^1 to implement versiondb, it seems pretty good. I've implemented commands^2 to extract change set from iavl versions, and store the change set in plain file format, and write the change set into sst file, using the version as user-defined timestamp of keys. The one downside is it bind with rocksdb backend, at least for the versiondb part, good-side is it's dead simple to implement and likely more performant than custom indexings.
Something we'll discuss in the storage WG!
closing this because of version db and the work that will commence with store v2
Plain Historical State
Context
The archive node's db size increase very fast, akthough pruning is good, but many users still need to access the complete chain history, to keep it sustainable, we need a way to reduce the disk space requirement of archive nodes.
One observation is that most of the time, we don't need to generate Merkle proofs for historical states, so we can prune the iavl tree aggressively and keep the historical states in a compact format.
Proposal
At the end of the block, before committing the iavl tree, also save the change sets in another archive storage, the storage format is explained below.
Support querying from the archive storage transparently.
Nodes can use much more aggressive pruning settings for iavl tree, and still be able to query full chain history.
The total db size is expected to be much lower than an archived iavl tree.
Write Amplification
The write amplification is increased, but it should be ok if writing speed is not the bottleneck for the workload (need to verify), after all, we only do a batch of writes per several seconds.
Archive Format
Option 1
Store in kv db (inspired by erigon ^1).
History Index,
key -> set of block numbers
The set of block numbers that modified the key, is encoded as a bitmap.
Change Sets,
(block number, key) -> value
Record the prior value for each modification.
Option 2
Store the history index in kv db.
Store the change sets in plain files in an append-only way, together with some index files to record the offsets of blocks.
When querying, first locate the block number through the history index, then locate the index into the change set file, then find the key in the change set of the block with linear search.
If the change set of a block is big, it could be encoded as a mmap-able b-tree, for more efficient querying.