ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0

Trie pruning v9000 #23427

Closed karalabe closed 2 weeks ago

karalabe commented 3 years ago

[Credits: This is a summary of various ideas from @rjl493456442, @fjl and myself]

Currently Geth's trie pruning is based on an in-memory (refcount based) garbage collector. This works well as long as all the freshly inserted dirty nodes fit into memory. When we exceed our dirty cache allowance, we need to overflow trie nodes to disk. To maximize the probability that flushed trie nodes will live long (not pruned), the flush order is older nodes first. The downside is that these nodes often will form dangling subtries (since the roots will get GCd) that are very hard to find and remove from disk.

In general, the root cause why pruning is hard is that states belonging to different blocks share almost the same data, so to avoid copying the state for every block, common parts are deduplicated in the database. However, once arbitrarily many and arbitrarily old blocks reference the same data, it becomes extremely expensive to decide when certain trie data becomes unreferenced and can be cleaned up. Historically we've tried various attempts at storing extra metadata (parent links, ref counts, ensuring only full tries are written) and using those to prune, but they all suffered from the same problem: shuffling persistent metadata at our order of magnitude is unfeasible.

The theoretical solution for pruning is to ensure that we always know - without computation - which block a certain state (or state portion) belongs to, allowing us to easily delete things that are no longer referenced. Of course, with state tries weighing 90GB, duplicating the trie nodes per block is out of the question. Retaining our current storage mechanisms, the only way to know exactly which block a state trie belongs to is to only ever store a single state trie.

This is not a new concept, we're doing the same thing in the snapshotter: we have a single persistent snapshot on disk at a relatively recent block - but older than reasonable reorg depths - and we have a number of in-memory diff layers that are used for mini reorgs. As the chain progresses, we gradually flatten older diff layers to the persistent disk layer, pushing it forward. But there's always a single state snapshot that corresponds to a well defined block.
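As a rough illustration of this layered model (all types here are hypothetical simplifications, not Geth's actual snapshot API), a lookup walks down the diff layers to the disk layer, and flattening merges the bottom diff into the persistent layer, moving it one block forward:

```go
package main

import "fmt"

// layer is a read interface shared by the disk layer and the diff layers.
type layer interface {
	Get(key string) (string, bool)
}

// diskLayer is the single persistent state at a well defined block.
type diskLayer struct {
	data map[string]string
}

func (d *diskLayer) Get(key string) (string, bool) {
	v, ok := d.data[key]
	return v, ok
}

// diffLayer holds the changes of one block on top of a parent layer.
type diffLayer struct {
	parent layer
	data   map[string]string
}

// Get walks down the layer stack until the key is found.
func (d *diffLayer) Get(key string) (string, bool) {
	if v, ok := d.data[key]; ok {
		return v, ok
	}
	return d.parent.Get(key)
}

// flatten merges a bottom-most diff into the disk layer, pushing the
// persistent state forward by one block.
func flatten(disk *diskLayer, diff *diffLayer) *diskLayer {
	for k, v := range diff.data {
		disk.data[k] = v
	}
	return disk
}

func main() {
	disk := &diskLayer{data: map[string]string{"acc1": "balance=1"}}
	diff := &diffLayer{parent: disk, data: map[string]string{"acc1": "balance=2"}}
	v, _ := diff.Get("acc1")
	fmt.Println(v) // the diff shadows the disk layer
}
```

The important property is that the disk layer always corresponds to exactly one block; everything more recent lives only in memory.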

The proposal is to do the same thing for tries. Instead of maintaining potentially infinite (and potentially dangling) tries in our database that cross link to each other all over the place, we can choose to maintain a single trie persisted to disk (the same as the persistent snapshot layer); and keep all newer tries (the changes) in memory only. Whenever we flush any data into the snapshot layer, we atomically also flush the same data into the trie.

Of course, the devil's in the details.

Reduplicate storage

Currently we only ever insert trie data into the database, but we never delete. The idea outlined here - of storing a single trie - however requires deletion (e.g. an account is changed, so the old leaf needs to be removed; or a storage slot is freed up). The problem is, currently storage tries are deduplicated and nodes are stored by hash. This means that figuring out if a storage trie node can be deleted is almost exactly the same pruning problem as we have globally (since storage tries reference each other).

The upside is that contract storage is small and mostly unique. As such, the best solution is to get rid of the optimization that deduplicates storage tries, and instead give each contract its own dedicated namespace on disk. This wastes some disk space, but allows us to decide whether a slot can be deleted using only local information, ignoring that other contracts might or might not store the same data. The namespace would be the contract address hash (i.e. the path in the account trie).
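A sketch of what such a namespaced key layout could look like (the prefixes and helper names here are made up for illustration; they are not Geth's actual database schema):

```go
package main

import (
	"bytes"
	"fmt"
)

// accountNodeKey builds a database key for an account trie node. The "A"
// prefix is a hypothetical table marker.
func accountNodeKey(path []byte) []byte {
	return append([]byte("A"), path...)
}

// storageNodeKey builds a database key for a storage trie node. The owning
// account's hash acts as the namespace, so deleting a slot only needs local
// information about this one contract.
func storageNodeKey(accountHash [32]byte, path []byte) []byte {
	key := append([]byte("S"), accountHash[:]...)
	return append(key, path...)
}

func main() {
	var acct [32]byte
	acct[0] = 0xab
	key := storageNodeKey(acct, []byte{0x1, 0x2})
	fmt.Println(len(key), bytes.HasPrefix(key, []byte("S")))
}
```

Because the account hash sits in front of the node path, two contracts storing identical data end up with distinct keys, which is exactly the reduplication being proposed.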

This isn't a new idea either: the snapshotter does exactly the same thing for the same reasons. The catch with trie storage is that this new model breaks our ability to support fast sync, because fast sync requires hash -> trie node lookups. By reduplicating storage, the added namespace in the database prevents us from finding a node by hash. This is also the main reason we couldn't do any of these optimizations until now. However, since the introduction of snap sync, the state heal phase uses path-based trie node lookups.

Long story short, maintaining a single trie in the database requires reduplicating the storage trie nodes and storing them under the contract address hash namespace; and doing that requires us to drop support for serving GetNodeData in eth/66. Other clients violated this protocol requirement a long time ago, so yeah, YOLO?

Super finality

Currently Geth has a notion of "chain immutability threshold", which is essentially a limit on the number of blocks Geth is willing to reorg. The limit is set so that a malicious node is unable to feed us an alternate chain from the genesis block (which would be easy to mine). Of course an attacker wouldn't be able to feed us a "better" chain, but it's also important to prevent them making us do a lot of work before figuring that out.

The immutability threshold is set to 3 epochs (~two weeks). This might seem excessive - and indeed Geth is the only client on the network which can do this - but the rationale is that bugs happen. In case there's a consensus issue in Geth, it can take a few hours to diagnose and patch, after which it can take more hours or days for everyone to upgrade. If both good and bad chains move in the mean time, it can happen that the reorg becomes deeper than allowed and clients get stuck on the bad chain. This is indeed what happens on pretty much every other client: they often require their users to resync from Geth. Geth until now successfully handled deep reorgs caused both by consensus issues and by attacks on testnets. The 2 weeks was chosen as a worst case scenario to fix things.

Back to the issue at hand, the catch with the long immutability threshold is that Geth needs to retain the ability to do such a deep reorg. Currently our leaky-insert-only-trie implicitly satisfies this, since all that trie data we occasionally write to disk remains on disk, so a deep reorg will eventually find a block that has full state available. If we fix the trie to be singleton, we may also lose the ability to reorg beyond it.

Things aren't that bad though. A healthy network is never expected to do deeper reorgs than a handful of blocks. This is an important observation, because it means that although we need to retain the capability to do a deep reorg, we can afford to make it arbitrarily expensive (computationally), since it will only happen in exceptional scenarios. Instead of maintaining quickly accessible old state, we only need to have a capacity to regenerate old state.

The solution to that is to retain a batch of reverse diffs up to the immutability threshold. These diffs are essentially the same as the snapshot diff layers; the catch is that they can in theory be arbitrarily large (e.g. deleting crypto kitties would make for a hefty diff). As such, the reverse diffs ought to be stored outside of leveldb in a format where appending to the end and popping off both the front (chain moved ahead) and the end (chain reorgs) is fast.
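An in-memory sketch of such a double-ended reverse-diff store (hypothetical types; as noted above, a real implementation would live in flat files outside leveldb):

```go
package main

import "fmt"

// reverseDiff records, for one block, the previous value of every key it
// touched; a nil previous value means the key did not exist before.
type reverseDiff struct {
	block   uint64
	changes map[string][]byte
}

// diffStore keeps reverse diffs ordered oldest -> newest, supporting fast
// append (chain moves ahead), pop-oldest (diff expires past the immutability
// threshold) and pop-newest (chain reorg).
type diffStore struct {
	diffs []reverseDiff
}

func (s *diffStore) append(d reverseDiff) { s.diffs = append(s.diffs, d) }

func (s *diffStore) popOldest() { s.diffs = s.diffs[1:] }

func (s *diffStore) popNewest() reverseDiff {
	d := s.diffs[len(s.diffs)-1]
	s.diffs = s.diffs[:len(s.diffs)-1]
	return d
}

// revert applies the newest reverse diff to a state, moving it one block back.
func (s *diffStore) revert(state map[string][]byte) {
	d := s.popNewest()
	for k, prev := range d.changes {
		if prev == nil {
			delete(state, k)
		} else {
			state[k] = prev
		}
	}
}

func main() {
	state := map[string][]byte{"slot": {2}}
	store := &diffStore{}
	store.append(reverseDiff{block: 100, changes: map[string][]byte{"slot": {1}}})
	store.revert(state)
	fmt.Println(state["slot"]) // back to the pre-block-100 value
}
```

Repeatedly calling revert walks both the singleton trie and the persistent snapshot backwards in history, which is exactly the capability needed for an exceptional deep reorg.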

Long story short, we need a reverse diff format and a persistent storage mechanism for it; as well as a way to move both the persistent snapshot and our singleton trie backwards in history based on these reverse diffs; up until the chain immutability threshold.

Let's call the chain immutability threshold from now on super finality to make fun of Eth 2, because they will need the same mechanism to be able to reorg finalized blocks.

Tree of tries

The above few sections defined how we can maintain a singleton state trie and how we can manage deep reorgs. The last piece of the puzzle is how to handle shallow reorgs (shorter than the depth of the persistent singleton trie). Snapshots do this via a tree of in-memory diff layers, which are flattened into the persistent disk layer when the chain moves far enough. This model would work for updating the persistent singleton trie, but it does not support calculating root hashes while doing mini reorgs.

The current trie dirty cache and in-memory garbage collection could be used to maintain the tree of tries, moving the window forward as the chain progresses, but the question is how to clean up dirty data that we've flattened out of bounds into the persistent trie. Additions are simple: any node added to the singleton trie would be deleted from the dirty cache. The problem arises when an attempt is made to delete a node from the singleton disk trie, but that node is referenced by an in-memory trie later (e.g. a slot is unset and later reset). That would require moving the node deleted from disk back into the dirty cache, which might be both brittle and perhaps even computationally too expensive.

Before figuring this part out, an important thing would be to have a clear benchmark that puts a number on the trie nodes added and removed in current mainnet blocks. That could help us figure out what the churn is, and whether we can afford to duplicate some data across tries. Getting rid of the in-memory GC altogether would make everything a lot simpler, as long as we can make it work somehow.

Long story short: we need to collect some metrics on the trie churn and start monitoring it, then figure out a data model that allows us to cheaply maintain recent state, whilst also supporting pruning it as the chain progresses.

Feasibility

As opposed to the current garbage collection based trie handling, enforcing a persistent singleton trie that follows the chain means that there is a well defined amount of data that needs to be inserted into and deleted from the database at a constant rate. This may make or break the entire idea, so we need to benchmark what caching mechanism we could use to try and keep the disk writes low without affecting the general operation of the idea. The snapshots use the bottommost diff layer as an accumulation layer that buffers writes, trying to dedup modifications made to the same slots across multiple blocks. This may or may not work for tries, but it is important to answer the question before investing too much time in implementing everything.
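The accumulation idea can be sketched as follows (hypothetical and heavily simplified; real snapshot diff layers track accounts and storage slots separately). The point is that a key modified in every block costs one disk write per flush instead of one per block:

```go
package main

import "fmt"

// accumulator buffers writes from many blocks; later writes to the same key
// overwrite earlier ones in memory, deduplicating them before they hit disk.
type accumulator struct {
	pending map[string][]byte
	writes  int // disk writes actually issued
}

func newAccumulator() *accumulator {
	return &accumulator{pending: make(map[string][]byte)}
}

// put buffers a modification; repeated writes to the same key dedup in place.
func (a *accumulator) put(key string, val []byte) {
	a.pending[key] = val
}

// flush issues one disk write per surviving key and resets the buffer.
func (a *accumulator) flush() {
	a.writes += len(a.pending)
	a.pending = make(map[string][]byte)
}

func main() {
	acc := newAccumulator()
	for block := 0; block < 100; block++ {
		acc.put("rootNode", []byte{byte(block)}) // modified every block
	}
	acc.flush()
	fmt.Println(acc.writes) // 1 write instead of 100
}
```

Whether the same trick buys as much for trie nodes as it does for flat snapshot slots is exactly the benchmark question raised above.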

Attack plan

qianbin commented 3 years ago

When trying to increase the throughput to flush a MPT into leveldb, I found a nearly perfect way to prune a trie.

After many tries, I realized that there's not much room to increase KV throughput by optimizing an LSM-based db. But what if we save trie nodes in order? Then I tried to tweak the trie node's persistent KV pair from

hash => RLP(node)

to

big-endian-uint32(blockNum) || hash => RLP(node) || RLP(child hash node blockNums)

By doing so, write throughput is boosted dramatically. Every leveldb major compaction finishes almost immediately, because the SSTs do not overlap.
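The proposed key layout could be constructed like this (illustrative sketch, not from any actual codebase): prefixing the node hash with the big-endian block number makes freshly written keys sort after all earlier ones, which is why compactions see non-overlapping SSTs.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// nodeKey builds a database key of the form
// big-endian-uint32(blockNum) || hash, so keys sort by block number first.
func nodeKey(blockNum uint32, hash [32]byte) []byte {
	key := make([]byte, 4+32)
	binary.BigEndian.PutUint32(key[:4], blockNum)
	copy(key[4:], hash[:])
	return key
}

func main() {
	var h [32]byte
	older := nodeKey(100, h)
	newer := nodeKey(101, h)
	fmt.Println(string(older) < string(newer)) // keys sort by block number
}
```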

Have to sleep. To be continued.

fjl commented 3 years ago

It should be noted that this state storage scheme makes it impossible to perform historical chain tracing.

qianbin commented 3 years ago

Have to sleep. To be continued.

The key point of my idea is: trie nodes with the same path are duplicates, and the older ones can be safely deleted.

So far, with the blockNum prefix, we know which trie node is older, but have no idea about its position, that is, its path. Re-tweak the node's persistent KV like this:

 path || big-endian-uint32(blockNum) || hash => RLP(node) || RLP(child hash node blockNums)

Assume we want to prune duplicated trie nodes between block number [M, N), where N satisfies "chain immutability threshold", and there is a node iterator which can filter out trie nodes by block number. The pseudo code will be:

mask := make(map[string]uint32)

// Filter out trie(N)'s nodes produced after block M, and
// construct the map from path to the block number the node belongs to.
it := NewTrie(rootN).NodeIterator(func(bn uint32) bool { return bn > M })
for it.Next() {
    mask[string(it.Path())] = it.BlockNum()
}

dbIt := db.NewIterator(trieSpacePrefix)
for dbIt.Next() {
    key := dbIt.Key()
    // key layout: path || big-endian-uint32(blockNum) || hash (32 bytes)
    path := key[:len(key)-4-32]
    bn := binary.BigEndian.Uint32(key[len(path) : len(path)+4])
    // the current node is older than the one trie(N) has at this path
    if newer, ok := mask[string(path)]; ok && bn < newer {
        db.Delete(key)
    }
}

jochem-brouwer commented 3 years ago

Will there be an alternative for GetNodeData in eth/67?

rjl493456442 commented 3 years ago

@jochem-brouwer The path-based trie node retrieval is still available. https://github.com/ethereum/devp2p/blob/master/caps/snap.md#gettrienodes-0x06

karalabe commented 3 years ago

@jochem-brouwer @rjl493456442 I guess the question is whether we should duplicate the same functionality in eth/67 too, or rely only on snap. The latter is cleaner, but more effort for other clients.

karalabe commented 3 years ago

@qianbin

Assume we want to prune duplicated trie nodes between block number [M, N), where N satisfies "chain immutability threshold", and there is a node iterator which can filter out trie nodes by block number.

That's what you can't do :) You can have a state item be deleted in one block, and then recreated some number of blocks later. The recreation would insert the same nodes at the same paths as the initial ones before deletion, but when you're "pruning", you wouldn't know there's a later reference to the original data and you'd delete it. The state being 90GB in size, you can't duplicate shared subtries across blocks.

rjl493456442 commented 3 years ago

Tree of tries

The live tries can be organized in a tree model: a base layer with the singleton persisted trie, and several layers with the trie diffs introduced in each block.

In the persisted trie, all newly added or updated trie nodes are flushed to disk, possibly overwriting the original version with the same trie path. Newly deleted trie nodes are deleted from disk. That is to say, the persisted trie is supposed to be complete, with no dangling trie nodes leaking.

Deletion detection

Compared with trie node addition and updating, deletion is slightly more complicated. There are several cases describing the deletion scenarios:

But this deletion information is available in the Trie. That is to say, for each block we can obtain a list of deleted trie nodes and mark them as DELETED in the diff layer. These nodes shouldn't be accessed later and eventually get deleted from disk.

Note, correctness won't be affected if the deleted nodes are not removed from disk, but it's not good to leave dangling trie nodes around.

The relationship with in-memory pruner

In the tree of tries, the bottom diff layer can absorb several diff layers by merging their trie node mutations together. This is pretty useful for the top nodes (e.g. the root node), since they are usually modified in each block. Thus the real disk writes are reduced.

Although the bottom diff layer can act as an "in-memory pruner" to some extent, it's still not as efficient as the reference-count based in-memory pruner, mainly for the following points:

qianbin commented 3 years ago

@qianbin

Assume we want to prune duplicated trie nodes between block number [M, N), where N satisfies "chain immutability threshold", and there is a node iterator which can filter out trie nodes by block number.

That's what you can't do :) You can have a state item be deleted in one block, and then recreated some number of blocks later. The recreation would insert the same nodes at the same paths as the initial ones before deletion, but when you're "pruning", you wouldn't know there's a later reference to the original data and you'd delete it. The state being 90GB in size, you can't duplicate shared subtries across blocks.

Yes, the recreation produces the same node content, but according to the new storage scheme, these nodes are totally new and have a larger block number prefix (larger lexicographical order). So deleting older nodes won't break the new trie.

I forgot to mention a side effect of the new scheme. With the path as prefix, trie node iteration will be much faster than with a bare hash key. With the combined path and block number prefix, we can get a very high compression ratio (> 2 in practice).

rjl493456442 commented 3 years ago

Node path encoding

In the new state scheme, the trie node key is encoded by node path in order to get rid of data redundancy. Since upgrading the state scheme requires a full resync, which is a super hard step for us, we should pick the state scheme carefully. I have a few key format proposals, listed here for input.

Note that for storage tries, the key is prefixed with an additional 32-byte namespace: the corresponding account hash. The reason is described above by Peter. The path discussed here refers to the node path inside the trie.

COMPACT encoding

In Geth, the path is usually encoded in the COMPACT (hex-prefix) format defined by the Yellow Paper. It puts the flags in the high nibble of the first byte and all remaining key nibbles in the following bytes. One proposal is to combine the COMPACT key encoding with a prefix as the trie node key.

The key scheme can be defined as this

account trie node key = Prefix(1byte) || COMPACT(node_path)
storage trie node key = Prefix(1byte) || account hash(32byte) || COMPACT(node_path)

Reverse COMPACT encoding

The drawback of COMPACT encoding is that it puts the flag in the high nibble of the first byte. In leveldb, the part of a key shared with the previous entry is not stored, and the flag byte at the front breaks the continuity of database keys.

For example, we have two key paths:

  • hex: []byte{1, 2, 3, 4, 5} (no terminator)
  • hex: []byte{1, 2, 3, 4, 5, 6} (no terminator)

The COMPACT encodings are:

  • compact: []byte{0x11, 0x23, 0x45}
  • compact: []byte{0x00, 0x12, 0x34, 0x56}

You can see these two adjacent keys are encoded with totally different results. We lose a lot of nice database properties, like data locality.

So we can twist the COMPACT encoding by putting the flag nibble in the end.

// REVERSE-COMPACT encoding is used for encoding trie node path in the trie node
// database key. The main difference with COMPACT encoding is that the key flag
// is put in the end of the key.
//
// e.g.
// - the key [] is encoded as [0x00]
// - the key [0x1, 0x2, 0x3] is encoded as [0x12, 0x31]
// - the key [0x1, 0x2, 0x3, 0x0] is encoded as [0x12, 0x30, 0x00]
func hexToReverseCompact(hex []byte) []byte {
    terminator := byte(0)
    if hasTerm(hex) {
        terminator = 1
        hex = hex[:len(hex)-1]
    }
    buf := make([]byte, len(hex)/2+1)
    buf[len(buf)-1] = terminator << 1 // the flag byte
    if len(hex)&1 == 1 {
        buf[len(buf)-1] |= 1                    // odd flag
        buf[len(buf)-1] |= hex[len(hex)-1] << 4 // last nibble is contained in the last byte
        hex = hex[:len(hex)-1]
    }
    decodeNibbles(hex, buf[:len(buf)-1])
    return buf
}

These two paths in this example will be encoded like this:

  • reverse-compact: []byte{0x12, 0x34, 0x51}
  • reverse-compact: []byte{0x12, 0x34, 0x56, 0x00}

So now most of the key bytes are shared between neighbours. A lot of database space can be saved and data locality is preserved.

Right-padding Reverse COMPACT encoding

Reverse-COMPACT is still not perfect, since the encoded keys are not sorted strictly by path. Let's see an example:

  • node A: hex: []byte{1, 2, 3, 4, 5} (no terminator)
  • node B: hex: []byte{1, 2, 3, 4, 5, 0} (no terminator)

The encoded keys are:

  • node A: reverse-compact: []byte{0x12, 0x34, 0x51}
  • node B: reverse-compact: []byte{0x12, 0x34, 0x50, 0x00}

Encoded path A sorts lexicographically after encoded path B, while the raw paths sort the other way around.

This encoding format has these drawbacks:

  • Ordered node key iteration is not supported
  • To-be-filled

So another encoding format can be considered, of course with a trade-off. It's called right-padding reverse COMPACT encoding.

The difference from reverse-COMPACT encoding is that the path is right-padded with zero bytes to a fixed length, and the flag is put at the end. So no matter how long the raw path is, the encoded path has a fixed size. But yes, the trade-off is that the encoded key is filled with useless zero bytes.

Still, in this example, let's assume the fixed length is 4:

  • node A: right-padding-reverse-compact: []byte{0x12, 0x34, 0x50, 0x00, 0x05}
  • node B: right-padding-reverse-compact: []byte{0x12, 0x34, 0x50, 0x00, 0x06}

For node A, the last nibble (5) is padded with a zero nibble, and the key is filled up to 4 bytes. The last byte is a flag encoding the original key length (5 nibbles).

For node B, the raw path []byte{1, 2, 3, 4, 5, 0} is packed to []byte{0x12, 0x34, 0x50} and padded likewise; again, the last byte is a flag carrying the length information (6 nibbles).

In Ethereum, the path can be up to 32 bytes long, which means countless zero bytes in node keys, a huge waste of disk space. Fortunately, the underlying database compresses keys for us, so in theory the zero bytes should compress well.

qianbin commented 3 years ago

Node path encoding

In the new state scheme, the trie node key is encoded by node path in order to get rid of data redundancy. Since upgrading the state scheme requires a full resync, which is a super hard step for us, we should pick the state scheme carefully. I have a few key format proposals, listed here for input.

Note that for storage tries, the key is prefixed with an additional 32-byte namespace: the corresponding account hash. The reason is described above by Peter. The path discussed here refers to the node path inside the trie.

COMPACT encoding

In Geth, the path is usually encoded in the COMPACT (hex-prefix) format defined by the Yellow Paper. It puts the flags in the high nibble of the first byte and all remaining key nibbles in the following bytes. One proposal is to combine the COMPACT key encoding with a prefix as the trie node key.

The key scheme can be defined as this

account trie node key = Prefix(1byte) || COMPACT(node_path)
storage trie node key = Prefix(1byte) || account hash(32byte) || COMPACT(node_path)

Reverse COMPACT encoding

The drawback of COMPACT encoding is that it puts the flag in the high nibble of the first byte. In leveldb, the part of a key shared with the previous entry is not stored, and the flag byte at the front breaks the continuity of database keys.

For example, we have two key paths:

  • hex: []byte{1, 2, 3, 4, 5} (no terminator)
  • hex: []byte{1, 2, 3, 4, 5, 6} (no terminator)

The COMPACT encodings are:

  • compact: []byte{0x11, 0x23, 0x45}
  • compact: []byte{0x00, 0x12, 0x34, 0x56}

You can see these two adjacent keys are encoded with totally different results. We lose a lot of nice database properties, like data locality.

So we can twist the COMPACT encoding by putting the flag nibble in the end.

// REVERSE-COMPACT encoding is used for encoding trie node path in the trie node
// database key. The main difference with COMPACT encoding is that the key flag
// is put in the end of the key.
//
// e.g.
// - the key [] is encoded as [0x00]
// - the key [0x1, 0x2, 0x3] is encoded as [0x12, 0x31]
// - the key [0x1, 0x2, 0x3, 0x0] is encoded as [0x12, 0x30, 0x00]
func hexToReverseCompact(hex []byte) []byte {
  terminator := byte(0)
  if hasTerm(hex) {
      terminator = 1
      hex = hex[:len(hex)-1]
  }
  buf := make([]byte, len(hex)/2+1)
  buf[len(buf)-1] = terminator << 1 // the flag byte
  if len(hex)&1 == 1 {
      buf[len(buf)-1] |= 1                    // odd flag
      buf[len(buf)-1] |= hex[len(hex)-1] << 4 // last nibble is contained in the last byte
      hex = hex[:len(hex)-1]
  }
  decodeNibbles(hex, buf[:len(buf)-1])
  return buf
}

These two paths in this example will be encoded like this:

  • reverse-compact: []byte{0x12, 0x34, 0x51}
  • reverse-compact: []byte{0x12, 0x34, 0x56, 0x00}

So now most of the keys can be shared with each other. A lot of database space can be saved and also data locality is preserved.

Right-padding Reverse COMPACT encoding

Reverse-COMPACT is still not perfect, since the encoded keys are not sorted strictly by path. Let's see an example:

  • node A: hex: []byte{1, 2, 3, 4, 5} (no terminator)
  • node B: hex: []byte{1, 2, 3, 4, 5, 0} (no terminator)

The encoded keys are:

  • node A: reverse-compact: []byte{0x12, 0x34, 0x51}
  • node B: reverse-compact: []byte{0x12, 0x34, 0x50, 0x00}

Encoded path A sorts lexicographically after encoded path B, while the raw paths sort the other way around.

This encoding format has these drawbacks:

  • The ordered node key iteration is not supported
  • To-be-filled

So another encoding format can be considered, of course with a trade-off. It's called right-padding reverse COMPACT encoding.

The difference from reverse-COMPACT encoding is that the path is right-padded with zero bytes to a fixed length, and the flag is put at the end. So no matter how long the raw path is, the encoded path has a fixed size. But yes, the trade-off is that the encoded key is filled with useless zero bytes.

Still, in this example, let's assume the fixed length is 4:

  • node A: right-padding-reverse-compact: []byte{0x12, 0x34, 0x50, 0x00, 0x05}
  • node B: right-padding-reverse-compact: []byte{0x12, 0x34, 0x50, 0x00, 0x06}

For node A, the last nibble (5) is padded with a zero nibble, and the key is filled up to 4 bytes. The last byte is a flag encoding the original key length (5 nibbles).

For node B, the raw path []byte{1, 2, 3, 4, 5, 0} is packed to []byte{0x12, 0x34, 0x50} and padded likewise; again, the last byte is a flag carrying the length information (6 nibbles).

In Ethereum, the path can be up to 32 bytes long, which means countless zero bytes in node keys, a huge waste of disk space. Fortunately, the underlying database compresses keys for us, so in theory the zero bytes should compress well.

I have another fixed-length scheme suggestion, which was applied in my project. It compacts the node path into a uint64 (8 bytes): the first 60 bits hold up to 15 nibbles and the remaining 4 bits indicate the path length.

// path64 uses a uint64 to represent the trie node path and follows the order
// of trie node iteration. Paths longer than 15 nibbles are trimmed to 15.
type path64 uint64

// newPath64 convert the trie node path into path64.
func newPath64(path []byte) path64 {
    n := len(path)
    if n > 15 {
        n = 15
    }

    var p path64
    for i := 0; i < 15; i++ {
        if i < n {
            p |= path64(path[i])
        }
        p <<= 4
    }
    return p | path64(n)
}

For nodes with path length > 15, we can allocate a separate space for them. 16^15 is large, so there won't be many long-path nodes.
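To illustrate the ordering property, here is the snippet above in self-contained form together with a small demonstration: a node's key sorts after its prefix ancestors and before its right siblings, matching depth-first trie iteration.

```go
package main

import "fmt"

// path64 uses a uint64 to represent the trie node path and follows the order
// of trie node iteration. Paths longer than 15 nibbles are trimmed to 15.
type path64 uint64

// newPath64 converts a trie node path into path64: nibble i lands in bits
// (15-i)*4 .. (15-i)*4+3, and the low 4 bits store the path length.
func newPath64(path []byte) path64 {
	n := len(path)
	if n > 15 {
		n = 15
	}
	var p path64
	for i := 0; i < 15; i++ {
		if i < n {
			p |= path64(path[i])
		}
		p <<= 4
	}
	return p | path64(n)
}

func main() {
	a := newPath64([]byte{1, 2})    // ancestor (prefix of b)
	b := newPath64([]byte{1, 2, 3}) // deeper node
	c := newPath64([]byte{1, 3})    // right sibling subtree
	fmt.Println(a < b && b < c)     // matches depth-first iteration order
}
```

Because the nibbles occupy the high bits and the length the low bits, a prefix path always compares smaller than any of its extensions, which is what makes the fixed 8-byte key iteration-ordered.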

rjl493456442 commented 3 years ago

@qianbin It's worth experimenting with. Thanks for sharing.

axic commented 3 years ago

This may have been considered while drafting up the above plan, but if not, I'd suggest for consideration the key encoding scheme presented in https://notes.ethereum.org/@vbuterin/verkle_tree_eip#Specification. I know this is work in progress (and cannot be used as-is because it changes other assumptions) and potentially under redesign currently, but if these schemes could be somewhat aligned, that could save some headache (recoding / resyncing) down the line.

rjl493456442 commented 3 years ago

@axic thanks for the reminder. The key encoding scheme I proposed here is for the database key. It's also suitable for the Verkle tree.

To be more precise, there are two "keys": the node path in the trie and the database key for storing the trie node. For the former, the MPT uses the account address hash or storage slot hash as the node path, while the Verkle tree uses the new scheme. It's OK to design various schemes.
For the latter, Geth previously used the hash of the trie node content as the database key, which leads to huge data redundancy. Now, in order to remove that redundancy, we want to change the database key scheme to a path-based one. Then, whenever we update the trie, all nodes are updated in place and there is only a single trie persisted in the database.

rjl493456442 commented 2 years ago

One more thing about the new state scheme: how can we support archive mode, or should we still support archive mode at all?

In order to support archive mode, Geth needs to store all the state reverse diffs plus some state checkpoints in order to regenerate a target state quickly. But it's not scalable, since the cost of maintaining the state checkpoints is too high. And it's not really "archive" in this sense either.

qianbin commented 2 years ago

One more thing about the new state scheme: how can we support archive mode, or should we still support archive mode at all?

In order to support archive mode, Geth needs to store all the state reverse diffs plus some state checkpoints in order to regenerate a target state quickly. But it's not scalable, since the cost of maintaining the state checkpoints is too high. And it's not really "archive" in this sense either.

Have you benchmarked leaf iteration for a node-deduped trie which takes the node path as key prefix? I think it's not much slower than traversing its snapshot. If that's true, state snapshots could be replaced by trie checkpoints (at least in the case of snap sync).

rjl493456442 commented 2 years ago

@qianbin Yes, the path-based state scheme will group the trie nodes, so theoretically trie access and iteration should be fast. Regarding retiring the snapshot, I'm not sure about it: we still need to resolve the internal trie nodes in order to access the leaves, while for the snapshot that's unnecessary. The read amplification should also be considered.

qianbin commented 2 years ago

@qianbin Yes, the path-based state scheme will group the trie nodes, so theoretically trie access and iteration should be fast. Regarding retiring the snapshot, I'm not sure about it: we still need to resolve the internal trie nodes in order to access the leaves, while for the snapshot that's unnecessary. The read amplification should also be considered.

In my previous comment, there's an idea to put the block number onto the node (extending the hash pointer with an additional block number). If this idea is acceptable, then an accurate snapshot is not necessary; instead, accumulated leaves are enough (and easy to maintain). Each leaf also carries its corresponding block number. When accessing a state, if the block number of the current trie root node is larger than or equal to the leaf's, further sub-node lookups can be skipped once we find a sub-node whose block number is less than or equal to the leaf's.

jongrun commented 1 year ago

@qianbin Yes, the path-based state scheme will group the trie nodes, so theoretically trie access and iteration should be fast. Regarding retiring the snapshot, I'm not sure about it: we still need to resolve the internal trie nodes in order to access the leaves, while for the snapshot that's unnecessary. The read amplification should also be considered.

For "Since we still need to resolve the internal trie nodes in order to access the leaves": is it possible to store the leaves with their path as key? If so, we can access the leaves by path directly instead of resolving the internal trie nodes.

jongrun commented 1 year ago

One more thing about the new state scheme: how can we support archive mode, or should we still support archive mode at all?

In order to support archive mode, Geth needs to store all the state reverse diffs plus some state checkpoints in order to regenerate a target state quickly. But it's not scalable, since the cost of maintaining the state checkpoints is too high. And it's not really "archive" in this sense either.

@rjl493456442 An immature idea that doesn't need to maintain the state checkpoints is described here. The key point is that we can read trie nodes of a specific block height from the reverse diffs, and a bloom filter can accelerate the reads.

jongrun commented 1 year ago

@qianbin Yes, the path-based state scheme will group the trie nodes, so theoretically trie access and iteration should be fast. Regarding retiring the snapshot, I'm not sure about it: we still need to resolve the internal trie nodes in order to access the leaves, while for the snapshot that's unnecessary. The read amplification should also be considered.

In my previous comment, there's an idea to put the block number onto the node (extending the hash pointer with an additional block number). If this idea is acceptable, then an accurate snapshot is not necessary; instead, accumulated leaves are enough (and easy to maintain). Each leaf also carries its corresponding block number. When accessing a state, if the block number of the current trie root node is larger than or equal to the leaf's, further sub-node lookups can be skipped once we find a sub-node whose block number is less than or equal to the leaf's.

@qianbin I'm very interested in your idea, but there are some points I don't understand. Is there any complete or further explanation about it?

qianbin commented 1 year ago

@qianbin I'm very interested in your idea, but there are some points I don't understand. Is there any complete or further explanation about it?

Here's the entry point of the implementation code: https://github.com/vechain/thor/blob/6aa5f94a1e379662931dd153e577fcc17239c56b/muxdb/internal/trie/trie.go#L209

jongrun commented 1 year ago

@qianbin I'm very interested in your idea, but there are some points I don't understand. Is there any complete or further explanation about it?

Here's the entry point of the implementation code: https://github.com/vechain/thor/blob/6aa5f94a1e379662931dd153e577fcc17239c56b/muxdb/internal/trie/trie.go#L209

Thanks. I'll read it. But I have two questions to check with you in advance.

  1. Does the scheme support archive states in a certain range of blocks? Like the issue #26981 describes.
  2. Is it compatible with the old storage scheme?

Looking forward to your reply.

qianbin commented 1 year ago

Thanks. I'll read it. But I have two questions to check with you in advance.

  1. Does the scheme support archive states in a certain range of blocks? Like the issue Can we archive states for the last N months only? #26981 describes.
  2. Is it compatible with the old storage scheme?

Looking forward to your reply.

  1. Our solution is a bit like WAL (write-ahead logging). So yes, and it's configurable. The current range is 65536, since our chain allows contracts to access the history states of the last 65536 blocks.
  2. No. Both keys and values are in a new format.

BTW, if your project is based on Geth, I don't think our implementation will be of much help to you, since it's not easy to migrate the changes to Geth.

jongrun commented 1 year ago

Thanks. I'll read it. But I have two questions to check with you in advance.

  1. Does the scheme support archive states in a certain range of blocks? Like the issue Can we archive states for the last N months only? #26981 describes.
  2. Is it compatible with the old storage scheme?

Looking forward to your reply.

  1. Our solution is a bit like WAL (write-ahead logging). So yes, and it's configurable. The current range is 65536, since our chain allows contracts to access the history states of the last 65536 blocks.
  2. No. Both keys and values are in a new format.

BTW, if your project is based on Geth, I don't think our implementation will be of much help to you, since it's not easy to migrate the changes to Geth.

I'm looking for a scheme that doesn't require offline pruning and can maintain the states of the last N blocks dynamically. I think your implementation may be helpful. Further discussion on the two questions:

  1. Using WAL, will the block range of the history states influence the reading speed? That is, if the range is the last 3 months, is the read cost tolerable?
  2. For the 2nd question, my point is: is the algorithm for the state/storage root the same as in the old scheme? In other words, will they produce an equal state root for the same data set (both accounts and contract storage are exactly the same)?
qianbin commented 1 year ago
  1. Using WAL, will the block range of the history states influence the reading speed? That is, if the range is the last 3 months, is the read cost tolerable?

Nope. Neither reading nor writing is affected. A fully archived node (at 1/10 the scale of the eth mainnet) keeping the full WAL is as fast as pruning nodes.

  2. For the 2nd question, my point is: is the algorithm for the state/storage root the same as in the old scheme? In other words, will they produce an equal state root for the same data set (both accounts and contract storage are exactly the same)?

Yes, the same.

jongrun commented 1 year ago
  1. Using WAL, will the block range of the history states influence the reading speed? That is, if the range is the last 3 months, is the read cost tolerable?

Nope. Neither reading nor writing is affected. A fully archived node (at 1/10 the scale of the eth mainnet) keeping the full WAL is as fast as pruning nodes.

  2. For the 2nd question, my point is: is the algorithm for the state/storage root the same as in the old scheme? In other words, will they produce an equal state root for the same data set (both accounts and contract storage are exactly the same)?

Yes, the same.

Nice. Thank you for your reply. I'll read the code in detail.

Georgezhang714 commented 1 year ago

The difference between full nodes and archive nodes is that archive nodes allow you to query the historical state at any past block.

Georgezhang714 commented 1 year ago

I also thought that in a B-Tree all leaves are at the same level. A B-Tree is defined by its minimum degree t: every node except the root must contain at least t-1 keys.

muang0 commented 1 year ago

+1 on @jongrun's thoughts; it would be really nice to be able to configure geth to store the last N blocks

holiman commented 2 weeks ago

This ticket is moot / completed now that we use path-based storage

karalabe commented 2 weeks ago

Whadayamean? We implemented the issue described :D So completed yes, moot no :P

holiman commented 2 weeks ago

Completing the implementation made the ticket moot, since it is completed.

By closing the ticket, I obtain all credit for the work performed.