The Turbo Geth database consists of a few sections:
I believe that Turbo Geth only stores the leaf nodes for each trie; it keeps all the intermediate nodes for the most recent trie in memory. For instance, the account storage section only stores `sha(address) -> account` pairs. This means that supporting a hypothetical `GetNodeDataByPath` requires running range queries over the database and generating nodes as they're requested.
I believe that account history is stored as `(sha(address), block number) -> account state as of that block`. To find out what the state of a given account was as of a given block, you can perform a range query, scanning forward starting at `(sha(address), block number)`. The first value you find is your answer; if you don't find any values then the state as of that block is the same as the current state. To rewind the account database to a given block number you need to perform the above operation on every account. (If I'm guessing right, this must be why the `block # -> changed keys` index was added.)
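Here's a minimal sketch of that lookup, assuming the history keys are `sha(address)` concatenated with a fixed-width big-endian block number and the current state lives under a separate `current:` prefix (the key encoding, the prefix, and the use of the `plyvel` LevelDB bindings are my guesses, not Turbo Geth's actual schema):

```python
from typing import Optional

import plyvel


def history_key(address_hash: bytes, block_number: int) -> bytes:
    # Fixed-width big-endian block numbers keep all entries for one account
    # adjacent and sorted by block number, so a forward scan visits them in order.
    return address_hash + block_number.to_bytes(8, 'big')


def account_at_block(db: plyvel.DB, address_hash: bytes, block_number: int) -> Optional[bytes]:
    """Return the account value as of ``block_number``."""
    start = history_key(address_hash, block_number)
    stop = address_hash + b'\xff' * 8  # stay inside this account's history
    # Forward range query: the first entry at or after (address, block) is the
    # reverse diff that was still in effect at ``block_number``.
    for _key, value in db.iterator(start=start, stop=stop):
        return value
    # No history entry found: the state at that block equals the current state.
    return db.get(b'current:' + address_hash)
```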
Turbo Geth calls their history store "reverse differences". It means pruning can be supported by dropping all keys `(sha(address), block number)` where the block number is below a given value. It's complicated by `CREATE2`. In a `CREATE2` world, `SELFDESTRUCT` operations mean that tombstone values must be written to every storage item that contract owns. This could be a lot of tombstones!
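A sketch of that pruning pass, reusing the hypothetical key encoding from above. Because the keys are ordered address-first, dropping old history means scanning every account's history rather than deleting one contiguous range (this assumes the history section is its own `plyvel` database, e.g. a `prefixed_db`):

```python
import plyvel


def prune_history(history_db: plyvel.DB, cutoff_block: int) -> None:
    """Delete reverse diffs whose block number is below ``cutoff_block``.

    Assumes every key in ``history_db`` is a 32-byte account hash followed by
    an 8-byte big-endian block number, as in the lookup sketch above.
    """
    with history_db.write_batch() as batch:
        for key, _value in history_db.iterator():
            block_number = int.from_bytes(key[-8:], 'big')
            if block_number < cutoff_block:
                batch.delete(key)
```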
From an email thread with someone else working on this problem. Leaving out some details and their identity in case they don't want this to be made public yet.
turbotrie has a couple other improvements based on simplifying the data model and being more space efficient. In terms of layout, nodes are stored as `(path, trie version #, trie root hash) -> object` (I believe). This also supports pruning away older versions.
Intermediate nodes are important to have available:

- Turbo Geth appears to only store leaves (I think it keeps a single copy of the trie in memory which it modifies in-place). This means that even a modified form of `GetNodeData` can only serve requests for the current trie. Syncing clients must constantly pivot in order to take advantage of Turbo Geth servers.
- `(root hash, path) -> obj` could be used to store and quickly serve nodes. It's also very easy to prune: drop hashes which belong to blocks which have been pruned away.
- It's not necessary to store intermediate nodes if they can be quickly generated when required.
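A sketch of how a `(root hash, path) -> obj` store could be keyed so that pruning a whole trie is a single prefix scan (the key layout and `plyvel` usage are illustrative assumptions, not an existing implementation):

```python
import plyvel


def node_key(root_hash: bytes, path: bytes) -> bytes:
    # Root hash first, so every node belonging to one trie is contiguous.
    return root_hash + path


def get_node(node_db: plyvel.DB, root_hash: bytes, path: bytes) -> bytes:
    return node_db.get(node_key(root_hash, path))


def prune_trie(node_db: plyvel.DB, pruned_root: bytes) -> None:
    # Dropping a pruned trie is one prefix scan plus a batched delete.
    with node_db.write_batch() as batch:
        for key, _value in node_db.iterator(prefix=pruned_root):
            batch.delete(key)
```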
I believe that many of the changes will need to be made to py-evm; maybe this issue should be transferred there.

`eth.vm.base.VM.import_block()` calls `self.build_state`, which returns a `BaseState` based at the correct state root for executing the given block on top of. This method might have to be changed. We might have to generate the correct state and state root by first rewinding the database.
`eth.vm.state.BaseState(db: BaseAtomicDB, execution_context: ExecutionContext, state_root: bytes)` assumes that all you need is a `db` and a `state_root`. Supporting rewinding might require passing in additional information, such as the current block number?
Likewise, `eth.db.account.AccountDB(BaseAtomicDB, state_root)` might instead/also want to accept the block number.
It might be possible to leave `eth.db.chain.ChainDB` and `eth.db.header.HeaderDB` unchanged.

`ChainDB` has `get` and `exists` methods. Either `ChainDB` should always be rooted in the correct block or no callers should be using these to ask about account/storage data.

`eth.vm.base.VM.validate_block()` calls `self.chaindb.exists(block.header.state_root)`.
Trinity mostly lives in a purely functional world. The Turbo-Geth layout uses a canonical view of the current state and current history. This means there can be multiple read-only views of the database, but only one writer at any given time. Going the Turbo-Geth route means being careful about which state the database currently represents and only having one writing instance of the database open at a time.
This is going to be a large undertaking; what's the smallest useful change which can be tested to work and then expanded upon?
Jason suggested running both databases in parallel. I can implement some subset of functionality, such as storing reverse diffs of the account history, and have all calls persist both the old format and the new format. Reads also read from both databases and fail if any discrepancies are noticed. This allows starting with some subset of the functionality and then expanding it to include everything the old format supported.
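A rough sketch of what that dual-write, dual-read wrapper could look like. The class and method names are invented for illustration and are not Trinity's actual API; the point is just that writes land in both layouts and reads cross-check them:

```python
class DualAccountDB:
    """Persist account data in both layouts and cross-check every read."""

    def __init__(self, legacy_db, turbo_db):
        self.legacy_db = legacy_db  # existing sha(object) -> object trie store
        self.turbo_db = turbo_db    # experimental flat/history layout

    def set_account(self, address: bytes, account_rlp: bytes) -> None:
        # Writes go to both formats so either can serve as the source of truth.
        self.legacy_db.set_account(address, account_rlp)
        self.turbo_db.set_account(address, account_rlp)

    def get_account(self, address: bytes) -> bytes:
        # Reads consult both formats and fail loudly on any discrepancy.
        legacy_value = self.legacy_db.get_account(address)
        turbo_value = self.turbo_db.get_account(address)
        if legacy_value != turbo_value:
            raise AssertionError(
                f"account mismatch for {address.hex()}: "
                "legacy and experimental databases disagree"
            )
        return legacy_value
```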
Potential starting points:
- Add an `--experimental-db` flag to Trinity which enables writing and reading from both databases in parallel.
- … (`bcForkStreeTest` might cover this)
- `eth_getStorageAt` returns the storage item for the given address, as of the given block number
- `eth_getBalance` returns the balance of the given address as of the given block number
- `eth_getTransactionCount` returns the number of transactions a given address has sent, as of a given block number
- `eth_getCode` returns the code for a given address, as of the given block number
- `eth_call` executes a message call against the state as of a given block
- `eth_getProof` proves a given address + storage value as of the given block number (this one seems very difficult to implement without saving tries on disk)
- `GetNodeData` can ask for any node in any account or storage trie, though in practice peers only serve tries from the most recent 64 blocks.

I'm very ok with intermediate solutions that fail to support some of the JSON-RPC methods if the trade-off means a functional client sooner.
> I'm very ok with intermediate solutions that fail to support some of the JSON-RPC methods if the trade-off means a functional client sooner.
Yeah, I absolutely agree! My hope is that there's a small core of functionality here that can be implemented quickly and then expanded upon. I agree that some features (like `eth_getProof`) not only don't need to happen soon, they probably don't need to happen before beta.
Almost all the RPCs which require looking into history require a lookup on `{address, block #} -> Account`. It's probably too expensive to materialize this for every block. The next best thing would be if we could look up the state of an account with a leveldb range query that returns the first row it finds. (As best I can tell, range queries returning a single row should be nearly as fast as single-key lookups; both of them involve the same binary search through the data, though this is worth testing!)

The leveldb docs state that forward range queries are "somewhat" faster than backward range queries. It's worth experimenting to see how drastic the difference is, but that probably means the common case should be scanning forward from `(address, block #)` and returning the first value (assuming it matches the current account).
This means `(address, block #)` is preferable to `(block #, address)`, and it also means that what Turbo Geth calls "reverse diffs" must be stored: `(address, block #) -> value` means that `address` was `value` up to and until `block #`.
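Since this is worth testing, here's a rough micro-benchmark sketch comparing point gets against forward and backward single-row scans with the `plyvel` bindings (the data set, key shape, and path are arbitrary; real numbers will depend on database size and cache state):

```python
import os
import time

import plyvel


def time_lookups(db_path: str = '/tmp/range-query-bench') -> None:
    """Rough comparison of point gets vs. single-row forward/backward scans."""
    db = plyvel.DB(db_path, create_if_missing=True)
    keys = [os.urandom(8) + i.to_bytes(8, 'big') for i in range(100_000)]
    with db.write_batch() as batch:
        for key in keys:
            batch.put(key, b'value')

    sample = keys[::100]  # ~1,000 keys spread across the keyspace

    start = time.perf_counter()
    for key in sample:
        db.get(key)
    print('point gets:     ', time.perf_counter() - start)

    start = time.perf_counter()
    for key in sample:
        next(db.iterator(start=key), None)  # first row at or after key
    print('forward scans:  ', time.perf_counter() - start)

    start = time.perf_counter()
    for key in sample:
        next(db.iterator(stop=key, reverse=True), None)  # first row before key
    print('backward scans: ', time.perf_counter() - start)

    db.close()
```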
Storing reverse diffs means that pruning the oldest blocks is relatively simple, though it makes rewinding blocks a little complicated.
This might be the most complicated operation. The problem with `(address, block #)` is that it assumes there are no forks: it can only represent the history of the canonical chain. However, sometimes (during reorgs) we'll want to import blocks which are not children of the tip of the chain. How should this work?
If the block to be imported (the importing block) is a child of a block in the canonical chain, `build_state` can simply be a view over the database which acts in the same way the RPCs act: each account lookup is actually a range query which returns the first row. Writes are cached in memory and eventually stored in a format which I'll discuss in a few paragraphs.

It's more complicated if the block to be imported is a child of a block which is also not canonical. In that case trinity will have to find the latest ancestor which is part of the canonical chain, then look up and apply the state transitions for all ancestor non-canonical blocks (all of this in-memory), and finally compute the state transitions for the block to be imported. If this block is the new canonical block then the database will have to be rewound and the state transitions from the newly canonical blocks applied. (A diagram is probably appropriate here)
This leaves a few questions:
I believe that this requires an index of `block hash -> state changes`. Since maintaining this and also maintaining `(address, block #)` would mean that history is kept twice, maybe `block hash -> state changes` is only saved for non-canonical blocks. When the canonical fork changes, trinity swaps some blocks out of each store and into the other.

Rewinding the state for a single account is easy; however, rewinding the state for all accounts is difficult unless you already know which accounts need to be changed, and for this `(block #) -> changed accounts` is an important index to have.
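A sketch of rewinding one block under those assumptions. The key layout, the `current:` prefix, and the helper names are illustrative guesses, and `changed_accounts` stands in for a lookup against the `(block #) -> changed accounts` index:

```python
import plyvel


def history_key(address_hash: bytes, block_number: int) -> bytes:
    # Same hypothetical encoding as in the earlier sketches: account hash plus
    # a fixed-width big-endian block number.
    return address_hash + block_number.to_bytes(8, 'big')


def rewind_one_block(db: plyvel.DB, block_number: int, changed_accounts) -> None:
    """Undo the state changes of ``block_number`` for every touched account.

    The reverse diff keyed at (account, block_number - 1) holds the value the
    account had before this block; restore it as the current value and drop
    the now-redundant history entry.
    """
    with db.write_batch() as batch:
        for address_hash in changed_accounts:
            key = history_key(address_hash, block_number - 1)
            previous_value = db.get(key)
            if previous_value is None:
                # The account did not exist before this block: remove it.
                batch.delete(b'current:' + address_hash)
            else:
                batch.put(b'current:' + address_hash, previous_value)
                batch.delete(key)
```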
All of the above is pretty complicated, and it's complicated because of the `(address, block #)` index which was added to improve RPC performance. If RPCs were allowed to be much slower (or if users who wanted fast history-searching RPCs had to enable a flag which created the index and made their database larger) then trinity could get by with a simpler schema:
- `address -> account` to store the current state
- `block hash -> changed accounts` to store the history (this might still use reverse diffs?)

Importing non-canonical blocks still works the same way: an in-memory view of the database is created which undoes and applies transitions to the current state until it's ready for the importing block to be run. To make very old states easier to reach, checkpoints could be created which materialize the entire state at different points along the chain.
It's possible to quickly serve single-account requests using the `(address, block #) -> account` layout; however, that layout makes iterating over a bunch of accounts (something Firehose's `GetStateRanges` requires) slower than it could be. As an optimization, the current state can be stored in a different part of the database (Turbo Geth does this). If it's duplicated into the new section, writes are a little slower; if the current state is not kept as part of the history, then reads are a little more complicated.
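A sketch of why a dedicated current-state section helps range requests: with keys that are just `sha(address)`, serving a chunk of accounts is one sequential scan. The function below illustrates the idea only and is not Firehose's actual `GetStateRanges` handling:

```python
from typing import Iterator, Tuple

import plyvel


def iterate_accounts(
    current_state_db: plyvel.DB,
    start_hash: bytes,
    limit: int,
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield up to ``limit`` (account hash, account value) pairs starting at
    ``start_hash``.  Adjacent accounts are adjacent on disk, so this is a
    single forward scan rather than many random trie lookups."""
    for count, (key, value) in enumerate(current_state_db.iterator(start=start_hash)):
        if count >= limit:
            break
        yield key, value
```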
Phase 0: Preparation

- A `scripts/enable_experimental_db.py` is added which sets the key and also triggers any migrations (none during phase 0).
- `Chain.import_block()` (or something in its call path) opens a file-lock which prevents multiple blocks from being imported at once, and probably emits a warning if the lock ever blocks.

Phase 1: Current Account State

- … `block hash -> state changes`, which can be used to replay or unwind the changes each block made to the current account state database.

In more detail:
- `BaseState`'s constructor is refactored: it accepts a `BlockHeader` instead of a state root. Most of the `BaseState` implementation is moved into a sub-class which still accepts that state root, so tests continue working.
- `TurboState(BaseState)` is created, which accepts a `BaseState` and forwards all reads and writes to it, but also reads from the current account db, and writes diffs to a `block hash -> changed accounts` index (sketched below).
- `ChainDB.persist_block()` saves headers; when it notices that the canonical chain has changed it also updates the current account db, by looking up and applying the block diffs.
- `TurboState.persist()` (or something in `VM.import_block()`) saves the account db to the existing trie store, but also saves the diffs to a `block hash -> changed accounts` index.

Phase 2: Current Storage Item State

Phase 3: Reverse account diffs

Phase 4: … (`path -> obj`) format and each block also saves trie diffs.
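A toy illustration of the block-diff idea from Phase 1: for every account a block touched, remember the value before and after, so the diff can be replayed onto or unwound from the current account state. This is only a sketch of the concept, not the `BlockDiff` class from the work-in-progress branch, and it uses a plain dict as a stand-in for the current account database:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class BlockDiffSketch:
    block_hash: bytes
    # account hash -> (value before the block, value after the block);
    # ``None`` means the account did not exist on that side of the block.
    changes: Dict[bytes, Tuple[Optional[bytes], Optional[bytes]]]

    def apply_to(self, current_state: Dict[bytes, bytes]) -> None:
        """Replay this block's changes onto the current account state."""
        for account, (_old, new) in self.changes.items():
            if new is None:
                current_state.pop(account, None)
            else:
                current_state[account] = new

    def unwind_from(self, current_state: Dict[bytes, bytes]) -> None:
        """Undo this block's changes, restoring the pre-block values."""
        for account, (old, _new) in self.changes.items():
            if old is None:
                current_state.pop(account, None)
            else:
                current_state[account] = old
```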
Some open items:

- `AccountDB`. It'd be great if I didn't have to re-implement it! How can `TurboState` use a new layout while still taking advantage of `AccountDB`'s journaling?
- `Chain.from_genesis()`.
- `MiningChain.apply_transaction()` calls `persist()` a couple times on the same block. For this to work in a world where `persist()` saves block diffs, it will also have to update those block diffs, when appropriate. `persist()` is also called by `configure_homestead_header`, so maybe it makes more sense for something else to trigger a block diff save.

Getting back onto this train, I still think it's important for Trinity, and also for Ethereum 1.x, that we give adopting the TurboGeth layout a real shot.
Saving block diffs is mostly finished, though it needs some more thorough testing. Today I started working on the current account state database; I think I have a prototype, and now I need to write a bunch of tests for it.
Some open questions:
The current account database, when hooked up, passes all core tests and also all fixtures. Though I should write some more tests!
Some things left on my list:

- `test_block_diffs` needs some reworking to accommodate the fact that block diffs are now based off the previous block
- `TurboAccountDB`: I think most of this logic should instead live in `AccountDB`.
- `MiningChain` should be refactored to use a single VM instance which keeps track of the current pending state. This would fix the above bug. It also should probably improve performance!
- `BlockDiff`s include all the changes. The `BlockDiff`s are only being tested implicitly, when accounts are read from the database.
- `make_state_root()` should break the part of the code which reads accounts from the turbodb. If we're halfway through block execution then the turbodb is going to be out of date, and all requests need to be served from the journal. However, I think calls to `make_state_root()` squash the journal and push changes to disk?
- … `from_genesis` is called? Maybe even do it in `from_genesis`?

GetNodeData
Prompted by @pipermerriam, I spent some time looking at how a TurboGeth-db-powered Trinity could respond to `GetNodeData` requests. Trinity will need to fire off `GetNodeData` requests to patch up the trie it gets from Firehose sync, so to be a good network citizen it also needs to be able to serve `GetNodeData`. (And hopefully a future version of eth, eth/65, will be something easier for us to serve, so this complication can be removed.)

I looked a little into how Geth stores tries. The important parts are trie/database.go and core/blockchain.go. Geth stores 128 tries in memory, though I think it never keeps a full trie in memory, just 128 tries worth of dirty nodes. The in-memory nodes have reference counts which makes garbage collecting them easier. Upon shutdown three tries are flushed to disk. As far as I can tell, when Geth first starts up it's only able to serve `GetNodeData` requests for the last two tries it knows about; it doesn't support serving the full 128 tries until it's been running for a while.
I think Trinity could do something similar. To be a good network citizen, Trinity needs to have a store from `hash -> node` which encompasses 128 tries worth of nodes. It needs to be able to easily evict tries from this store and add tries to this store.
A silly answer is to build a store which doesn't support trie eviction. Every so often the store is wiped away and the most recent trie regenerated. leveldb loves bulk writes so this might be a quick operation.
A simple answer is to store a single copy of the trie on disk, read it into memory upon startup, and then do approximately what Geth does. All the intermediate nodes constitute 2.5GB of memory, which might balloon into 5GB with Python overhead, which is probably disqualifying.
I think something similar will work though! If a single copy of the trie is stored on disk, the remaining 127 can be kept in memory and the on-disk trie periodically updated. Updating the trie requires some kind of reference counting: the store is `node_hash -> (parent_count, node)`. This is not cheap: every node has up to 16 children, so I expect that when a new trie is added even more nodes will need to be updated than added. I think this is okay though. I'm hand-waving, but there's no need to update the trie every time a block comes in; we could update the trie every 20 blocks, for instance.
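A toy, in-memory sketch of the reference-counted `node_hash -> (parent_count, node)` store (the helper names are invented, and a real version would live in LevelDB and would walk each added or evicted trie to find the child hashes to increment or decrement):

```python
from typing import Dict, Tuple

# node hash -> (reference count, encoded node)
RefCountedStore = Dict[bytes, Tuple[int, bytes]]


def add_reference(store: RefCountedStore, node_hash: bytes, node: bytes) -> None:
    """Record that one more retained trie points at this node."""
    count, _ = store.get(node_hash, (0, node))
    store[node_hash] = (count + 1, node)


def release_reference(store: RefCountedStore, node_hash: bytes) -> None:
    """Drop one reference; delete the node once no retained trie uses it."""
    count, node = store[node_hash]
    if count <= 1:
        del store[node_hash]
    else:
        store[node_hash] = (count - 1, node)
```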
This way we can:

- Serve `GetNodeData` requests for the last 128 tries.

The current account database is nearing completion. Some things which now work:

- … `MiningChain`, which was necessary for getting all the tests to pass.

Some next steps:
A todo list with smaller items:

- `test_eth1_chaindb` tests were broken by all these changes and should be updated
- `b''` is stored in the database when accounts are deleted. The account should also be deleted from the database!
- `BlockDiff`s might be getting written multiple times, which is inefficient. I should look into whether this is really happening and stop it if it is.

Exciting! Do you have a wip branch online (the linked one doesn't seem to be up to date)?
Seconding what @cburgdorf said. Can you make sure your work is available for us to look at? Visibility into in-progress work is important!
Yeah, sorry, I thought I had already posted it @cburgdorf! Here's a compare view for the branch I've been using: https://github.com/ethereum/py-evm/compare/master...lithp:lithp/turbo-account-state It's still very much in "make it work" mode; this will have to be cleaned up a lot before it can be merged in, but here it is!
@lithp you should open a "Draft" PR for that so that it is more easily findable.
Here's the draft PR: https://github.com/ethereum/py-evm/pull/1879
What is wrong?
In order for Trinity to benefit from the improved sync speed Firehose makes possible, it needs to change the way it lays out data on-disk. Specifically, account and storage data is currently stored as `sha(object) -> object`, which means that syncing requires a lot of random reads and writes.

How can it be fixed?
I'm not sure what the best approach is, I'll add some comments below exploring the options.
To limit scope I'm going to try to stick to using leveldb; I think that switching it out for a different database might lead to further performance improvements but would drastically increase the amount of work required.
Some requirements & opportunities
The new layout needs to be able to handle re-orgs, meaning it needs to be able to rewind blocks. This requires that some kind of history is kept.
The new layout needs to be able to quickly generate chunks of leaves and serve them to clients, so other clients will be able to quickly sync.
In order to be a good citizen on the network, the new layout needs to be able to quickly serve `eth/63` and `les/2`. In particular, requests like `GetNodeData` require keeping around a materialized trie or maybe more. Alternatively, we might want to suggest an `eth/64` which does nothing but replace `GetNodeData` with something easier to serve.

The new layout also needs to support quick responses to all the JSON-RPCs.

Currently Trinity runs as an archive node and doesn't perform any pruning. In fact, `sha(object) -> object` makes pruning very difficult because it naturally de-dupes, and the only way to know that you've removed the last reference to an object is to do a very expensive garbage-collection pass of the entire database. Ideally the new layout would make pruning easier so that trinity won't have to save everything.

Trie nodes in random order are about the least compressible disk pages possible. Storing all the accounts next to each other is likely to improve the compression ratio of the database (leveldb compresses on a per-page basis).