The Turbo Geth database consists of a few sections:
I believe that Turbo Geth only stores the leaf nodes for each trie; it keeps all the intermediate nodes for the most recent trie in memory. For instance, the account storage section only stores `sha(address) -> account` pairs. This means that supporting a hypothetical `GetNodeDataByPath` requires running range queries over the database and generating nodes as they're requested.
I believe that account history is stored as `(sha(address), block number) -> account state as of that block`. To find out what the state of a given account was as of a given block, you can perform a range query, scanning forward starting at `(sha(address), block number)`. The first value you find is your answer; if you don't find any values then the state as of that block is the same as the current state. To rewind the account database to a given block number you need to perform the above operation on every account. (If I'm guessing right, this must be why the `block # -> changed keys` index was added.)
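Here's a minimal sketch of that lookup, assuming the history keys are `sha(address)` concatenated with a fixed-width big-endian block number and the current state lives under a separate `current:` prefix (the key encoding, the prefix, and the use of the `plyvel` LevelDB bindings are my guesses, not Turbo Geth's actual schema):

```python
from typing import Optional

import plyvel


def history_key(address_hash: bytes, block_number: int) -> bytes:
    # Fixed-width big-endian block numbers keep all entries for one account
    # adjacent and sorted by block number, so a forward scan visits them in order.
    return address_hash + block_number.to_bytes(8, 'big')


def account_at_block(db: plyvel.DB, address_hash: bytes, block_number: int) -> Optional[bytes]:
    """Return the account value as of ``block_number``."""
    start = history_key(address_hash, block_number)
    stop = address_hash + b'\xff' * 8  # stay inside this account's history
    # Forward range query: the first entry at or after (address, block) is the
    # reverse diff that was still in effect at ``block_number``.
    for _key, value in db.iterator(start=start, stop=stop):
        return value
    # No history entry found: the state at that block equals the current state.
    return db.get(b'current:' + address_hash)
```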
Turbo Geth calls their history store "reverse differences". It means pruning can be supported by dropping all keys `(sha(address), block number)` where the block number is below a given value. It's complicated by `CREATE2`. In a `CREATE2` world, `SELFDESTRUCT` operations mean that tombstone values must be written to every storage item that contract owns. This could be a lot of tombstones!
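A sketch of that pruning pass, reusing the hypothetical key encoding from above. Because the keys are ordered address-first, dropping old history means scanning every account's history rather than deleting one contiguous range (this assumes the history section is its own `plyvel` database, e.g. a `prefixed_db`):

```python
import plyvel


def prune_history(history_db: plyvel.DB, cutoff_block: int) -> None:
    """Delete reverse diffs whose block number is below ``cutoff_block``.

    Assumes every key in ``history_db`` is a 32-byte account hash followed by
    an 8-byte big-endian block number, as in the lookup sketch above.
    """
    with history_db.write_batch() as batch:
        for key, _value in history_db.iterator():
            block_number = int.from_bytes(key[-8:], 'big')
            if block_number < cutoff_block:
                batch.delete(key)
```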
From an email thread with someone else working on this problem. Leaving out some details and their identity in case they don't want this to be made public yet.
turbotrie has a couple other improvements based on simplifying the data model and being more space efficient. In terms of layout, nodes are stored as `(path, trie version #, trie root hash) -> object` (I believe). This also supports pruning away older versions.
Intermediate nodes are important to have available:

- Turbo Geth appears to only store leaves (I think it keeps a single copy of the trie in memory which it modifies in-place). This means that even a modified form of `GetNodeData` can only serve requests for the current trie. Syncing clients must constantly pivot in order to take advantage of Turbo Geth servers.
- `(root hash, path) -> obj` could be used to store and quickly serve nodes. It's also very easy to prune: drop hashes which belong to blocks which have been pruned away.
- It's not necessary to store intermediate nodes if they can be quickly generated when required.
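A sketch of how a `(root hash, path) -> obj` store could be keyed so that pruning a whole trie is a single prefix scan (the key layout and `plyvel` usage are illustrative assumptions, not an existing implementation):

```python
import plyvel


def node_key(root_hash: bytes, path: bytes) -> bytes:
    # Root hash first, so every node belonging to one trie is contiguous.
    return root_hash + path


def get_node(node_db: plyvel.DB, root_hash: bytes, path: bytes) -> bytes:
    return node_db.get(node_key(root_hash, path))


def prune_trie(node_db: plyvel.DB, pruned_root: bytes) -> None:
    # Dropping a pruned trie is one prefix scan plus a batched delete.
    with node_db.write_batch() as batch:
        for key, _value in node_db.iterator(prefix=pruned_root):
            batch.delete(key)
```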
I believe that many of the changes will need to be made to py-evm; maybe this issue should be transferred there.

`eth.vm.base.VM.import_block()` calls `self.build_state`, which returns a `BaseState` based at the correct state root for executing the given block on top of. This method might have to be changed. We might have to generate the correct state and state root by first rewinding the database.
`eth.vm.state.BaseState(db: BaseAtomicDB, execution_context: ExecutionContext, state_root: bytes)` assumes that all you need is a `db` and a `state_root`. Supporting rewinding might require passing in additional information, such as the current block number?
Likewise, `eth.db.account.AccountDB(BaseAtomicDB, state_root)` might instead/also want to accept the block number.
It might be possible to leave `eth.db.chain.ChainDB` and `eth.db.header.HeaderDB` unchanged.

`ChainDB` has `get` and `exists` methods. Either `ChainDB` should always be rooted in the correct block or no callers should be using these to ask about account/storage data.

`eth.vm.base.VM.validate_block()` calls `self.chaindb.exists(block.header.state_root)`.
Trinity mostly lives in a purely functional world. The Turbo-Geth layout uses a canonical view of the current state and current history. This means there can be multiple read-only views of the database, but only one writer at any given time. Going the Turbo-Geth route means being careful about which state the database currently represents and only having one writing instance of the database open at a time.
This is going to be a large undertaking; what's the smallest useful change which can be tested to work and then expanded upon?
Jason suggested running both databases in parallel. I can implement some subset of functionality, such as storing reverse diffs of the account history, and have all calls persist both the old format and the new format. Reads also read from both databases and fail if any discrepancies are noticed. This allows starting with some subset of the functionality and then expanding it to include everything the old format supported.
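A rough sketch of what that dual-write, dual-read wrapper could look like. The class and method names are invented for illustration and are not Trinity's actual API; the point is just that writes land in both layouts and reads cross-check them:

```python
class DualAccountDB:
    """Persist account data in both layouts and cross-check every read."""

    def __init__(self, legacy_db, turbo_db):
        self.legacy_db = legacy_db  # existing sha(object) -> object trie store
        self.turbo_db = turbo_db    # experimental flat/history layout

    def set_account(self, address: bytes, account_rlp: bytes) -> None:
        # Writes go to both formats so either can serve as the source of truth.
        self.legacy_db.set_account(address, account_rlp)
        self.turbo_db.set_account(address, account_rlp)

    def get_account(self, address: bytes) -> bytes:
        # Reads consult both formats and fail loudly on any discrepancy.
        legacy_value = self.legacy_db.get_account(address)
        turbo_value = self.turbo_db.get_account(address)
        if legacy_value != turbo_value:
            raise AssertionError(
                f"account mismatch for {address.hex()}: "
                "legacy and experimental databases disagree"
            )
        return legacy_value
```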
Potential starting points:
- Add an `--experimental-db` flag to Trinity which enables writing and reading from both databases in parallel.
- … (`bcForkStreeTest` might cover this)
- `eth_getStorageAt` returns the storage item for the given address, as of the given block number
- `eth_getBalance` returns the balance of the given address as of the given block number
- `eth_getTransactionCount` returns the number of transactions a given address has sent, as of a given block number
- `eth_getCode` returns the code for a given address, as of the given block number
- `eth_call` executes a message call against the state as of a given block
- `eth_getProof` proves a given address + storage value as of the given block number (this one seems very difficult to implement without saving tries on disk)
- `GetNodeData` can ask for any node in any account or storage trie, though in practice peers only serve tries from the most recent 64 blocks.

I'm very ok with intermediate solutions that fail to support some of the JSON-RPC methods if the trade-off means a functional client sooner.
> I'm very ok with intermediate solutions that fail to support some of the JSON-RPC methods if the trade-off means a functional client sooner.
Yeah, I absolutely agree! My hope is that there's a small core of functionality here that can be implemented quickly and then expanded upon. I agree that some features (like `eth_getProof`) not only don't need to happen soon, they probably don't need to happen before beta.
Almost all the RPCs which require looking into history require a lookup on `{address, block #} -> Account`. It's probably too expensive to materialize this for every block. The next best thing would be if we could look up the state of an account with a leveldb range query that returns the first row it finds. (As best I can tell, range queries returning a single row should be nearly as fast as single-key lookups; both of them involve the same binary search through the data, though this is worth testing!)

The leveldb docs state that forward range queries are "somewhat" faster than backward range queries. It's worth experimenting to see how drastic the difference is, but that probably means the common case should be scanning forward from `(address, block #)` and returning the first value (assuming it matches the current account).
This means `(address, block #)` is preferable to `(block #, address)`, and it also means that what Turbo Geth calls "reverse diffs" must be stored: `(address, block #) -> value` means that `address` was `value` up to and until `block #`.
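Since this is worth testing, here's a rough micro-benchmark sketch comparing point gets against forward and backward single-row scans with the `plyvel` bindings (the data set, key shape, and path are arbitrary; real numbers will depend on database size and cache state):

```python
import os
import time

import plyvel


def time_lookups(db_path: str = '/tmp/range-query-bench') -> None:
    """Rough comparison of point gets vs. single-row forward/backward scans."""
    db = plyvel.DB(db_path, create_if_missing=True)
    keys = [os.urandom(8) + i.to_bytes(8, 'big') for i in range(100_000)]
    with db.write_batch() as batch:
        for key in keys:
            batch.put(key, b'value')

    sample = keys[::100]  # ~1,000 keys spread across the keyspace

    start = time.perf_counter()
    for key in sample:
        db.get(key)
    print('point gets:     ', time.perf_counter() - start)

    start = time.perf_counter()
    for key in sample:
        next(db.iterator(start=key), None)  # first row at or after key
    print('forward scans:  ', time.perf_counter() - start)

    start = time.perf_counter()
    for key in sample:
        next(db.iterator(stop=key, reverse=True), None)  # first row before key
    print('backward scans: ', time.perf_counter() - start)

    db.close()
```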
Storing reverse diffs means that pruning the oldest blocks is relatively simple, though it makes rewinding blocks a little complicated.
This might be the most complicated operation. The problem with `(address, block #)` is that it assumes there are no forks: it can only represent the history of the canonical chain. However, sometimes (during reorgs) we'll want to import blocks which are not children of the tip of the chain. How should this work?
If the block to be imported (the importing block) is a child of a block in the canonical chain, `build_state` can simply be a view over the database which acts in the same way the RPCs act: each account lookup is actually a range query which returns the first row. Writes are cached in memory and eventually stored in a format which I'll discuss in a few paragraphs.

It's more complicated if the block to be imported is a child of a block which is also not canonical. In that case trinity will have to find the latest ancestor which is part of the canonical chain, then look up and apply the state transitions for all ancestor non-canonical blocks (all of this in-memory), and finally compute the state transitions for the block to be imported. If this block is the new canonical block then the database will have to be rewound and the state transitions from the newly canonical blocks applied. (A diagram is probably appropriate here)
This leaves a few questions:
I believe that this requires an index of `block hash -> state changes`. Since maintaining this and also maintaining `(address, block #)` would mean that history is kept twice, maybe `block hash -> state changes` is only saved for non-canonical blocks. When the canonical fork changes, trinity swaps some blocks out of each store and into the other.

Rewinding the state for a single account is easy; however, rewinding the state for all accounts is difficult unless you already know which accounts need to be changed, and for this `(block #) -> changed accounts` is an important index to have.
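A sketch of rewinding one block under those assumptions. The key layout, the `current:` prefix, and the helper names are illustrative guesses, and `changed_accounts` stands in for a lookup against the `(block #) -> changed accounts` index:

```python
import plyvel


def history_key(address_hash: bytes, block_number: int) -> bytes:
    # Same hypothetical encoding as in the earlier sketches: account hash plus
    # a fixed-width big-endian block number.
    return address_hash + block_number.to_bytes(8, 'big')


def rewind_one_block(db: plyvel.DB, block_number: int, changed_accounts) -> None:
    """Undo the state changes of ``block_number`` for every touched account.

    The reverse diff keyed at (account, block_number - 1) holds the value the
    account had before this block; restore it as the current value and drop
    the now-redundant history entry.
    """
    with db.write_batch() as batch:
        for address_hash in changed_accounts:
            key = history_key(address_hash, block_number - 1)
            previous_value = db.get(key)
            if previous_value is None:
                # The account did not exist before this block: remove it.
                batch.delete(b'current:' + address_hash)
            else:
                batch.put(b'current:' + address_hash, previous_value)
                batch.delete(key)
```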
All of the above is pretty complicated, and it's complicated because of the `(address, block #)` index which was added to improve RPC performance. If RPCs were allowed to be much slower (or if users who wanted fast history-searching RPCs had to enable a flag which created the index and made their database larger) then trinity could get by with a simpler schema:
- `address -> account` to store the current state
- `block hash -> changed accounts` to store the history (this might still use reverse diffs?)

Importing non-canonical blocks still works the same way: an in-memory view of the database is created which undoes and applies transitions to the current state until it's ready for the importing block to be run. To make very old states easier to reach, checkpoints could be created which materialize the entire state at different points along the chain.
It's possible to quickly serve single-account requests using the `(address, block #) -> account` layout; however, that layout makes iterating over a bunch of accounts (something Firehose's `GetStateRanges` requires) slower than it could be. As an optimization, the current state can be stored in a different part of the database (Turbo Geth does this). If it's duplicated into the new section, writes are a little slower; if the current state is not kept as part of the history, then reads are a little more complicated.
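A sketch of why a dedicated current-state section helps range requests: with keys that are just `sha(address)`, serving a chunk of accounts is one sequential scan. The function below illustrates the idea only and is not Firehose's actual `GetStateRanges` handling:

```python
from typing import Iterator, Tuple

import plyvel


def iterate_accounts(
    current_state_db: plyvel.DB,
    start_hash: bytes,
    limit: int,
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield up to ``limit`` (account hash, account value) pairs starting at
    ``start_hash``.  Adjacent accounts are adjacent on disk, so this is a
    single forward scan rather than many random trie lookups."""
    for count, (key, value) in enumerate(current_state_db.iterator(start=start_hash)):
        if count >= limit:
            break
        yield key, value
```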
Phase 0: Preparation

- A `scripts/enable_experimental_db.py` is added which sets the key and also triggers any migrations (none during phase 0).
- `Chain.import_block()` (or something in its call path) opens a file-lock which prevents multiple blocks from being imported at once, and probably emits a warning if the lock ever blocks.

Phase 1: Current Account State

- … `block hash -> state changes`, which can be used to replay or unwind the changes each block made to the current account state database.

In more detail:
- `BaseState`'s constructor is refactored: it accepts a `BlockHeader` instead of a state root. Most of the `BaseState` implementation is moved into a sub-class which still accepts that state root, so tests continue working.
- `TurboState(BaseState)` is created, which accepts a `BaseState` and forwards all reads and writes to it, but also reads from the current account db, and writes diffs to a `block hash -> changed accounts` index (sketched below).
- `ChainDB.persist_block()` saves headers; when it notices that the canonical chain has changed it also updates the current account db, by looking up and applying the block diffs.
- `TurboState.persist()` (or something in `VM.import_block()`) saves the account db to the existing trie store, but also saves the diffs to a `block hash -> changed accounts` index.

Phase 2: Current Storage Item State

Phase 3: Reverse account diffs

Phase 4: … (`path -> obj`) format and each block also saves trie diffs.
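A toy illustration of the block-diff idea from Phase 1: for every account a block touched, remember the value before and after, so the diff can be replayed onto or unwound from the current account state. This is only a sketch of the concept, not the `BlockDiff` class from the work-in-progress branch, and it uses a plain dict as a stand-in for the current account database:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class BlockDiffSketch:
    block_hash: bytes
    # account hash -> (value before the block, value after the block);
    # ``None`` means the account did not exist on that side of the block.
    changes: Dict[bytes, Tuple[Optional[bytes], Optional[bytes]]]

    def apply_to(self, current_state: Dict[bytes, bytes]) -> None:
        """Replay this block's changes onto the current account state."""
        for account, (_old, new) in self.changes.items():
            if new is None:
                current_state.pop(account, None)
            else:
                current_state[account] = new

    def unwind_from(self, current_state: Dict[bytes, bytes]) -> None:
        """Undo this block's changes, restoring the pre-block values."""
        for account, (old, _new) in self.changes.items():
            if old is None:
                current_state.pop(account, None)
            else:
                current_state[account] = old
```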
Some open items:

- `AccountDB`. It'd be great if I didn't have to re-implement it! How can `TurboState` use a new layout while still taking advantage of `AccountDB`'s journaling?
- `Chain.from_genesis()`.
- `MiningChain.apply_transaction()` calls `persist()` a couple times on the same block. For this to work in a world where `persist()` saves block diffs, it will also have to update those block diffs, when appropriate. `persist()` is also called by `configure_homestead_header`, so maybe it makes more sense for something else to trigger a block diff save.

Getting back onto this train, I still think it's important for Trinity, and also for Ethereum 1.x, that we give adopting the TurboGeth layout a real shot.
Saving block diffs is mostly finished, though it needs some more thorough testing. Today I started working on the current account state database; I think I have a prototype, and now I need to write a bunch of tests for it.
Some open questions:
The current account database, when hooked up, passes all core tests and also all fixtures. Though I should write some more tests!
Some things left on my list:

- `test_block_diffs` needs some reworking to accommodate the fact that block diffs are now based off the previous block
- `TurboAccountDB`: I think most of this logic should instead live in `AccountDB`.
- `MiningChain` should be refactored to use a single VM instance which keeps track of the current pending state. This would fix the above bug. It also should probably improve performance!
- `BlockDiff`s include all the changes. The `BlockDiff`s are only being tested implicitly, when accounts are read from the database.
- `make_state_root()` should break the part of the code which reads accounts from the turbodb. If we're halfway through block execution then the turbodb is going to be out of date, and all requests need to be served from the journal. However, I think calls to `make_state_root()` squash the journal and push changes to disk?
- … `from_genesis` is called? Maybe even do it in `from_genesis`?

GetNodeData
Prompted by @pipermerriam, I spent some time looking at how a TurboGeth-db-powered Trinity could respond to `GetNodeData` requests. Trinity will need to fire off `GetNodeData` requests to patch up the trie it gets from Firehose sync, so to be a good network citizen it also needs to be able to serve `GetNodeData`. (And hopefully a future version of eth, eth/65, will be something easier for us to serve, so this complication can be removed.)

I looked a little into how Geth stores tries. The important parts are trie/database.go and core/blockchain.go. Geth stores 128 tries in memory, though I think it never keeps a full trie in memory, just 128 tries worth of dirty nodes. The in-memory nodes have reference counts which makes garbage collecting them easier. Upon shutdown three tries are flushed to disk. As far as I can tell, when Geth first starts up it's only able to serve `GetNodeData` requests for the last two tries it knows about; it doesn't support serving the full 128 tries until it's been running for a while.
I think Trinity could do something similar. To be a good network citizen, Trinity needs to have a store from `hash -> node` which encompasses 128 tries worth of nodes. It needs to be able to easily evict tries from this store and add tries to this store.
A silly answer is to build a store which doesn't support trie eviction. Every so often the store is wiped away and the most recent trie regenerated. leveldb loves bulk writes so this might be a quick operation.
A simple answer is to store a single copy of the trie on disk, read it into memory upon startup, and then do approximately what Geth does. All the intermediate nodes constitute 2.5GB of memory, which might balloon into 5GB with Python overhead, which is probably disqualifying.
I think something similar will work though! If a single copy of the trie is stored on disk, the remaining 127 can be kept in memory and the on-disk trie periodically updated. Updating the trie requires some kind of reference counting: the store is `node_hash -> (parent_count, node)`. This is not cheap: every node has up to 16 children, so I expect that when a new trie is added even more nodes will need to be updated than added. I think this is okay though. I'm hand-waving, but there's no need to update the trie every time a block comes in; we could update the trie every 20 blocks, for instance.
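A toy, in-memory sketch of the reference-counted `node_hash -> (parent_count, node)` store (the helper names are invented, and a real version would live in LevelDB and would walk each added or evicted trie to find the child hashes to increment or decrement):

```python
from typing import Dict, Tuple

# node hash -> (reference count, encoded node)
RefCountedStore = Dict[bytes, Tuple[int, bytes]]


def add_reference(store: RefCountedStore, node_hash: bytes, node: bytes) -> None:
    """Record that one more retained trie points at this node."""
    count, _ = store.get(node_hash, (0, node))
    store[node_hash] = (count + 1, node)


def release_reference(store: RefCountedStore, node_hash: bytes) -> None:
    """Drop one reference; delete the node once no retained trie uses it."""
    count, node = store[node_hash]
    if count <= 1:
        del store[node_hash]
    else:
        store[node_hash] = (count - 1, node)
```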
This way we can:

- Serve `GetNodeData` requests for the last 128 tries.

The current account database is nearing completion. Some things which now work:

- … `MiningChain`, which was necessary for getting all the tests to pass.

Some next steps:
A todo list with smaller items:

- `test_eth1_chaindb` tests were broken by all these changes and should be updated
- `b''` is stored in the database when accounts are deleted. The account should also be deleted from the database!
- `BlockDiff`s might be getting written multiple times, which is inefficient. I should look into whether this is really happening and stop it if it is.

Exciting! Do you have a wip branch online (the linked one doesn't seem to be up to date)?
Seconding what @cburgdorf said. Can you make sure your work is available for us to look at? Visibility into in-progress work is important!
Yeah, sorry, I thought I had already posted it @cburgdorf! Here's a compare view for the branch I've been using: https://github.com/ethereum/py-evm/compare/master...lithp:lithp/turbo-account-state It's still very much in "make it work" mode; this will have to be cleaned up a lot before it can be merged in, but here it is!
@lithp you should open a "Draft" PR for that so that it is more easily findable.
Here's the draft PR: https://github.com/ethereum/py-evm/pull/1879
What is wrong?
In order for Trinity to benefit from the improved sync speed Firehose makes possible, it needs to change the way it lays out data on-disk. Specifically, account and storage data is currently stored as `sha(object) -> object`, which means that syncing requires a lot of random reads and writes.

How can it be fixed?
I'm not sure what the best approach is, I'll add some comments below exploring the options.
To limit scope I'm going to try to stick to using leveldb; I think that switching it out for a different database might lead to further performance improvements but would drastically increase the amount of work required.
Some requirements & opportunities
The new layout needs to be able to handle re-orgs, meaning it needs to be able to rewind blocks. This requires that some kind of history is kept.
The new layout needs to be able to quickly generate chunks of leaves and serve them to clients, so other clients will be able to quickly sync.
In order to be a good citizen on the network, the new layout needs to be able to quickly serve `eth/63` and `les/2`. In particular, requests like `GetNodeData` require keeping around a materialized trie or maybe more. Alternatively, we might want to suggest an `eth/64` which does nothing but replace `GetNodeData` with something easier to serve.

The new layout also needs to support quick responses to all the JSON-RPCs.

Currently Trinity runs as an archive node and doesn't perform any pruning. In fact, `sha(object) -> object` makes pruning very difficult because it naturally de-dupes, and the only way to know that you've removed the last reference to an object is to do a very expensive garbage-collection pass of the entire database. Ideally the new layout would make pruning easier so that trinity won't have to save everything.

Trie nodes in random order are about the least compressible disk pages possible. Storing all the accounts next to each other is likely to improve the compression ratio of the database (leveldb compresses on a per-page basis).