
Adopt the Turbo-Geth database layout #779

Closed lithp closed 3 years ago

lithp commented 5 years ago

What is wrong?

In order for Trinity to benefit from the improved sync speed Firehose makes possible, it needs to change the way it lays out data on disk. Specifically, account and storage data is currently stored as sha(object) -> object, which means that syncing requires a lot of random reads and writes.

How can it be fixed?

I'm not sure what the best approach is, I'll add some comments below exploring the options.

To limit scope I'm going to try to stick with leveldb; switching it out for a different database might lead to further performance improvements but would drastically increase the amount of work required.

Some requirements & opportunities

lithp commented 5 years ago

What Turbo Geth does

The Turbo Geth database consists of a few sections:

I believe that Turbo Geth only stores the leaf nodes for each trie; it keeps all the intermediate nodes for the most recent trie in memory. For instance, the account storage section only stores sha(address) -> account pairs. This means that supporting a hypothetical GetNodeDataByPath requires running range queries over the database and generating nodes as they're requested.

I believe that account history is stored as (sha(address), block number) -> account state as of that block. To find out what the state of a given account was as of a given block, you can perform a range query, scanning forward starting at (sha(address), block number). The first value you find is your answer; if you don't find any values, then the state as of that block is the same as the current state. To rewind the account database to a given block number you need to perform the above operation on every account. (If I'm guessing right, this must be why the block # -> changed keys index was added.)
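If that guess is right, the lookup is cheap. Here's a minimal sketch of it, assuming a (sha(address), block number) -> account layout with fixed-width big-endian block numbers, and using plyvel as a stand-in leveldb binding (Trinity's actual database interface differs):

```python
import struct
from typing import Optional

import plyvel


def account_at_block(db: plyvel.DB, address_hash: bytes,
                     block_number: int) -> Optional[bytes]:
    # Big-endian fixed-width block numbers sort lexicographically in
    # numeric order, which is what the forward scan relies on.
    start = address_hash + struct.pack('>Q', block_number)
    for key, value in db.iterator(start=start):
        if not key.startswith(address_hash):
            break  # scanned past this account's history
        return value  # first row at or after (address, block #) wins
    return None  # no history entry: the current state applies
```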

Turbo Geth calls its history store "reverse differences". It means pruning can be supported by dropping all keys (sha(address), block number) where the block number is below a given value. The scheme is complicated by CREATE2: in a CREATE2 world, SELFDESTRUCT operations mean that tombstone values must be written for every storage item that contract owns. This could be a lot of tombstones!
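Under that layout, pruning could look something like this sketch (same assumptions as the previous one; note that without a block # -> changed keys index it has to scan the whole history section):

```python
import struct

import plyvel


def prune_history(db: plyvel.DB, cutoff: int) -> None:
    # Drop every reverse diff whose block number is below `cutoff`.
    # Entries with block # < cutoff can never be returned by a forward
    # scan starting at a block >= cutoff, so they are dead weight.
    with db.write_batch() as batch:
        for key, _ in db.iterator():
            block_number = struct.unpack('>Q', key[-8:])[0]
            if block_number < cutoff:
                batch.delete(key)
```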

lithp commented 5 years ago

Turbotrie

This is from an email thread with someone else working on this problem; I'm leaving out some details and their identity in case they don't want this made public yet.

turbotrie has a couple of other improvements based on simplifying the data model and being more space-efficient. In terms of layout, nodes are stored as (path, trie version #, trie root hash) -> object (I believe). This also supports pruning away older versions.
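If I'm reading that right, a key encoding along these lines would give the layout its pruning property (everything here is a guess on my part):

```python
import struct


def turbotrie_key(path: bytes, version: int, root_hash: bytes) -> bytes:
    # Putting the path first keeps every version of a node adjacent on
    # disk, so older versions can be dropped with a short range scan.
    return path + struct.pack('>Q', version) + root_hash
```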

lithp commented 5 years ago

Should we store intermediate nodes?

lithp commented 5 years ago

Which changes will need to be made?

I believe that many of the changes will need to be made to py-evm, maybe this issue should be transferred there.

lithp commented 5 years ago

What's the MVP?

This is going to be a large undertaking; what's the smallest useful change which can be tested to work and then expanded upon?

Jason suggested running both databases in parallel. I can implement some subset of functionality, such as storing reverse diffs of the account history, and have all calls persist both the old format and the new format. Reads also consult both databases and fail if any discrepancies are noticed. This allows starting with some subset of the functionality and then expanding it to include everything the old format supported.
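A rough sketch of what that wrapper might look like, assuming both layouts sit behind the same account-level interface (all the names here are hypothetical):

```python
class MirroredAccountDB:
    """Persist every write to both layouts; compare reads and fail
    loudly, so discrepancies in the new layout surface immediately."""

    def __init__(self, old_db, new_db):
        self._old = old_db
        self._new = new_db

    def set_account(self, address: bytes, account: bytes) -> None:
        self._old.set_account(address, account)
        self._new.set_account(address, account)

    def get_account(self, address: bytes) -> bytes:
        old_value = self._old.get_account(address)
        new_value = self._new.get_account(address)
        if old_value != new_value:
            raise AssertionError(
                f"database layouts disagree for {address.hex()}")
        return old_value
```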

Potential starting points:

Add an --experimental-db flag to Trinity which enables writing and reading from both databases in parallel.

lithp commented 5 years ago

Some Questions

lithp commented 5 years ago

Some RPCs which involve inspecting the account / state history

Special Bonus

pipermerriam commented 5 years ago

I'm very ok with intermediate solutions that fail to support some of the JSON-RPC methods if the trade-off means a functional client sooner.

lithp commented 5 years ago

> I'm very ok with intermediate solutions that fail to support some of the JSON-RPC methods if the trade-off means a functional client sooner.

Yeah, I absolutely agree! My hope is that there's a small core of functionality here that can be implemented quickly and then expanded upon. I agree that some features (like eth_getProof) not only don't need to happen soon, they probably don't need to happen before beta.

lithp commented 5 years ago

Storing Account history

Almost all the RPCs which look into history require a lookup on (address, block #) -> Account. It's probably too expensive to materialize this for every block. The next best thing would be if we could look up the state of an account with a leveldb range query that returns the first row it finds. (As best I can tell, range queries returning a single row should be nearly as fast as single-key lookups; both involve the same binary search through the data, though this is worth testing!)

The leveldb docs state that forward range queries are "somewhat" faster than backward range queries. It's worth experimenting to see how drastic the difference is, but that probably means the common case should be scanning forward from (address, block #) and returning the first value (assuming the key it finds still belongs to the same account).

This means (address, block #) is preferable to (block #, address), and it also means storing what Turbo Geth calls "reverse diffs": (address, block #) -> value means that address held value up until block #.
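The write side of that scheme is small; a sketch, with the same assumed key encoding as above:

```python
import struct


def record_reverse_diff(batch, address_hash: bytes, block_number: int,
                        previous_value: bytes) -> None:
    # When block `block_number` changes an account, save the value it
    # held *until* that block, keyed so the forward scan finds it.
    key = address_hash + struct.pack('>Q', block_number)
    batch.put(key, previous_value)
```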

Storing reverse diffs means that pruning the oldest blocks is relatively simple, though it makes rewinding blocks a little complicated.

lithp commented 5 years ago

Reading and Writing older states

This might be the most complicated operation. The problem with (address, block #) is that it assumes there are no forks; it can only represent the history of the canonical chain. However, sometimes (during reorgs) we'll want to import blocks which are not children of the tip of the chain. How should this work?

If the block to be imported (the importing block) is a child of a block in the canonical chain, build_state can simply be a view over the database which acts in the same way the RPCs act: each account lookup is actually a range query which returns the first row. Writes are cached in memory and eventually stored in a format which I'll discuss in a few paragraphs.

It's more complicated if the block to be imported is a child of a block which is itself not canonical. In that case trinity will have to find the latest ancestor which is part of the canonical chain, then look up and apply the state transitions for all the non-canonical ancestor blocks (all of this in-memory), and finally compute the state transitions for the block to be imported. If this block is the new canonical block then the database will have to be rewound and the state transitions from the newly canonical blocks applied. (A diagram is probably appropriate here.)
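In sketch form (every helper here is hypothetical; this is just the shape of the walk described above):

```python
def build_state_for_fork(db, parent_hash: bytes):
    # Walk back until we hit the canonical chain, remembering the
    # non-canonical ancestors we passed through.
    non_canonical = []
    block_hash = parent_hash
    while not db.is_canonical(block_hash):
        non_canonical.append(block_hash)
        block_hash = db.get_parent_hash(block_hash)

    # Start from a read-only view of the canonical state as of the
    # latest canonical ancestor, then replay the non-canonical
    # ancestors' state transitions in memory, oldest first.
    state = HistoricalStateView(db, at_block=block_hash)
    overlay = InMemoryOverlay(state)
    for ancestor in reversed(non_canonical):
        overlay.apply(db.get_state_diff(ancestor))
    return overlay  # ready for the importing block to run against
```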

This leaves a few questions:

I believe that this requires an index of block hash -> state changes. Since maintaining this alongside (address, block #) would mean that history is kept twice, maybe block hash -> state changes is only saved for non-canonical blocks. When the canonical fork changes, trinity swaps some blocks from each store into the other.

Rewinding the state for a single account is easy; however, rewinding the state for all accounts is difficult unless you already know which accounts need to be changed, and for this (block #) -> changed accounts is an important index to have.
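With both indexes in place, rewinding a single block is a loop over its changed accounts. A sketch, assuming an empty value serves as the "account did not exist before" tombstone (that detail is my assumption, not settled):

```python
import struct


def rewind_one_block(db, block_number: int) -> None:
    for address_hash in db.get_changed_accounts(block_number):
        key = address_hash + struct.pack('>Q', block_number)
        previous_value = db.get(key)  # the reverse diff for this block
        if previous_value == b'':
            db.delete_current_account(address_hash)  # didn't exist before
        else:
            db.set_current_account(address_hash, previous_value)
        db.delete(key)  # this diff is no longer part of history
```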

lithp commented 5 years ago

A simpler alternative which drops support for RPCs

All of the above is pretty complicated, and it's complicated because of the (address, block #) index, which was added to improve RPC performance. If RPCs were allowed to be much slower (or if users who wanted fast history-searching RPCs had to enable a flag which created the index and made their database larger) then trinity could get by with a simpler schema:

Importing non-canonical blocks still works the same way: an in-memory view of the database is created which undoes and applies transitions to the current state until it's ready for the importing block to be run. To make very old states easier to reach, checkpoints could be created which materialize the entire state at different points along the chain.

lithp commented 5 years ago

The current state

It's possible to quickly serve single-account requests using the (address, block #) -> account layout; however, that layout makes iterating over a bunch of accounts (something Firehose's GetStateRanges requires) slower than it could be. As an optimization, the current state can be stored in a different part of the database. (Turbo Geth does this.) If the current state is duplicated into the new section, writes are a little slower; if it's not kept as part of the history, reads are a little more complicated.
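If the current state gets its own section and is not duplicated into the history, the read path would look roughly like this (reusing the account_at_block sketch from earlier; `history` and `current_state` are hypothetical handles on the two sections):

```python
from typing import Optional


def get_account(db, address_hash: bytes,
                block_number: Optional[int] = None) -> Optional[bytes]:
    if block_number is not None:
        historical = account_at_block(db.history, address_hash, block_number)
        if historical is not None:
            return historical
    # No reverse diff covers that block, so the account hasn't changed
    # since: the current-state section has the answer.
    return db.current_state.get(address_hash)
```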

lithp commented 5 years ago

Current Plan

Phase 0: Preparation

Phase 1: Current Account State

In more detail:

Phase 2: Current Storage Item State

Phase 3: Reverse account diffs

Phase 4:


Some open items:

lithp commented 5 years ago

Getting back onto this train, I still think it's important for Trinity and also for Ethereum 1.x that we give adopting the TurboGeth layout a real shot.

Saving block diffs is mostly finished, though it needs some more thorough testing. Today I started working on the current account state database; I think I have a prototype, and now I need to write a bunch of tests for it.

Some open questions:

lithp commented 5 years ago

The current account database, when hooked up, passes all core tests and also all fixtures. Though I should write some more tests!

Some things left on my list:

lithp commented 5 years ago

Serving GetNodeData

Prompted by @pipermerriam, I spent some time looking at how a TurboGeth-db-powered Trinity could respond to GetNodeData requests. Trinity will need to fire off GetNodeData requests to patch up the trie it gets from Firehose sync, so to be a good network citizen it also needs to be able to serve GetNodeData. (And hopefully a future version of eth, eth/65, will be easier for us to serve, so this complication can be removed.)

I looked a little into how Geth stores tries. The important parts are trie/database.go and core/blockchain.go. Geth stores 128 tries in memory, though I think it never keeps a full trie in memory, just 128 tries worth of dirty nodes. The in-memory nodes have reference counts which makes garbage collecting them easier. Upon shutdown three tries are flushed to disk. As far as I can tell, when Geth first starts up it's only able to serve GetNodeData requests for the last two tries it knows about, it doesn't support serving the full 128 tries until it's been running for a while.

I think Trinity could do something similar. To be a good network citizen, Trinity needs to have a store from hash -> node which encompasses 128 tries worth of nodes. It needs to be able to easily evict tries from this store and add tries to this store.

A silly answer is to build a store which doesn't support trie eviction. Every so often the store is wiped away and the most recent trie regenerated. leveldb loves bulk writes so this might be a quick operation.

A simple answer is to store a single copy of the trie on disk, read it into memory upon startup, and then do approximately what Geth does. All the intermediate nodes constitute 2.5GB of memory, which might balloon into 5GB with Python overhead, which is probably disqualifying.

I think something similar will work though! If a single copy of the trie is stored on disk, the remaining 127 can be kept in memory and the on-disk trie periodically updated. Updating the trie requires some kind of reference counting: the store is node_hash -> (parent_count, node). This is not cheap: every node has up to 16 children, so I expect that when a new trie is added even more nodes will need to be updated than added. I think this is okay though. I'm hand-waving, but there's no need to update the trie every time a block comes in; we could update the trie every 20 blocks, for instance.
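A very rough sketch of that reference-counted store (the encoding and names are mine, and a real version would batch the count updates):

```python
import struct


class RefCountedNodeStore:
    """node_hash -> (parent_count, node); a node is garbage collected
    when its last parent stops referencing it."""

    def __init__(self, db):
        self._db = db

    def ref(self, node_hash: bytes, node: bytes) -> None:
        count = self._count(node_hash)
        self._db.put(node_hash, struct.pack('>I', count + 1) + node)

    def unref(self, node_hash: bytes) -> None:
        raw = self._db.get(node_hash)
        count = struct.unpack('>I', raw[:4])[0]
        if count <= 1:
            self._db.delete(node_hash)  # last parent gone: collect it
        else:
            self._db.put(node_hash,
                         struct.pack('>I', count - 1) + raw[4:])

    def _count(self, node_hash: bytes) -> int:
        raw = self._db.get(node_hash)
        return struct.unpack('>I', raw[:4])[0] if raw else 0
```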

This way we can:

lithp commented 4 years ago

The current account database is nearing completion. Some things which now work:

Some next steps:

A todo list with smaller items:

cburgdorf commented 4 years ago

Exciting! Do you have a wip branch online (the linked one doesn't seem to be up to date)?

pipermerriam commented 4 years ago

Seconding what @cburgdorf said. Can you make sure your work is available for us to look at? Visibility into in-progress work is important!

lithp commented 4 years ago

Yeah, sorry, I thought I had already posted it @cburgdorf! Here's a compare view for the branch I've been using: https://github.com/ethereum/py-evm/compare/master...lithp:lithp/turbo-account-state It's still very much in "make it work" mode and will have to be cleaned up a lot before it can be merged in, but here it is!

pipermerriam commented 4 years ago

@lithp you should open a "Draft" PR for that so that it is more easily findable.

lithp commented 4 years ago

Here's the draft PR: https://github.com/ethereum/py-evm/pull/1879