Execution can't recover after crash

ethereum / trin

An Ethereum portal client: a json-rpc server with nearly instant sync, and low CPU & storage usage


Execution can't recover after crash #1440

Open morph-dev opened 2 months ago

morph-dev commented 2 months ago

While I was running trin execution, era1 deserialization failed (the cause is irrelevant to this issue).

When I tried to resume, it failed very soon afterwards with the error: Error: database error: not found database error block_hash

After looking a bit more into it, I found the problem.

BlockExecutor::manage_block_hash_serve_window modifies the db directly after every processed block. If execution crashes (as happened to me) and we try to resume it, the stored block hashes will not be the correct ones: we will have the 256 blocks preceding the moment of the crash, not the 256 preceding the saved checkpoint.
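To make the failure mode concrete, here is a minimal simulation of it. This is only a sketch under assumptions: HashTable, fake_hash, process_block and missing_hashes are hypothetical stand-ins, not trin's actual types or functions, and the checkpoint and crash block numbers are made up.

```rust
use std::collections::BTreeMap;

/// Number of recent block hashes that must stay available.
const WINDOW: u64 = 256;

/// Hypothetical stand-in for the on-disk block_number -> block_hash table.
type HashTable = BTreeMap<u64, [u8; 32]>;

/// Placeholder hash: the block number stored in the first 8 bytes.
fn fake_hash(number: u64) -> [u8; 32] {
    let mut hash = [0u8; 32];
    hash[..8].copy_from_slice(&number.to_be_bytes());
    hash
}

/// Models the eager behaviour: after each block, write its hash straight to
/// the table and delete the hash that just fell out of the window.
fn process_block(table: &mut HashTable, number: u64) {
    table.insert(number, fake_hash(number));
    if number >= WINDOW {
        table.remove(&(number - WINDOW));
    }
}

/// Block numbers whose hashes are needed to resume at `resume_at`
/// but are no longer in the table.
fn missing_hashes(table: &HashTable, resume_at: u64) -> Vec<u64> {
    let start = resume_at.saturating_sub(WINDOW);
    (start..resume_at)
        .filter(|number| !table.contains_key(number))
        .collect()
}

fn main() {
    let mut table = HashTable::new();
    let checkpoint = 1_000; // next block to execute as of the last state flush
    let crash_at = 1_100; // execution dies here, before the next flush

    for number in 0..crash_at {
        process_block(&mut table, number);
    }

    // State rolls back to the checkpoint, but the hash table does not.
    let missing = missing_hashes(&table, checkpoint);
    println!(
        "{} hashes stored, {} missing for resume at {}",
        table.len(),
        missing.len(),
        checkpoint
    );
}
```

With these made-up numbers the table ends up holding blocks 844..=1099, so resuming at block 1000 finds 100 of the 256 required hashes (744..=843) already deleted.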

Possible solutions:

(preferred) Keep track of block hashes in memory and flush them to the db when the rest of the state is flushed.
Before execution starts, make sure we have all required block hashes in the db (and seed them if that's not the case).
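A rough sketch of the preferred option could look like the following. Everything here is an assumption made for illustration: Db, PendingHashes and fake_hash are hypothetical in-memory stand-ins rather than trin's storage API, and a real implementation would still need the flush itself to reach disk atomically together with the rest of the state.

```rust
use std::collections::BTreeMap;

/// Number of recent block hashes that must stay available.
const WINDOW: u64 = 256;

/// Hypothetical stand-in for the database; not trin's real storage API.
#[derive(Default)]
struct Db {
    block_hashes: BTreeMap<u64, [u8; 32]>,
    /// Next block to execute, as of the last flush.
    checkpoint: u64,
}

/// Hypothetical buffer for hashes of blocks executed since the last flush.
#[derive(Default)]
struct PendingHashes {
    inserts: BTreeMap<u64, [u8; 32]>,
}

impl PendingHashes {
    /// Records a hash in memory only; the db is left untouched.
    fn record(&mut self, number: u64, hash: [u8; 32]) {
        self.inserts.insert(number, hash);
    }

    /// Looks in the buffer first, then falls back to the db.
    fn get(&self, db: &Db, number: u64) -> Option<[u8; 32]> {
        self.inserts
            .get(&number)
            .or_else(|| db.block_hashes.get(&number))
            .copied()
    }

    /// Moves buffered hashes into the db, prunes everything older than the
    /// window, and advances the checkpoint, all in the same step.
    fn flush(&mut self, db: &mut Db, next_block: u64) {
        db.block_hashes.append(&mut self.inserts);
        let oldest = next_block.saturating_sub(WINDOW);
        db.block_hashes = db.block_hashes.split_off(&oldest);
        db.checkpoint = next_block;
    }
}

/// Placeholder hash: the block number stored in the first 8 bytes.
fn fake_hash(number: u64) -> [u8; 32] {
    let mut hash = [0u8; 32];
    hash[..8].copy_from_slice(&number.to_be_bytes());
    hash
}

fn main() {
    let mut db = Db::default();
    let mut pending = PendingHashes::default();

    // Execute up to the checkpoint and flush.
    for number in 0..1_000 {
        pending.record(number, fake_hash(number));
    }
    pending.flush(&mut db, 1_000);

    // Execute 100 more blocks, then "crash" before the next flush.
    for number in 1_000..1_100 {
        pending.record(number, fake_hash(number));
    }
    assert!(pending.get(&db, 1_050).is_some());
    drop(pending); // the buffer is lost, the db is unchanged

    let start = db.checkpoint.saturating_sub(WINDOW);
    let complete = (start..db.checkpoint).all(|n| db.block_hashes.contains_key(&n));
    println!("resume at {}, window complete: {}", db.checkpoint, complete);
}
```

Because the db only changes inside flush, the stored hashes always match the saved checkpoint, and a crash between flushes loses nothing that the resumed execution would not recompute anyway.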

morph-dev commented 2 months ago

Alternatively, we can just never delete block_number->block_hash entries from the db. Clearly not the most optimized solution, but definitely the easiest one.

It's only ~64 bytes per block, so it's not the end of the world (a total of ~1.2 GB for the entire chain at the moment).
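As a back-of-the-envelope check of that estimate, assuming roughly 20 million blocks on the chain (my assumption, not a figure from this thread) and the ~64 bytes per entry quoted above:

```rust
/// Size of a table with one fixed-size entry per block.
fn table_size_bytes(block_count: u64, bytes_per_entry: u64) -> u64 {
    block_count * bytes_per_entry
}

fn main() {
    let block_count = 20_000_000; // assumed approximate chain length
    let bytes = table_size_bytes(block_count, 64);
    println!("{:.2} GB", bytes as f64 / 1e9); // prints "1.28 GB"
}
```

That lands in the same ballpark as the ~1.2 GB figure.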

KolbyML commented 2 months ago

I think the right solution is to switch from RocksDB to LMDB or MDBX. They are both ACID compliant, so a crash wouldn't cause this problem: we could set it to commit everything only once we are done with the full block execution cycle.

That is better than one-off solutions like those listed above, which won't solve the root problem.

KolbyML commented 2 months ago

https://github.com/ethereum/trin/pull/1451#issuecomment-2351083111 https://github.com/ethereum/trin/pull/1451#issuecomment-2351083985

Additional comments I made on this problem, and why switching to an ACID database solves them

morph-dev commented 2 months ago

Why can't we use RocksDB? Instead of using rocksdb::DB, we can use rocksdb::TransactionDB or rocksdb::OptimisticTransactionDB. Difference between transaction and optimistic transaction can be found here: https://github.com/facebook/rocksdb/wiki/Transactions .

I think in our case, we can even use rocksdb::DB::write. Might be the simplest solution.

KolbyML commented 2 months ago

Erigon has a write-up here:

https://github.com/erigontech/erigon/wiki/Choice-of-storage-engine

They tried about five different database solutions before ending up with MDBX.

They say it isn't ACID,

> Why can't we use RocksDB? Instead of using rocksdb::DB, we can use rocksdb::TransactionDB or rocksdb::OptimisticTransactionDB. Difference between transaction and optimistic transaction can be found here: https://github.com/facebook/rocksdb/wiki/Transactions .
>
> I think in our case, we can even use rocksdb::DB::write. Might be the simplest solution.

This looks like a good start, as it seems more reliable than our current solution, but because various projects have pointed out issues with it, I am inclined to think it is a bad choice long term.