morph-dev opened this issue 2 months ago
Alternatively, we can just never delete the `block_number -> block_hash` mapping from the db. Clearly not the most optimized solution, but definitely the easiest one.
It's only ~64 bytes per block, so it's not the end of the world (a total of ~1.2 GB for the entire chain at the moment).
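For scale (my own back-of-the-envelope check, assuming roughly 20 million mainnet blocks at the time of writing): 20,000,000 × 64 B ≈ 1.28 GB, which lines up with the ~1.2 GB figure above; the ~64 bytes per entry is presumably the block-number key, the 32-byte hash, and per-entry overhead.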
I think the right solution is to change from RocksDB to LMDB or MDBX. They are both ACID compliant, so a crash wouldn't be a problem; we could commit everything in one transaction once we are done with the full block execution cycle (see the sketch below).
That would address the root problem, instead of the one-off solutions listed above, which won't.
Additional comments I made on this problem, and why switching to an ACID database solves them:
https://github.com/ethereum/trin/pull/1451#issuecomment-2351083111
https://github.com/ethereum/trin/pull/1451#issuecomment-2351083985
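For illustration, a rough sketch of the commit-at-checkpoint pattern on an ACID store, using the heed LMDB bindings. The crate choice, database name, and key layout here are assumptions for the example, not trin's actual code or schema, and the API is written from memory, so treat exact signatures as approximate:

```rust
use heed::types::Bytes;
use heed::{Database, EnvOpenOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // LMDB wants its maximum map size declared up front,
    // and max_dbs must be set before named databases can be created.
    let env = unsafe {
        EnvOpenOptions::new()
            .map_size(2 * 1024 * 1024 * 1024) // 2 GiB
            .max_dbs(1)
            .open("execution-db")?
    };

    // Hypothetical table holding the block_number -> block_hash serve window.
    let mut wtxn = env.write_txn()?;
    let block_hashes: Database<Bytes, Bytes> =
        env.create_database(&mut wtxn, Some("block_hashes"))?;

    // Everything written inside this transaction becomes visible atomically on
    // commit. If the process crashes mid-cycle, the db is still exactly at the
    // previous checkpoint, so resuming from it stays consistent.
    for number in 0u64..256 {
        let hash = [0u8; 32]; // placeholder for the real block hash
        block_hashes.put(&mut wtxn, &number.to_be_bytes(), &hash)?;
    }
    // ... write the execution state / checkpoint marker in the same transaction ...
    wtxn.commit()?;
    Ok(())
}
```

MDBX bindings expose essentially the same begin-transaction / put / commit shape, so the pattern carries over either way.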
Why can't we use RocksDB? Instead of using `rocksdb::DB`, we can use `rocksdb::TransactionDB` or `rocksdb::OptimisticTransactionDB`.
The difference between a transaction and an optimistic transaction is explained here: https://github.com/facebook/rocksdb/wiki/Transactions.
I think in our case we can even use `rocksdb::DB::write`. It might be the simplest solution.
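A minimal sketch of the `rocksdb::DB::write` route with the rust-rocksdb crate (the key layout and the checkpoint key are invented for illustration, not trin's actual schema):

```rust
use rocksdb::{Options, WriteBatch, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "execution-db")?;

    // Collect every mutation for the block-hash serve window plus the checkpoint
    // into one WriteBatch; RocksDB applies the whole batch atomically.
    let mut batch = WriteBatch::default();
    for number in 0u64..256 {
        let hash = [0u8; 32]; // placeholder block hash
        batch.put(number.to_be_bytes(), hash);
    }
    batch.put(b"checkpoint", 256u64.to_be_bytes());
    db.write(batch)?; // either everything lands or nothing does

    Ok(())
}
```

`TransactionDB`/`OptimisticTransactionDB` add read-your-own-writes and conflict handling on top of this; for a single writer that only needs an atomic commit of a batch, a plain `WriteBatch` is likely enough.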
Erigon has a write-up here: https://github.com/erigontech/erigon/wiki/Choice-of-storage-engine
They tried about 5 different database solutions and ended up with MDBX. They say it isn't ACID.
> Why can't we use RocksDB? Instead of using `rocksdb::DB`, we can use `rocksdb::TransactionDB` or `rocksdb::OptimisticTransactionDB`. The difference between a transaction and an optimistic transaction can be found here: https://github.com/facebook/rocksdb/wiki/Transactions. I think in our case, we can even use `rocksdb::DB::write`. Might be the simplest solution.
This looks like a good initial step, as it seems more reliable than our current solution, but because various projects have pointed out issues with RocksDB, I am inclined to think it is a bad choice long term.
While running trin execution, era1 deserialization happened to fail (irrelevant to this issue).
When I tried to resume running it, it would fail very soon afterwards with the error:
`Error: database error: not found database error block_hash`
After looking a bit more into it, I found the problem.
`BlockExecutor::manage_block_hash_serve_window` modifies the db directly after every processed block. If the execution crashes (as happened to me) and we try to resume it, the stored block hashes will not be the correct ones (we will have the 256 blocks from the moment of the crash, not from the saved checkpoint).

Possible solutions: