Atomicity of database updates across block updates

Sword-Smith commented 9 months ago

This logic in main_loop.rs refers to persisting the state updates after receiving a new block. To achieve atomicity, there should probably only be one call to persist which would imply that only one leveldb database was used under the hood. For that, we need to expand the storage crate to handle key/value pairs instead of just lists and singletons, as it currently supports.

        // flush wallet databases
        self.global_state
            .wallet_state
            .wallet_db
            .as_ref()
            .lock()
            .await
            .persist();

        // flush block_index database
        self.global_state
            .chain
            .archival_state
            .as_ref()
            .unwrap()
            .block_index_db
            .lock()
            .await
            .flush();

        // persist archival_mutator_set, with sync label
        self.global_state
            .chain
            .archival_state
            .as_ref()
            .unwrap()
            .archival_mutator_set
            .lock()
            .await
            .set_sync_label(
                self.global_state
                    .chain
                    .archival_state
                    .as_ref()
                    .unwrap()
                    .get_latest_block()
                    .await
                    .hash,
            );
        self.global_state
            .chain
            .archival_state
            .as_ref()
            .unwrap()
            .archival_mutator_set
            .lock()
            .await
            .persist();
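
For illustration, the "one call to persist" idea could look roughly like the sketch below. It uses an in-memory `BTreeMap` as a stand-in for a single LevelDB instance. The names `Store`, `WriteBatch`, and the table prefixes are hypothetical and are not the API of the storage crate; the only point is that writes from several logical tables are collected first and then applied in one step.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one underlying key/value database.
#[derive(Default)]
pub struct Store {
    data: BTreeMap<Vec<u8>, Vec<u8>>,
}

/// Pending writes from several logical tables, collected before one persist.
#[derive(Default)]
pub struct WriteBatch {
    ops: Vec<(Vec<u8>, Vec<u8>)>,
}

/// Build the physical key: table name, a '/' separator, then the logical key.
fn full_key(table: &str, key: &[u8]) -> Vec<u8> {
    let mut k = table.as_bytes().to_vec();
    k.push(b'/');
    k.extend_from_slice(key);
    k
}

impl WriteBatch {
    /// Queue a write. Nothing reaches the store until `Store::persist`.
    pub fn put(&mut self, table: &str, key: &[u8], value: &[u8]) {
        self.ops.push((full_key(table, key), value.to_vec()));
    }
}

impl Store {
    /// Apply every queued write in one call. A real database would have to
    /// make this step crash-safe; here it is only a single in-memory update.
    pub fn persist(&mut self, batch: WriteBatch) {
        self.data.extend(batch.ops);
    }

    /// Read a value back from one logical table.
    pub fn get(&self, table: &str, key: &[u8]) -> Option<&[u8]> {
        self.data.get(&full_key(table, key)).map(|v| v.as_slice())
    }
}
```

With this shape, the three flushes above would become three sets of `put` calls on one batch followed by a single `persist`, so a reader never sees the block index updated without the matching mutator set sync label.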

dan-da commented 9 months ago

I have an incoming PR that addresses this and many other things by locking around GlobalData. It's a big refactor, I think too big to test, discuss/review, and merge before alphanet-v5, but probably immediately after.

Also, it is on my mental todo to create a DbtHashTable type in twenty_first::storage, just for completeness. I wonder if you have other uses in mind for it?

dan-da commented 8 months ago

I've been thinking about this a lot, and doing a bit of research and experimentation the past few days.

As per my comment in #91, we are transactional/serializable in the code, but we don't have true atomicity for block updates because our data is divided into 5 separate stores:

1. blocks. data, on disk. (is this internally atomic? unknown to me)
2. block_index, levelDB. (atomic)
3. mutator_set, levelDB. (atomic)
4. wallet, levelDB. (atomic)
5. banned_ips, levelDB. (atomic)

It seems fine for (5) to be separate. Banned IPs are unrelated to the blockchain or wallet state.

Even if we somehow put (2,3,4) into a single levelDB, (1) is still separate. It would be an improvement though.

There is, however, a strong argument to be made that wallet state is logically separate from blockchain state. Indeed, some blockchains, e.g. Monero, implement only blockchain state in the core/node server. Others implement multiple wallets. Further, users may need to move or back up their wallet data without everything else. So it would likely be a mistake to tie them into a single DB.

I think then, this means our ideal situation would be that all blockchain state is atomic (1,2,3), and all wallet state (4) is (separately) atomic, as is any peer state, eg banned_ips (5).

Anyway, I think these areas are worth further discussion/thought:

a) is it any problem for wallet state to be temporarily out of sync with blockchain state?

b) are blocks (1) stored on disk in an atomic fashion -- all-or-nothing write for each block update?

c) how might we make (1,2,3) an atomic operation? Would this require storing blockchain data inside our DB? or using some kind of write-ahead log scheme perhaps?
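
To make the write-ahead log idea in (c) concrete, here is a minimal in-memory sketch. Everything in it (`Node`, `Entry`, `Write`, the store names) is hypothetical and only illustrates the general technique: first record the complete set of writes for a block update in one journal entry, then apply them to the separate stores, and on restart re-apply any committed entry that was not fully applied.

```rust
use std::collections::BTreeMap;

/// One write destined for one of the separate stores (hypothetical names).
#[derive(Clone, Debug, PartialEq)]
pub struct Write {
    pub store: &'static str,
    pub key: String,
    pub value: String,
}

/// A journal entry: all writes belonging to a single block update.
#[derive(Clone, Debug)]
pub struct Entry {
    pub writes: Vec<Write>,
    /// False models a log record that was only partially written.
    pub committed: bool,
}

#[derive(Default)]
pub struct Node {
    /// Stand-in for a durable write-ahead log.
    pub journal: Vec<Entry>,
    /// Stand-in for the separate stores, keyed by store name.
    pub stores: BTreeMap<&'static str, BTreeMap<String, String>>,
}

impl Node {
    /// Step 1: record the full set of writes as one committed entry.
    /// In a real system this would be the single durable, atomic step.
    pub fn log_block_update(&mut self, writes: Vec<Write>) {
        self.journal.push(Entry { writes, committed: true });
    }

    /// Step 2 (also the recovery path after a crash): apply all committed
    /// entries, skip uncommitted ones, then clear the journal. Re-applying
    /// is harmless because each write is an idempotent "set key to value".
    pub fn apply_journal(&mut self) {
        for entry in self.journal.iter().filter(|e| e.committed) {
            for w in &entry.writes {
                self.stores
                    .entry(w.store)
                    .or_default()
                    .insert(w.key.clone(), w.value.clone());
            }
        }
        self.journal.clear();
    }
}
```

A crash between `log_block_update` and `apply_journal` leaves the stores untouched but the journal intact, so calling `apply_journal` at startup brings every store to the same block. Whether this is preferable to storing block data inside the DB is exactly the open question above.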

> For that, we need to expand the storage crate to handle key/value pairs instead of just lists and singletons

I made an experimental storage::DbtMap. But it immediately became clear that levelDB is not well suited for this. DbtMap implies a logical sub-set of key/val pairs in a LevelDB database. But somehow we have to keep track of which keys are ours, and be able to iterate them. I found 3 possible solutions, but none are very good. See writeup in https://github.com/Neptune-Crypto/twenty-first/pull/181.

note: DbtVec does not have this problem because indices are incrementing by 1. So one can easily find all the keys just by adding 1 each time.
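
The point about DbtVec can be sketched as follows. This is not the actual DbtVec code; `KvVec` and its key layout are assumptions made for illustration. The sketch stores a vector in a flat, unordered key/value map as one length key plus one key per index, so every key can be derived from the length alone.

```rust
use std::collections::HashMap;

/// Hypothetical vector stored in a flat key/value store:
/// one "len" key plus one key per index.
pub struct KvVec<'a> {
    name: &'a str,
}

impl<'a> KvVec<'a> {
    pub fn new(name: &'a str) -> Self {
        KvVec { name }
    }

    fn len_key(&self) -> String {
        format!("{}/len", self.name)
    }

    fn item_key(&self, i: u64) -> String {
        format!("{}/{}", self.name, i)
    }

    /// Number of stored items; a missing length key means an empty vector.
    pub fn len(&self, db: &HashMap<String, String>) -> u64 {
        db.get(&self.len_key())
            .and_then(|s| s.parse().ok())
            .unwrap_or(0)
    }

    /// Append a value at index `len`, then bump the stored length.
    pub fn push(&self, db: &mut HashMap<String, String>, value: &str) {
        let len = self.len(db);
        db.insert(self.item_key(len), value.to_string());
        db.insert(self.len_key(), (len + 1).to_string());
    }

    /// Every key is derived by counting from 0 up to the length,
    /// so no key iteration or ordering is required from the store.
    pub fn values(&self, db: &HashMap<String, String>) -> Vec<String> {
        (0..self.len(db))
            .filter_map(|i| db.get(&self.item_key(i)).cloned())
            .collect()
    }
}
```

A map has no equivalent of "count from 0 to len": its keys are arbitrary, so the structure must either scan the store or keep its own record of which keys belong to it, which is the difficulty described for DbtMap.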

> To achieve atomicity, there should probably only be one call to persist which would imply that only one leveldb database was used under the hood.

yes. So how would we achieve this? It seems we at least would want to combine the block-index and mutator-set.
block-index could be converted to a DbtMap I suppose, but again DbtMap is not good for large sets, and block-index is a potentially huge set, over time.

Or maybe the mutator-set Schema could be stored inside the block-index DB, and block-index code remains unchanged. I'm not certain if there are any problems with that or not. It's a possibility.

I also started thinking (daydreaming?) about what an ideal replacement for levelDB would look like. I came up with this list:

written in rust. (or at least good rust bindings)
production-ready. (solid, well-tested, 1.0+)
in-process. (not a server DB)
key/val store. (at least)
namespaces / sub-tables.
ability to store native rust types. (structs, enums, etc). (nice to have, not critical)
transactional. begin/rollback/commit.
fast

I spent a few hours looking around at all the rust in-process DBs. The best match seems to be redb. It doesn't provide async APIs, but there is a simple workaround to make the API async-friendly. It supports almost everything levelDB does, plus sub-tables, a transaction API, and more.

Also, there is native db (see: reddit discussion) that sits on top of redb and offers derive macros that make it very simple to persist rust structs and other types into the DB. This looks super easy to use and could potentially replace twenty_first::storage entirely.

Final thoughts:

Switching out the storage layer is a big task and a decision not to be taken lightly. So we should carefully weigh pros/cons.

I think we need to consider if levelDB will serve our long-term needs or not. And if not, what is the cost of switching now, vs switching after mainnet, when we already have (hopefully) lots of users?

And even if we switched it out, there remains the issue of the blocks file-store, and how to make it atomic with the other 2 blockchain stores -- unless we could actually store the blocks data in a combined blockchain DB that holds all 3.

aszepieniec commented 8 months ago

> a) is it any problem for wallet state to be temporarily out of sync with blockchain state?

This is potentially hazardous and different from Bitcoin/Monero because in Neptune the wallet state contains membership proofs which are specific to the blockchain state. If there is a large rift between membership proofs and blockchain state, then UTXOs can become effectively unspendable. Whenever the client stores all blocks on the path from what the current membership proofs are synced to, to the current canonical chain tip, it should be able to recover membership proofs no problem. When this path is not stored, the client relies on the goodwill of peers (and we generally don't want to do that).

dan-da commented 8 months ago

> This is potentially hazardous and different from Bitcoin/Monero because in Neptune the wallet state contains membership proofs which are specific to the blockchain state

thx for clarifying. I had a doubt about that. I'm also still trying to understand what the ecosystem can/will look like with regards to 3rd party wallet software, light wallets, etc.

As in, what does the above mean for eg a 3rd party multi-coin wallet that is trying to implement neptune support? Maybe it means they need to at least impl light-mode, and then request blocks from peers as needed?

dan-da commented 8 months ago

There's been some discussion about bitcoin's behavior, so I looked into it a little, and this is the best description I've found:

https://bitcoin.stackexchange.com/a/69762/49396

Reading between the lines, and just from the fact that multiple levelDB are involved, I think they don't actually attempt true ACID type all-or-nothing behavior for blockchain writes. Instead they figure that the blockchain files are the master copy, they are written out lazily from mem on a slow timer. On a restart, the node can detect if a blockchain update didn't finish and restart it. Further, levelDB data is just summary views, and can be rebuilt from the master blockchain if necessary (corrupted or out-of-sync). Presumably there are a bunch of checks during startup to detect problems.
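
The "master copy plus rebuildable summary views" pattern described above can be sketched like this. It is not Bitcoin's actual startup logic; `Block`, `BlockIndex`, and both functions are hypothetical names, and the consistency check is deliberately simplistic.

```rust
use std::collections::BTreeMap;

/// Hypothetical block record in the master block store.
#[derive(Clone, Debug, PartialEq)]
pub struct Block {
    pub height: u64,
    pub hash: String,
}

/// Derived view: block hash -> height. It holds no information that is
/// not already present in the blocks themselves.
pub type BlockIndex = BTreeMap<String, u64>;

/// True if the index covers exactly the blocks in the master store.
pub fn index_is_consistent(blocks: &[Block], index: &BlockIndex) -> bool {
    index.len() == blocks.len()
        && blocks
            .iter()
            .all(|b| index.get(&b.hash) == Some(&b.height))
}

/// Startup check: keep the index if it matches the master copy,
/// otherwise discard it and rebuild it from the blocks.
pub fn load_or_rebuild_index(blocks: &[Block], index: BlockIndex) -> BlockIndex {
    if index_is_consistent(blocks, &index) {
        index
    } else {
        blocks
            .iter()
            .map(|b| (b.hash.clone(), b.height))
            .collect()
    }
}
```

Under this design a crash between writing a block and updating the index is not fatal: the next startup notices the mismatch and repairs the derived data, at the cost of a possibly slow rebuild.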

Indeed, I've seen some of these behaviors myself and have needed to rebuild the block index db, etc.

Having used ACID databases for many years, I much prefer those characteristics and guarantees. But then do we want to put hundreds of gigs of block data into a DB container? vs a "simple" file format of our own devising?

For atomicity of file system writes, a technique I've used often is to write new data in a tmp file, then mv into place once it is fully written/synced. That's because on most modern filesystems, a file rename is an atomic operation. This can also be done with a directory, so multiple files can be written into the tmp dir, then mv dir into place.
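
A minimal sketch of the tmp-file-then-rename technique using only the Rust standard library. The function name `write_atomically` and the `.tmp` naming are assumptions for illustration. The fixed temp name is not safe for concurrent writers, and full crash durability on some systems also requires syncing the parent directory, which this sketch omits.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `data` to `path` so that readers see either the old file or the
/// complete new file, never a partially written one.
pub fn write_atomically(path: &Path, data: &[u8]) -> io::Result<()> {
    // The temp file sits next to the target (same name, extension replaced
    // by "tmp") so that the rename stays within one filesystem.
    let tmp_path = path.with_extension("tmp");

    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(data)?;
    // Ask the OS to flush the contents to disk before the rename.
    tmp.sync_all()?;
    drop(tmp);

    // Move the finished file into place, replacing any existing target.
    fs::rename(&tmp_path, path)
}
```

Usage would be along the lines of `write_atomically(Path::new("blocks/blk0001.dat"), &bytes)?`, where the path is made up for the example. The same pattern extends to a directory: write all files into a temp directory, then rename the directory.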

So I'm honestly a little baffled as to why bitcoin does it the way described in the link above. I imagine/hope there's a good reason, but I'm not seeing the full picture yet.

Still, a good argument can be made that if it's good enough for bitcoin, it's good enough for us. A counter-argument would be that we are trying to build something better, and if we see an opportunity for a better solution, we should go for it.

So I'm not advocating any approach here, just in analysis mode.

almost forgot: I also found this article by Sergio Demian Lerner about relaxing ACID properties in EVM blockchains, which is an interesting read. https://medium.com/iovlabs-innovation-stories/evm-blockchain-scalability-acid-database-properties-a196b2200

edit: this bitcointalk thread about levelDB reliability is also relevant/interesting. https://bitcointalk.org/index.php?topic=1394020.0

dan-da commented 2 months ago

a while back I asked about bitcoin's db atomicity on stack exchange and Pieter Wuille answered here:

https://bitcoin.stackexchange.com/a/121561/49396

dan-da commented 3 days ago

flawless - a durable execution engine for rust. https://flawless.dev/docs/

enables recovery from external errors like kill -9, power cut, etc. also enables atomicity across databases, eg a transaction that spans pgsql and mysql. Or in our case, it could be across wallet-db, block-db, block-filestore, etc.

it's an interesting approach. probably too late in our dev cycle to adopt it, but good to know of such tools for the toolbox. maybe down the road...

Neptune-Crypto / neptune-core

Atomicity of database updates across block updates #79