Neptune-Crypto / neptune-core

anonymous peer-to-peer cash
Apache License 2.0

feat: global state lock, async leveldb, etc. #89

Closed dan-da closed 6 months ago

dan-da commented 6 months ago

First off, I apologize for the size of this PR. I normally keep PRs as small/focused as I can.

I started a branch while working on twenty_first storage in order to test those changes in neptune. As I iterated on storage, that led to the AtomicRw lock in twenty-first, which in turn led to the tokio AtomicRw in neptune. At first I was replacing and refactoring (encapsulating) the existing locks, and I wanted to be able to log/trace their activity, so I added a facility for that. As I dug deeper I found that we still couldn't really guarantee concurrent write serialization at the highest level, i.e. across Blockchain, Mempool, and Wallet state. So I experimented with a lock around GlobalState, and it appears to work well and also simplify code in the other State types. Most other changes were sort of incidental along the way, as I was deep into refactoring.

Here are the high level changes:

Many of these changes are inter-related. However, if requested I could pull out some things into smaller, easier to review PRs.

The GlobalState lock is the most significant architectural change and so may need the most justification. Here's a comment from the code that describes and makes a case for it:

`GlobalStateLock` holds a [`tokio::AtomicRw`] ([`RwLock`]) over [`GlobalState`].

Conceptually** all reads and writes of application state
require acquiring this lock.

Having a single lock is useful for a few reasons:
 1. Enables write serialization over all application state.
    (blockchain, mempool, wallet, global flags)
 2. Readers see a consistent view of data.
 3. Makes it easy to reason about locking.
 4. Simplifies the codebase.

The primary drawback is that long write operations can
block readers.  As such, every effort should be made to keep
write operations as short as possible, though
correctness/atomicity have first priority.

Using an `RwLock` is beneficial for concurrency vs using a `Mutex`.
Readers do not block each other.  Only a writer blocks readers.
See [`RwLock`](std::sync::RwLock) docs for details.

** At the present time, storage types in twenty_first::storage
implement their own locking, which means they can be mutated
without acquiring the `GlobalStateLock`.  This may change in
the future.

Usage conventions:

```rust
// property naming:
struct Foo {
    global_state_lock: GlobalStateLock,
}

// read guard naming:
let global_state = foo.global_state_lock.lock_guard().await;

// write guard naming:
let global_state_mut = foo.global_state_lock.lock_guard_mut().await;
```

These conventions make it easy to distinguish read access from write
access when reading and reviewing code.

When using a read-guard or write-guard, always drop it as soon as possible.
Failure to do so can result in poor concurrency or deadlock.

Deadlocks are generally not hard to track down.  Lock events are traced.
The app log records each `TryAcquire`, `Acquire` and `Release` event
when run with `RUST_LOG='info,neptune_core=trace'`.

If a deadlock has occurred, the log will end with a `TryAcquire` event
(read or write); just scroll up to find the previous `Acquire`-for-write
event to see which thread is holding the lock.

So yeah I think the GlobalStateLock is a good change. If nothing else, it makes a lot of code easier to read/maintain without all the interior locks. But if others don't agree, I am willing to do some more work to inform and hopefully justify it, such as creating/running some perf benchmarks... (though that would take time that could be spent on other things). Alternatively, some of the other changes could be split out into separate PRs. I also have some older code from the branch that encapsulates existing locks rather than using GlobalStateLock.

Notes:

Testing:

dan-da commented 6 months ago

I just pushed a fix for the prune_abandoned_monitored_utxos() issue discovered in testing.

Here are my notes about the investigation.

debug_prune_abandoned.txt

Sword-Smith commented 6 months ago

Regarding commit 9b9ff2c, which solves the deadlock issue in prune_abandoned_monitored_utxos, I found this comment to be helpful:

I see the issue. You shouldn’t be using futures::executor alongside with Tokio as neither executor (Tokio or futures) are aware of one another, which explains the blocking behavior you’re seeing.

I’ll also note that #[tokio::main] expands into tokio::Runtime::block_on, which means this code has two nested block_ons.

Good job solving this problem! It's been a thorn in our side for quite a while.

Sword-Smith commented 6 months ago

I can get my node running this version to crash like so:

$  neptune-cli prune-abandoned-monitored-utxos 
5 monitored UTXOs marked as abandoned

Then wait for the resync_membership_proofs task to be run by main_loop. This calls resync_membership_proofs_from_stored_blocks which panics on this check:

assert_eq!(block_msa, apply_block.body.next_mutator_set_accumulator);

Error message:

thread 'main' panicked at /home/thv/repos/neptune-core/src/models/state/mod.rs:809:17:                                                                                                                             
assertion `left == right` failed 

where the swbf_inactive fields of the two MS accumulators being compared disagree.

I believe this problem is a regression, as this is a crash I haven't seen before.

Edit: Most likely due to corrupted database. I managed to recover my balances by resetting the node. Let's not pursue this problem any further unless it arises again.

Sword-Smith commented 6 months ago

I can get my node running this version to crash like so: ...

I reset the node, with prunable UTXOs, and could not reproduce the problem. This is likely due to a corrupted database, and I'm not sure we should pursue this further for now.

Sword-Smith commented 6 months ago

LGTM