Neptune-Crypto / neptune-core

anonymous peer-to-peer cash
Apache License 2.0

feat: global state lock, async leveldb, etc. #89

Closed dan-da closed 6 months ago

dan-da commented 6 months ago

First off, I apologize for the size of this PR. I normally keep PRs as small/focused as I can.

I started a branch while working on twenty_first storage in order to test those changes in neptune. As I iterated on storage, that led to the AtomicRw lock in twenty-first, which in turn led to the tokio AtomicRw in neptune. At first I was replacing and refactoring (encapsulating) the existing locks, and I wanted to be able to log/trace their activity, so I added a facility for that. As I dug deeper I found that we still couldn't really guarantee concurrent write serialization at the highest level, i.e. across Blockchain, Mempool, and Wallet state. So I experimented with a lock around GlobalState, and it appears to work well and also simplify code in the other State types. Most other changes were sort of incidental along the way, as I was deep into refactoring.

Here are the high level changes:

Many of these changes are inter-related. However, if requested I could pull out some things into smaller, easier to review PRs.

The GlobalState lock is the most significant architectural change and so may need the most justification. Here's a comment from the code that describes and makes a case for it:

`GlobalStateLock` holds a [`tokio::AtomicRw`] ([`RwLock`]) over [`GlobalState`].

Conceptually** all reads and writes of application state
require acquiring this lock.

Having a single lock is useful for a few reasons:
 1. Enables write serialization over all application state.
    (blockchain, mempool, wallet, global flags)
 2. Readers see a consistent view of data.
 3. Makes it easy to reason about locking.
 4. Simplifies the codebase.

The primary drawback is that long write operations can
block readers.  As such, every effort should be made to keep
write operations as short as possible, though
correctness/atomicity have first priority.

Using an `RwLock` is beneficial for concurrency vs using a `Mutex`.
Readers do not block each other.  Only a writer blocks readers.
See [`RwLock`](std::sync::RwLock) docs for details.

** At the present time, storage types in twenty_first::storage
implement their own locking, which means they can be mutated
without acquiring the `GlobalStateLock`.  This may change in
the future.

Usage conventions:

```rust
// property naming:
struct Foo {
    global_state_lock: GlobalStateLock,
}

// read guard naming:
let global_state = foo.global_state_lock.lock_guard().await;

// write guard naming:
let global_state_mut = foo.global_state_lock.lock_guard_mut().await;
```

These conventions make it easy to distinguish read access from write
access when reading and reviewing code.

When using a read-guard or write-guard, always drop it as soon as possible.
Failure to do so can result in poor concurrency or deadlock.

Deadlocks are generally not hard to track down.  Lock events are traced.
The app log records each `TryAcquire`, `Acquire` and `Release` event
when run with `RUST_LOG='info,neptune_core=trace'`.

If a deadlock has occurred, the log will end with a `TryAcquire` event
(read or write); just scroll up to find the previous `Acquire`-for-write
event to see which thread is holding the lock.

So yeah I think the GlobalStateLock is a good change. If nothing else, it makes a lot of code easier to read/maintain without all the interior locks. But if others don't agree, I am willing to do some more work to inform and hopefully justify it, such as creating/running some perf benchmarks... (though that would take time that could be spent on other things). Alternatively, some of the other changes could be split out into separate PRs. I also have some older code from the branch that encapsulates existing locks rather than using GlobalStateLock.

Notes:

Testing:

dan-da commented 6 months ago

I just pushed a fix for the prune_abandoned_monitored_utxos() issue discovered in testing.

Here are my notes about the investigation.

debug_prune_abandoned.txt

Sword-Smith commented 6 months ago

Regarding commit 9b9ff2c, which solves the deadlock issue in prune_abandoned_monitored_utxos, I found this comment to be helpful:

I see the issue. You shouldn’t be using futures::executor alongside with Tokio as neither executor (Tokio or futures) are aware of one another, which explains the blocking behavior you’re seeing.

I’ll also note that #[tokio::main] expands into tokio::Runtime::block_on, which means this code has two nested block_ons.

Good job solving this problem! It's been a thorn in our side for quite a while.

Sword-Smith commented 6 months ago

I can get my node running this version to crash like so:

$  neptune-cli prune-abandoned-monitored-utxos 
5 monitored UTXOs marked as abandoned

Then wait for the resync_membership_proofs task to be run by main_loop. This calls resync_membership_proofs_from_stored_blocks which panics on this check:

assert_eq!(block_msa, apply_block.body.next_mutator_set_accumulator);

Error message:

thread 'main' panicked at /home/thv/repos/neptune-core/src/models/state/mod.rs:809:17:                                                                                                                             
assertion `left == right` failed 

where the swbf_inactive fields of the two MS accumulators being compared disagree.

I believe this problem is a regression, as this is a crash I haven't seen before.

Edit: Most likely due to corrupted database. I managed to recover my balances by resetting the node. Let's not pursue this problem any further unless it arises again.

Sword-Smith commented 6 months ago

I can get my node running this version to crash like so: ...

I reset the node, with prunable UTXOs, and could not reproduce the problem. This is likely due to a corrupted database, and I'm not sure we should pursue this further for now.

Sword-Smith commented 6 months ago

LGTM