Closed dan-da closed 6 months ago
I just pushed a fix for the prune_abandoned_monitored_utxos()
issue discovered in testing.
Here are my notes about the investigation.
Regarding that commit 9b9ff2c that solves the deadlock issue in prune_abandoned_monitored_utxos
, I found this comment to be helpful:
I see the issue. You shouldn’t be using futures::executor alongside with Tokio as neither executor (Tokio or futures) are aware of one another, which explains the blocking behavior you’re seeing.
I’ll also note that #[tokio::main] expands into tokio::Runtime::block_on, which means this code has two nested block_ons.
Good job solving this problem! It's been a thorn side for quite a while.
I can get my node running this version to crash like so:
$ neptune-cli prune-abandoned-monitored-utxos
5 monitored UTXOs marked as abandoned
Then wait for the resync_membership_proofs
task to be run by main_loop
. This calls resync_membership_proofs_from_stored_blocks
which panics on this check:
assert_eq!(block_msa, apply_block.body.next_mutator_set_accumulator);
Error message:
thread 'main' panicked at /home/thv/repos/neptune-core/src/models/state/mod.rs:809:17:
assertion `left == right` failed
where the swbf_inactive
field disagree in the two MS accumulators being compared.
I believe this problem is a regression, as this is a crash I haven't seen before.
Edit: Most likely due to corrupted database. I managed to recover my balances by resetting the node. Let's not pursue this problem any further unless it arises again.
I can get my node running this version to crash like so: ...
I reset the node, with prunable UTXOs, and could not reproduce the problem. This is likely due to a corrupted database, and I'm not sure we should pursue this further for now.
LGTM
First off, I apologize for the size of this PR. I normally keep PRs as small/focused as I can.
I started a branch while working on twenty_first storage in order to test those changes in neptune. As I iterated on storage that led to the AtomicRw lock in twenty-first, which led to tokio AtomicRw in neptune. At first I was replacing and refactoring (encapsulating) the existing locks and I wanted to be able to log/trace their activity, so I added a facility for that. As I dug deeper I found that we still couldn't really guarantee concurrent write serialization at the highest level ie across Blockchain, Mempool, Wallet state. So I experimented with a lock around GlobalState, and it appears to work well and also simplify code in the other State types as well. Most other changes were sort of incidental along the way, as I was deep into refactoring.
Here are the high level changes:
GlobalStateLock
and use it for all state locking.AtomicRw
andAtomicMutex
async lock types.RustyLevelDbAsync
async-friendly DB wrapperLightState
, now just a type alias for BlockBlockChainState
to distinguish Archival vs Light modecargo doc
runs without warningsneptune-dashboard
Many of these changes are inter-related. However, if requested I could pull out some things into smaller, easier to review PRs.
The GlobalState lock is the most significant architectural change and so may need the most justification. Here's a comment from the code that describes and makes a case for it:
So yeah I think the GlobalStateLock is a good change. If nothing else, it makes a lot of code easier to read/maintain without all the interior locks. But if others don't agree, I am willing to do some more work to inform and hopefully justify it, such as creating/running some perf benchmarks... (though that would take time that could be spent on other things). Alternatively, some of the other changes could be split out into separate PRs. I also have some older code from the branch that encapsulates existing locks rather than using
GlobalStateLock
.Notes:
0_36_backports
in twenty-first repo.Testing: