Closed Sword-Smith closed 6 months ago
To avoid deadlock problems, perhaps we might consider using a crate like locktree or ordered_locks. There might be others. I found these with a quick search.
locktree (ab)uses Rust's type system to guarantee that locks are always taken in the same order. Locks under locktree's management are organized into a linear sequence. Locks can only be acquired by moving forward into this sequence, and locks must be released when moving back. Thus it is statically impossible for two threads to acquire the same set of locks in different orders, and so deadlocks are impossible as long as all locks are managed by locktree.
Here are the tests for locktree, that give an idea of usage. https://github.com/Ereski/locktree/tree/master/locktree/tests
edit: there is also the tracing_mutex crate. This would probably be simpler/faster to impl as a first step and might be all we need to resolve issues.
Mutexes can deadlock each other, but you can avoid this by always acquiring your locks in a consistent order. This crate provides tracing to ensure that you do.
This crate tracks a virtual “stack” of locks that the current thread holds, and whenever a new lock is acquired, a dependency is created from the last lock to the new one. These dependencies together form a graph. As long as that graph does not contain any cycles, your program is guaranteed to never deadlock.
edit: there is also the lock_ordering crate
edit: lockbud looks like a promising tool to statically detect out-of-order locking and more. The author has found problems in openethereum, grin, and other projects. I plan to give it a try.
I installed and ran lockbud, which performs static (compile time) analysis. Here is the full results log.
it did not detect any problems in neptune-core. When analyzing triton-vm, it detected two possible atomicity violations in tokio fn that are called by triton-vm. Not a smoking gun, but worth further investigation.
Examples of atomicity violations can be found at: https://github.com/BurtonQin/lockbud/blob/all/Code/toys/atomic-violation/src/main.rs
These are all the dep crates that had possible issues: once_cell, getrandom, parking_lot, opentelemetry_sdk, tracing_subscriber, tokio, tokio_util.
If anyone ever needs to reproduce, the process is pretty easy. The steps were:
$ git clone https://github.com/BurtonQin/lockbud.git
$ cd lockbud
$ rustup component add rust-src
$ rustup component add rustc-dev
$ rustup component add llvm-tools-preview
$ cd Code
$ cargo install --path .
# neptune specific:
$ ln -s rust-toolchain.toml /path/to/neptune-core
$ ./detect.sh /path/to/neptune-core/ 2>&1 | tee neptune.log
I am able to reproduce this reliably, though behavior varies a bit from that described @Sword-Smith.
Steps:
Start neptune-core running with command:
RUST_BACKTRACE=1 RUST_LOG='info,neptune-core=warn,tarpc=warn' cargo run --bin neptune-core -- --mine --throttled-mining --peers 51.15.139.238:9798 --peers 139.162.193.206:9798
Wait until log indicates mining has begun.
Observe in top
that CPU usage is well below 50%.
Execute neptune-cli prune-abandoned-monitored-utxos
Observations:
Error: the request exceeded its deadline
top
reports neptune-cli using 100% cpu.killall -9 neptune-core
.Here is a sample log from the last of 3 runs:
2023-11-07T00:33:26.991327035Z INFO ThreadId(01) neptune_core::main_loop: Performing peer discovery
2023-11-07T00:34:24.034572892Z WARN ThreadId(01) neptune_core::models::state: Could not recover MSMP as transaction appears to be on an abandoned chain
2023-11-07T00:34:41.685048134Z ERROR ThreadId(03) tarpc::server::in_flight_requests: DeadlineExceeded
2023-11-07T00:36:45.765176191Z ERROR ThreadId(02) tarpc::server::in_flight_requests: DeadlineExceeded
2023-11-07T00:37:14.501046128Z ERROR ThreadId(03) tarpc::server::in_flight_requests: DeadlineExceeded
2023-11-07T00:37:26.624843189Z INFO ThreadId(03) neptune_core::peer_loop: Got new block from peer 51.15.139.238:9798, block height 8914
2023-11-07T00:37:26.627328157Z INFO ThreadId(03) neptune_core::models::blockchain::transaction::validity: validity logic for 'kernel_to_lock_scripts', 'kernel_to_type_scripts', 'type_scripts_halt' not implemented yet.
2023-11-07T00:37:26.652194487Z INFO ThreadId(03) neptune_core::peer_loop: Block with height 8914 is valid
2023-11-07T00:37:26.652287726Z INFO ThreadId(03) neptune_core::peer_loop: Updated block info by block from peer. block height 8914
2023-11-07T00:37:27.947861033Z INFO ThreadId(03) neptune_core::peer_loop: Got new block from peer 139.162.193.206:9798, block height 8914
2023-11-07T00:37:27.950371622Z INFO ThreadId(03) neptune_core::models::blockchain::transaction::validity: validity logic for 'kernel_to_lock_scripts', 'kernel_to_type_scripts', 'type_scripts_halt' not implemented yet.
2023-11-07T00:37:27.974812671Z INFO ThreadId(03) neptune_core::peer_loop: Block with height 8914 is valid
2023-11-07T00:37:27.974893116Z INFO ThreadId(03) neptune_core::peer_loop: Updated block info by block from peer. block height 8914
^C2023-11-07T00:38:43.449094383Z INFO ThreadId(03) neptune_core::main_loop: Received SIGINT
Analysis:
Fixed by @dan-da in 1a97899191215b1af385a2ed047829dfa6fe3f29
A call to this CLI command seems to have caused a deadlock in my node and left it hanging.
Nothing after the above line was outputted by the client, indicating a deadlock in the software.