Deadlock caused by CLI RPC call `prune-abandoned-monitored-utxos`?

Sword-Smith commented 9 months ago

A call to this CLI command seems to have caused a deadlock in my node and left it hanging.

2023-10-24T09:17:21.051751975Z  INFO ThreadId(02) RPC{rpc.trace_id=00 rpc.deadline=2023-10-24T09:17:31.051541029Z otel.kind="server"}: tarpc::server: ReceiveRequest
2023-10-24T09:17:21.052049314Z  INFO ThreadId(02) RPC{rpc.trace_id=00 rpc.deadline=2023-10-24T09:17:31.051541029Z otel.kind="server" otel.name="RPC.prune_abandoned_monitored_utxos"}: tarpc::server: BeginRequest

Nothing after the above line was outputted by the client, indicating a deadlock in the software.

dan-da commented 8 months ago

To avoid deadlock problems, perhaps we might consider using a crate like locktree or ordered_locks. There might be others. I found these with a quick search.

locktree (ab)uses Rust's type system to guarantee that locks are always taken in the same order. Locks under locktree's management are organized into a linear sequence. Locks can only be acquired by moving forward into this sequence, and locks must be released when moving back. Thus it is statically impossible for two threads to acquire the same set of locks in different orders, and so deadlocks are impossible as long as all locks are managed by locktree.

Here are the tests for locktree, that give an idea of usage. https://github.com/Ereski/locktree/tree/master/locktree/tests

edit: there is also the tracing_mutex crate. This would probably be simpler/faster to impl as a first step and might be all we need to resolve issues.

Mutexes can deadlock each other, but you can avoid this by always acquiring your locks in a consistent order. This crate provides tracing to ensure that you do.

This crate tracks a virtual “stack” of locks that the current thread holds, and whenever a new lock is acquired, a dependency is created from the last lock to the new one. These dependencies together form a graph. As long as that graph does not contain any cycles, your program is guaranteed to never deadlock.

edit: there is also the lock_ordering crate

edit: lockbud looks like a promising tool to statically detect out-of-order locking and more. The author has found problems in openethereum, grin, and other projects. I plan to give it a try.

dan-da commented 8 months ago

I installed and ran lockbud, which performs static (compile time) analysis. Here is the full results log.

it did not detect any problems in neptune-core. When analyzing triton-vm, it detected two possible atomicity violations in tokio fn that are called by triton-vm. Not a smoking gun, but worth further investigation.

Examples of atomicity violations can be found at: https://github.com/BurtonQin/lockbud/blob/all/Code/toys/atomic-violation/src/main.rs

These are all the dep crates that had possible issues: once_cell, getrandom, parking_lot, opentelemetry_sdk, tracing_subscriber, tokio, tokio_util.

If anyone ever needs to reproduce, the process is pretty easy. The steps were:

$ git clone https://github.com/BurtonQin/lockbud.git
$ cd lockbud
$ rustup component add rust-src
$ rustup component add rustc-dev
$ rustup component add llvm-tools-preview
$ cd Code
$ cargo install --path .

# neptune specific:

$ ln -s rust-toolchain.toml /path/to/neptune-core
$ ./detect.sh /path/to/neptune-core/ 2>&1 | tee neptune.log

dan-da commented 8 months ago

I am able to reproduce this reliably, though behavior varies a bit from that described @Sword-Smith.

Steps:

Start neptune-core running with command: RUST_BACKTRACE=1 RUST_LOG='info,neptune-core=warn,tarpc=warn' cargo run --bin neptune-core -- --mine --throttled-mining --peers 51.15.139.238:9798 --peers 139.162.193.206:9798
Wait until log indicates mining has begun.
Observe in top that CPU usage is well below 50%.
Execute neptune-cli prune-abandoned-monitored-utxos

Observations:

tarpc logs DeadLineExceeded.
neptune-cli prints Error: the request exceeded its deadline
top reports neptune-cli using 100% cpu.
Some neptune-cli rpc calls still succeed, eg: block-height, get-peer-info
Other rpc calls hang, eg: synced-balance, wallet-status
shutdown rpc call appears to succeed in neptune-cli output, however the server does not actually shutdown.
Server logs SIGINT and SIGTERM, but does not terminate.
Finally can be killed with killall -9 neptune-core.
I did not see the same log-lines as @Sword-Smith. Possibly due to different invocation. @Sword-Smith can you post the command you start neptune-core with?

Here is a sample log from the last of 3 runs:

2023-11-07T00:33:26.991327035Z  INFO ThreadId(01) neptune_core::main_loop: Performing peer discovery
2023-11-07T00:34:24.034572892Z  WARN ThreadId(01) neptune_core::models::state: Could not recover MSMP as transaction appears to be on an abandoned chain
2023-11-07T00:34:41.685048134Z ERROR ThreadId(03) tarpc::server::in_flight_requests: DeadlineExceeded
2023-11-07T00:36:45.765176191Z ERROR ThreadId(02) tarpc::server::in_flight_requests: DeadlineExceeded
2023-11-07T00:37:14.501046128Z ERROR ThreadId(03) tarpc::server::in_flight_requests: DeadlineExceeded
2023-11-07T00:37:26.624843189Z  INFO ThreadId(03) neptune_core::peer_loop: Got new block from peer 51.15.139.238:9798, block height 8914
2023-11-07T00:37:26.627328157Z  INFO ThreadId(03) neptune_core::models::blockchain::transaction::validity: validity logic for 'kernel_to_lock_scripts', 'kernel_to_type_scripts', 'type_scripts_halt' not implemented yet.
2023-11-07T00:37:26.652194487Z  INFO ThreadId(03) neptune_core::peer_loop: Block with height 8914 is valid
2023-11-07T00:37:26.652287726Z  INFO ThreadId(03) neptune_core::peer_loop: Updated block info by block from peer. block height 8914
2023-11-07T00:37:27.947861033Z  INFO ThreadId(03) neptune_core::peer_loop: Got new block from peer 139.162.193.206:9798, block height 8914
2023-11-07T00:37:27.950371622Z  INFO ThreadId(03) neptune_core::models::blockchain::transaction::validity: validity logic for 'kernel_to_lock_scripts', 'kernel_to_type_scripts', 'type_scripts_halt' not implemented yet.
2023-11-07T00:37:27.974812671Z  INFO ThreadId(03) neptune_core::peer_loop: Block with height 8914 is valid
2023-11-07T00:37:27.974893116Z  INFO ThreadId(03) neptune_core::peer_loop: Updated block info by block from peer. block height 8914
^C2023-11-07T00:38:43.449094383Z  INFO ThreadId(03) neptune_core::main_loop: Received SIGINT

Analysis:

Given that this is so far 100% reproduceable, it should be feasible to track down.
The 100% CPU usage appears to indicate the code is stuck in a loop somewhere, rather than deadlocked.

Sword-Smith commented 6 months ago

Fixed by @dan-da in 1a97899191215b1af385a2ed047829dfa6fe3f29

Neptune-Crypto / neptune-core

Deadlock caused by CLI RPC call `prune-abandoned-monitored-utxos`? #64