Neptune-Crypto / neptune-core

anonymous peer-to-peer cash
Apache License 2.0
28 stars 8 forks source link

Limit Writes to Main Thread #201

Open Sword-Smith opened 1 month ago

Sword-Smith commented 1 month ago

Currently, the RPC task is allowed (by convention) to acquire write locks on the wallet state. This generates the subtle issue: what if writing fails?

Currently:

            // ensure we write new wallet state out to disk.
            gsm.persist_wallet().await.expect("flushed wallet");

it only crashes the RPC task and not the entire program. If writes fail, we should crash the entire program.

Limiting write permission to the main thread would enable the main thread to terminate the program when it observes a write fail.

This change would imply that every time the RPC task acquires a write lock, it must instead send a write message to the main task through its RPCServerToMain channel.

Co-authored-by: @aszepieniec

dan-da commented 1 month ago

Some observations / details:

  1. moving all disk-writes to main task adds complexity. (and cognitive load). How does RPC task ensure each disk write succeeds for example?

  2. RPC-server test-cases as presently written do not execute the main task, only a dummy. So such inter-task dependencies do not get executed by the tests. I ran into this when testing the send() RPC and found that Tx were not being added to the mempool because that was supposed to be done by the main task.

  3. if the goal is to panic/shutdown on disk-write error, that can be achieved without moving writes to main thread. When a tokio task panics this can be detected by the corresponding JoinHandle. So the parent/monitoring task can: (a) ignore the panic (b) handle it somehow and continue, (c) raise the panic and/or shutdown. Presently we do (a) because the select!() in main_loop doesn't monitor the tasks in task_handles array. But we could change that, so that a panic in any sub-task would cause entire program to halt. I would view that approach as preferable. I don't think we should be ignoring panics, for whatever reason.

4) shutting down on disk-write failure with a useful log message is probably a reasonable thing to do for now. However this can still leave DBs in an inconsistent state, if write to file #1 succeeds and then write to file #2 fails. Long term, I'd like to see true atomic operations, ie changes for a given logical operation are either 100% written or not at all.