can't synchronize 2023-11-01

aszepieniec commented 8 months ago

After restarting the node from the crash, the node crashes again with the log ending in this:

2023-11-01T11:29:07.221667976Z DEBUG ThreadId(01) neptune_core::models::state::wallet::wallet_state: Block has 0 removal records
2023-11-01T11:29:07.221707216Z DEBUG ThreadId(01) neptune_core::models::state::wallet::wallet_state: Transaction has 0 inputs
2023-11-01T11:29:07.221759008Z DEBUG ThreadId(01) neptune_core::models::state::wallet::wallet_state: Number of mutated membership proofs: 0
2023-11-01T11:29:07.429363168Z DEBUG ThreadId(01) neptune_core::models::state::wallet::wallet_state: Number of unspent UTXOs: 2656
2023-11-01T11:29:07.586939683Z DEBUG ThreadId(01) neptune_core::main_loop: Flushed all databases
2023-11-01T11:29:07.587078226Z DEBUG ThreadId(01) neptune_core::main_loop: Timer: block-synchronization job
2023-11-01T11:29:07.587094596Z  INFO ThreadId(01) neptune_core::main_loop: Running sync
2023-11-01T11:29:07.587110247Z  WARN ThreadId(01) neptune_core::main_loop: Could not read current block. Aborting block synchronization

I had to kill the process with kill -9.

aszepieniec commented 8 months ago

After restarting the node again, it synchronizes without any issues.

dan-da commented 8 months ago

hmm, I took a quick look and I don't see anything obviously wrong. This line:

Could not read current block. Aborting block synchronization

is produced by:

        let current_block = match self.global_state.chain.light_state.latest_block.try_lock() {
            Ok(lock) => lock.to_owned(),

            // If we can't acquire lock on latest block header, don't block. Just exit and try again next
            // time.
            Err(_) => {
                warn!("Could not read current block. Aborting block synchronization");
                return Ok(());
            }
        };

latest_block is a tokio::sync::Mutex, and try_lock() never waits: it fails immediately if another task is holding the lock. In that case we log a warning and return Ok(()), and the caller just sets a timer and tries again.
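To illustrate that behaviour, here is a minimal sketch, not the actual neptune-core code. It assumes std::sync::Mutex as a stand-in for tokio::sync::Mutex so that it compiles without dependencies; both return an error from try_lock() right away when the lock is already held. The names SyncAttempt, try_sync and aborted_while_held are hypothetical and exist only for this example.

```rust
use std::sync::Mutex;

/// Outcome of one synchronization attempt (hypothetical type, illustration only).
#[derive(Debug, PartialEq)]
enum SyncAttempt {
    /// The lock was free; carries a copy of the protected value.
    Ran(u64),
    /// The lock was held elsewhere, so this attempt was skipped.
    Aborted,
}

/// Same shape as the quoted snippet: try to take the lock without waiting,
/// and give up for this round if it is already held.
fn try_sync(latest_block: &Mutex<u64>) -> SyncAttempt {
    match latest_block.try_lock() {
        Ok(guard) => SyncAttempt::Ran(*guard),
        // In the real code this is where the warning is logged.
        Err(_) => SyncAttempt::Aborted,
    }
}

/// Simulates `attempts` timer ticks while a guard is held the whole time
/// (a stand-in for a task that never releases the lock) and returns how
/// many of those ticks were aborted.
fn aborted_while_held(latest_block: &Mutex<u64>, attempts: usize) -> usize {
    let _guard = latest_block.lock().unwrap();
    (0..attempts)
        .filter(|_| try_sync(latest_block) == SyncAttempt::Aborted)
        .count()
}
```

With a free lock, try_sync returns Ran; while some holder keeps the guard, every tick comes back Aborted. So if the holder never lets go, the warning would repeat on every timer tick and sync would never make progress.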

I'm thinking perhaps the other task holding the lock is causing the problem. If/when it happens again, please check if there are any other abnormal log messages prior. Also I will be on the lookout for this in my node.

dan-da commented 8 months ago

if a deadlock issue, this might help.

dan-da commented 5 months ago

this is pretty old, and code has changed a lot. closing.

Neptune-Crypto / neptune-core

can't synchronize 2023-11-01 #68