IntersectMBO / ouroboros-consensus

Implementation of a Consensus Layer for the Ouroboros family of protocols
https://ouroboros-consensus.cardano.intersectmbo.org
Apache License 2.0
31 stars 22 forks source link

Nodes stuck in forging loop #1193

Closed jasagredo closed 4 weeks ago

jasagredo commented 1 month ago

Analyzing the logs from the two nodes that stoped forging, it seems clear that they both stopped in the following chunk of code:

      trace $ TraceForgeTickedLedgerState currentSlot bcPrevPoint

      -- Get a snapshot of the mempool that is consistent with the ledger
      --
      -- NOTE: It is possible that due to adoption of new blocks the
      -- /current/ ledger will have changed. This doesn't matter: we will
      -- produce a block that fits onto the ledger we got above; if the
      -- ledger in the meantime changes, the block we produce here may or
      -- may not be adopted, but it won't be invalid.
      (mempoolHash, mempoolSlotNo) <- lift $ atomically $ do
        snap <- getSnapshot mempool   -- only used for its tip-like information
        let h :: ChainHash blk
            h = castHash $ snapshotTipHash snap
        pure (h, snapshotSlotNo snap)

      let readTables = fmap castLedgerTables . roforkerReadTables forker . castLedgerTables

      mempoolSnapshot <- lift $ getSnapshotFor
                                  mempool
                                  currentSlot
                                  tickedLedgerState
                                  readTables

      lift $ roforkerClose forker

      let txs = map fst $ snapshotTxs mempoolSnapshot

      -- force the mempool's computation before the tracer event
      _ <- evaluate (length txs)
      _ <- evaluate (snapshotTipHash mempoolSnapshot)

      trace $ TraceForgingMempoolSnapshot currentSlot bcPrevPoint mempoolHash mempoolSlotNo

The Mempool uses a TMVar to be able to run actions in IO to access the tables, however if an exception was raised at some point the TMVar might not have been properly put back. It seems unclear what exception could be thrown there, but the result seems pretty clear: the TMVar is not put back and the node cannot continue.

jasagredo commented 1 month ago

Analyzing the given logs we found an error call being thrown:

2407853-{"at":"2024-07-23T04:39:52.549336219Z","ns":"TxSubmission.TxInbound.Collected","data":{"count":1,"kind":"TraceTxSubmissionCollected","peer":{"connectionId":"172.31.82.173:30004 52.59.156.159:59440"}},"sev":"Debug","thread":"15279","host":"client-us-02"}
2407854:{"at":"2024-07-23T04:39:52.549589434Z","ns":"Net.InboundGovernor.Remote.ResponderErrored","data":{"connectionId":{"localAddress":{"address":"172.31.82.173","port":"30004"},"remoteAddress":{"address":"52.202.221.142","port":"30007"}},"kind":"ResponderErrored","miniProtocolNum":{"kind":"MiniProtocolNum","num":4},"reason":"InMemoryClosedExn\nCallStack (from HasCallStack):\n  error, called at src/ouroboros-consensus/Ouroboros/Consensus/Storage/LedgerDB/V2/InMemory.hs:75:45 in ouroboros-consensus-0.20.0.0-1Z2ZJe3BSfK6gl0UefXzYq:Ouroboros.Consensus.Storage.LedgerDB.V2.InMemory"},"sev":"Info","thread":"119","host":"client-us-02"}

The miniprotocol in question is TxSumbission, so it makes sense that if it errored then the TMVar was not put back.

Some consensus code is using a LedgerTablesHandle that has been closed. Question is now, where is this handle, who is using it after close.

jasagredo commented 4 weeks ago

Closed as complete via #1208