IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0
3.06k stars 721 forks source link

[potential BUG] - TraceDidntAdoptBlock for 5 minutes #4720

Closed gufmar closed 1 year ago

gufmar commented 1 year ago

External

Area Other Any other topic (Delegation, Ranking, ...).

Summary At 2022-12-07 07:17:15am GMT from slot 78831144 and block 8107959 on, there was an extraordinary break in block production.
The next block was adopted at 07:21:59am GMT for slot 78831428 and block https://cardanoscan.io/block/8107960 So 284 Seconds or almost 5 minutes without a new block adopted on chain. Multiple SPOs reported logs from the producing node, having a successfull TraceNodeIsLeader > TraceForgedBlock > AddedToCurrentChain but then followed by an unexpected TraceDidntAdoptBlock

System info (please complete the following information): Different blockproducing nodes running on v1.35.3 or .4 on mainnet

Screenshots and attachments Reports from at least 3 Pool operators with forged but not adopted blocks can be found at https://forum.cardano.org/t/no-blocks-during-the-epoch-for-almost-5-minutes/111280/32

Additional context https://input-output-hk.github.io/ouroboros-network/ouroboros-consensus/Ouroboros-Consensus-NodeKernel.html > TraceForgedBlock > TraceDidntAdoptBlock

So this is not an rarely expectable long slot gap, caused by VRF dice probability. The pools had scheduled slots. they were up, running and actually forged the block. but whatever unknown reason caused them to finally not adopt it even localy. Hence also not anounced to the network. All this happened in a row of 5 minutes, or roughly ~15 blocks. It is unclear what caused it, and what happened then to get out of this situation.

abailly-iohk commented 1 year ago

Looking at the code for my own education I could trace the possible origin of this message to be in the GC process. The trace is emitted when getIsInvalidBlock returns Nothing, and this function is defined as:

 fmap (fmap (fmap invalidBlockReason) . flip Map.lookup) <$> readTVar cdbInvalid

and cdbInvalid is only modified in the garbageCollect function. So it's as if the newly created block had been garbage collected before it could be propagated?

abailly-iohk commented 1 year ago

@gufmar Do you have a PerformedGC trace somewhere around this TraceDidntAdoptBlock?

nfrisby commented 1 year ago

As far as I can tell, that forum thread has five blocks of interest.

(by ascending slot)

hash prefix slot blockno predecessor slot as UTC time full hash
9c1 78831128 8107958 don't care don't care 9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac
14a 78831144 8107959 9c1 2022-12-07T-07:17:15.00Z 14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c
15c 78831181 8107959 9c1 2022-12-07T-07:17:52.00Z 15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803
0a2 78831351 8107959 9c1 2022-12-07T-07:20:42.00Z 0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139
bbe 78831428 8107960 14a 2022-12-07T-07:21:59.00Z bbe6c6301d467bb8277e7c461782c8bc34ea62eefd893ecd1b747a0522ac58af

9c1, 14a, and bbe are on the mainnet chain. The alarm begins because 14a and bbe are 4m44s apart, which is "big" for mainnet Cardano. We have two typical hypotheses for some gap might be that big.

A third (bonus 😄) factor is the timeouts.

    mustReplyTimeout <- Just <$> randomElem [90, 135, 180, 224, 269]

If no new block arises before that timeout (randomized per upstream connection), then we disconnect from that peer. Depending on re-connection delays etc, that might amplify the duration of some stall in the chain growth.


Thanks to those sharing logs, we do have some more specific information in this case 👍

In particular, we have two reports of TraceDidntAdoptBlock.

Relevant facts:

So, it could be the case that only 15c and 0a2 were minted during the gap (ie no one else was elected). If they had extended 14a instead of 9c1, then the gap would be smaller.

But! Why didn't 15c and 0a2 extend 14a instead of its predecessor 9c1?

I am puzzled. Seems unlikely 14a would happen to arrive at just the wrong time at two nodes, 170 seconds apart for this to happen.

Other possibilities:


I noticed this while writing the above.

Looking more closely at Ada4Good's logs now, I actually see it select 14a between TraceStartLeadershipCheck and TraceDidntAdoptBlock!

(I remain as mystified as ever about the duplication of AddedToCurrentChain and the absurd elision count 😬.)

TraceStartLeadershipCheck   {"app":[],"at":"2022-12-07T07:20:42.00Z","data":{"kind":"TraceStartLeadershipCheck","slot":78831351},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"82143","sev":"Info","thread":"375"}
TraceNodeIsLeader       {"app":[],"at":"2022-12-07T07:20:42.03Z","data":{"val":{"kind":"TraceNodeIsLeader","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Info","thread":"375"}
TraceForgedBlock        {"app":[],"at":"2022-12-07T07:20:42.05Z","data":{"val":{"block":"0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139","blockNo":8107959,"blockPrev":"9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac","kind":"TraceForgedBlock","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Info","thread":"375"}
AddedToCurrentChain [14a]   {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
<elisions>          {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"kind":"LogValue","name":"before next, messages elided","value":{"contents":427653286419604,"tag":"PureI"}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
AddedToCurrentChain [14a]   {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
TrySwitchToAFork [0a2]      {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"block":{"hash":"0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139","kind":"Point","slot":78831351},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Info","thread":"365"}
TraceDidntAdoptBlock        {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"val":{"kind":"TraceDidntAdoptBlock","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Error","thread":"375"}

So: this is exactly the anticipated reason for ever seeing TraceDidntAdoptBlock. The node started its leadership check, chose to extend the tip of the chain (ie 9c1), minted that block (ie 0a2), but finished validating and selected a different better block (ie 14a) before its newly minted block had a chance to be considered.

What remains odd is that it finished validating 14a 207 seconds after 14a was minted and happened to do so exactly during its leadership attempt, which lasted between 60 and 70 milliseconds? Seems like real bad luck if there turns out to be no systematic correlation here after all...

I don't see this in the logs that BBHMM submitted. But I also don't see TrySwitchToAFork. So either they grep-ed out intervening messages, or their node took the alternate TryAddToCurrentChain path (which has Debug severity as opposed to TrySwitchToAFork's Info severity, since it's the typical path. Although leading is so rare, perhaps we should make it Info too, just for this kind of examination) and then they wouldn't fit this hypothesis.

But the two leadership tests were 170 seconds apart; seems very unlikely that two nodes would be validating 14a at such different times and also exactly during their leadership checks.

chadle-git commented 1 year ago

Hi @nfrisby, Thank you for analyzing. I found the same type scenario in my logs. Let me know if you need any more data that can help with this.

TraceStartLeadershipCheck {"app":[],"at":"2022-12-07T07:17:52.00Z","data":{"chainDensity":4.78205e-2,"credentials":"Cardano","delegMapSize":1237930,"kind":"TraceStartLeadershipCheck","slot":78831181,"utxoSize":9404848},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"1429","sev":"Info","thread":"214"}

TraceNodeIsLeader {"app":[],"at":"2022-12-07T07:17:52.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"}

TraceForgedBlock {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"credentials":"Cardano","val":{"block":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","blockNo":8107959,"blockPrev":"9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac","kind":"TraceForgedBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"}

AddedToCurrentChain {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Notice","thread":"204"}

TrySwitchToAFork {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"block":{"hash":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","kind":"Point","slot":78831181},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Info","thread":"204"}

TraceDidntAdoptBlock {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"credentials":"Cardano","val":{"kind":"TraceDidntAdoptBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Error","thread":"214"}

Straightpool commented 1 year ago

I have observed the same for the first time this epoch. Will update when I learn more from the logs (BP and both relays).

First snipped from the BP, anyone saw the same error at the same time frame?

{"app":[],"at":"2023-01-02T05:13:57.00Z","data":{"chainDensity":4.7301885e-2,"credentials":"Cardano","delegMapSize":1251425,"kind":"TraceStartLeadershipCheck","slot":81070146,"utxoSize":9553777},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"833","sev":"Info","thread":"353"} {"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Info","thread":"353"} {"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"credentials":"Cardano","val":{"block":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","blockNo":8217108,"blockPrev":"9306145db376fd2aa74ea1bb24ed915f2f69a66d40dfdef2f68e82b53141a337","kind":"TraceForgedBlock","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Info","thread":"353"} {"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"block":{"hash":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","kind":"Point","slot":81070146},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"833","sev":"Info","thread":"343"} {"app":[],"at":"2023-01-02T05:13:57.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceDidntAdoptBlock","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Error","thread":"353"}

Jimbo4350 commented 1 year ago

Please re-open this issue in https://github.com/input-output-hk/ouroboros-network

chadle-git commented 1 year ago

Opened Issue https://github.com/input-output-hk/ouroboros-network/issues/4252