Closed gufmar closed 1 year ago
Looking at the code for my own education, I could trace the possible origin of this message to the GC process. The trace is emitted when `getIsInvalidBlock` returns `Nothing`, and this function is defined as:

```haskell
fmap (fmap (fmap invalidBlockReason) . flip Map.lookup) <$> readTVar cdbInvalid
```

And `cdbInvalid` is only modified in the `garbageCollect` function. So it's as if the newly created block had been garbage collected before it could be propagated?
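To unpack the nested `fmap`s in that definition: reading `cdbInvalid` yields a map of known-invalid blocks, and `getIsInvalidBlock` returns a lookup function from a block's hash to `Maybe` its invalidity reason. A simplified pure sketch, with hypothetical stand-in types (the real `ChainDB` types differ):

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins for the real ChainDB types.
type HeaderHash = String

newtype InvalidBlockInfo = InvalidBlockInfo { invalidBlockReason :: String }

-- Pure analogue of the STM code above: given the current contents of
-- cdbInvalid, build the lookup function that getIsInvalidBlock returns.
-- `flip Map.lookup invalid` maps a hash to `Maybe InvalidBlockInfo`,
-- and the inner `fmap` projects out the reason.
mkIsInvalidBlock :: Map.Map HeaderHash InvalidBlockInfo
                 -> (HeaderHash -> Maybe String)
mkIsInvalidBlock invalid = fmap invalidBlockReason . flip Map.lookup invalid
```

The `TraceDidntAdoptBlock` path corresponds to this function returning `Nothing` for the freshly minted block's hash, i.e. the block is not known to be invalid.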
@gufmar Do you have a `PerformedGC` trace somewhere around this `TraceDidntAdoptBlock`?
As far as I can tell, that forum thread has five blocks of interest (by ascending slot):
| hash prefix | slot | blockno | predecessor | slot as UTC time | full hash |
|---|---|---|---|---|---|
| 9c1 | 78831128 | 8107958 | don't care | don't care | 9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac |
| 14a | 78831144 | 8107959 | 9c1 | 2022-12-07T07:17:15.00Z | 14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c |
| 15c | 78831181 | 8107959 | 9c1 | 2022-12-07T07:17:52.00Z | 15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803 |
| 0a2 | 78831351 | 8107959 | 9c1 | 2022-12-07T07:20:42.00Z | 0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139 |
| bbe | 78831428 | 8107960 | 14a | 2022-12-07T07:21:59.00Z | bbe6c6301d467bb8277e7c461782c8bc34ea62eefd893ecd1b747a0522ac58af |
9c1, 14a, and bbe are on the mainnet chain. The alarm began because 14a and bbe are 4m44s apart, which is "big" for mainnet Cardano. We have two typical hypotheses for why a gap might be that big.

1. The (random!) leader schedule happened to be empty during that interval. The chance of any given gap being >= 5 minutes is (0.95)^(5*60) = 2.075e-7. But there are a lot of slots in a year. This next arithmetic naively ignores overlap, but the rough order of magnitude is: 2.075e-7 * (365*24*60*60) = approx 6.5 times per year.
2. Some ledger calculation (accidentally) took too long, interfering by starving the leadership-checking thread of its access to the CPU. These blocks are around 8% of the way through the epoch; I don't know off the top of my head whether we have any known computation spikes around that slot.
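The arithmetic in the first hypothesis can be sanity-checked with a short sketch (assuming mainnet's active slot coefficient f = 0.05, i.e. each 1-second slot is empty with probability 0.95, and, as above, naively ignoring overlap between candidate gap positions):

```haskell
-- Probability that a given run of `seconds` consecutive slots is empty.
emptyGapProb :: Int -> Double
emptyGapProb seconds = 0.95 ^ seconds

-- Naive expected number of such gaps per year (ignoring overlap):
-- one candidate gap per slot, ~31.5 million slots per year.
expectedGapsPerYear :: Int -> Double
expectedGapsPerYear seconds =
  emptyGapProb seconds * fromIntegral (365 * 24 * 60 * 60 :: Int)

-- emptyGapProb 300        ~ 2.08e-7
-- expectedGapsPerYear 300 ~ 6.5
```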
A third (bonus 😄) factor is the timeouts:

```haskell
mustReplyTimeout <- Just <$> randomElem [90, 135, 180, 224, 269]
```

If no new block arrives before that timeout (randomized per upstream connection), then we disconnect from that peer. Depending on re-connection delays etc., that might amplify the duration of a stall in chain growth.
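`randomElem` presumably just picks one element uniformly; here is a hypothetical sketch (not the node's actual implementation), with the randomness source abstracted out so the choice can be made deterministic:

```haskell
-- Hypothetical sketch of randomElem: `pickIndex n` is expected to
-- return a uniformly random index in [0, n).
randomElemWith :: Monad m => (Int -> m Int) -> [a] -> m a
randomElemWith pickIndex xs = do
  i <- pickIndex (length xs)
  pure (xs !! i)
```

With a real generator, `pickIndex` would be something like `\n -> randomRIO (0, n - 1)`; each upstream connection then gets its own timeout drawn from [90, 135, 180, 224, 269].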
Thanks to those sharing logs, we do have some more specific information in this case 👍
In particular, we have two reports of `TraceDidntAdoptBlock`:

- BBHMM gave us 15c (on 1.35.3:950c4, I think?): https://forum.cardano.org/t/no-blocks-during-the-epoch-for-almost-5-minutes/111280/29
- Ada4Good gave us 0a2 (on v1.35.4:ebc7b, I think?): https://forum.cardano.org/t/no-blocks-during-the-epoch-for-almost-5-minutes/111280/34
(There's also a third "lost block" from staking4ada.org, but no logs etc. They were on v1.35.4 and recall seeing gLiveView say "invalid"; I don't know what to soundly infer from that, though. They did include Grafana graphs that showed very low CPU usage the whole time.)
Relevant facts:

- 14a's VRF output begins with `04...`, which means it probably has a stronger tiebreaker than 15c and 0a2. I doubt the logs would include the VRF output of 15c and 0a2 to confirm this, unfortunately. (Those two leading hexdigits are only greater than four other possible hexdigit pairs out of the 255 non-equivalent pairs.)
- Both logs go from `TraceStartLeadershipCheck` to `TraceDidntAdoptBlock` in less than 70ms and 60ms respectively. That means the blocks they forged were processed by the ChainDB quicker than that. Note that processed does not necessarily mean validated: if a better block is already in there, the ChainDB won't waste time validating the new block. But it's not unusual for a block to be validated in less than 100ms.

So, it could be the case that only 15c and 0a2 were minted during the gap (ie no one else was elected). If they had extended 14a instead of 9c1, then the gap would be smaller.
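A trivial check of that last count (under the implicit assumption above that a numerically lower VRF output wins the tiebreak): a leading hexdigit pair of `04` exceeds only four of the 255 other possible pairs.

```haskell
-- Of the hexdigit pairs (byte values 0x00..0xFF), count how many 0x04
-- is strictly greater than: just 0x00, 0x01, 0x02, and 0x03.
pairsBeatenBy04 :: Int
pairsBeatenBy04 = length [b | b <- [0x00 .. 0xFF :: Int], b < 0x04]
```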
But! Why didn't 15c and 0a2 extend 14a instead of its predecessor 9c1? The `TraceDidntAdoptBlock` message suggests a better 8107959 block was already in the ChainDB when these new blocks were added to it. I am puzzled: it seems unlikely that 14a would happen to arrive at just the wrong time at two nodes, 170 seconds apart, for this to happen.
Other possibilities:

- An invalid transaction made it through our defense, but `getIsInvalidBlock` was somehow not updated, so the minting thread reported `TraceDidntAdoptBlock` instead of `TraceForgedInvalidBlock`? Could "it took that long to flush the bad transaction" explain the gap size? (Assuming there are more like BBHMM and Ada4Good who just haven't been vocal.) This would require that 14a hadn't arrived at BBHMM within 37 seconds, which is possible but not typical, and that there was an invalid transaction in the new block, which seems unlikely. Or maybe another invalid 8107959 block that isn't showing up in these logs?
- Something is causing alignment between block validations and leadership checks, such that the leadership thread chooses its new block's predecessor, then a latest-greatest block is added to the ChainDB, then the leadership thread adds its newly minted block (which happens to lose a tiebreaker against the latest-greatest).
I noticed this while writing the above. Looking more closely at Ada4Good's logs now, I actually see it select 14a between `TraceStartLeadershipCheck` and `TraceDidntAdoptBlock`!

(I remain as mystified as ever about the duplication of `AddedToCurrentChain` and the absurd elision count 😬.)
```
TraceStartLeadershipCheck {"app":[],"at":"2022-12-07T07:20:42.00Z","data":{"kind":"TraceStartLeadershipCheck","slot":78831351},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"82143","sev":"Info","thread":"375"}
TraceNodeIsLeader {"app":[],"at":"2022-12-07T07:20:42.03Z","data":{"val":{"kind":"TraceNodeIsLeader","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Info","thread":"375"}
TraceForgedBlock {"app":[],"at":"2022-12-07T07:20:42.05Z","data":{"val":{"block":"0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139","blockNo":8107959,"blockPrev":"9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac","kind":"TraceForgedBlock","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Info","thread":"375"}
AddedToCurrentChain [14a] {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
<elisions> {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"kind":"LogValue","name":"before next, messages elided","value":{"contents":427653286419604,"tag":"PureI"}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
AddedToCurrentChain [14a] {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
TrySwitchToAFork [0a2] {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"block":{"hash":"0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139","kind":"Point","slot":78831351},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Info","thread":"365"}
TraceDidntAdoptBlock {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"val":{"kind":"TraceDidntAdoptBlock","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Error","thread":"375"}
```
So: this is exactly the anticipated reason for ever seeing `TraceDidntAdoptBlock`. The node started its leadership check, chose to extend the tip of the chain (ie 9c1), minted that block (ie 0a2), but finished validating and selected a different, better block (ie 14a) before its newly minted block had a chance to be considered.

What remains odd is that it finished validating 14a 207 seconds after 14a was minted, and happened to do so exactly during its leadership attempt, which lasted between 60 and 70 milliseconds. That seems like real bad luck if there turns out to be no systematic correlation here after all...
I don't see this in the logs that BBHMM submitted. But I also don't see `TrySwitchToAFork`. So either they `grep`-ed out intervening messages, or their node took the alternate `TryAddToCurrentChain` path (which has Debug severity as opposed to `TrySwitchToAFork`'s Info severity, since it's the typical path; although leading is so rare, perhaps we should make it Info too, just for this kind of examination), in which case they wouldn't fit this hypothesis.

But the two leadership tests were 170 seconds apart; it seems very unlikely that two nodes would be validating 14a at such different times and also exactly during their leadership checks.
Hi @nfrisby, thank you for analyzing. I found the same type of scenario in my logs. Let me know if you need any more data that could help with this.
```
TraceStartLeadershipCheck {"app":[],"at":"2022-12-07T07:17:52.00Z","data":{"chainDensity":4.78205e-2,"credentials":"Cardano","delegMapSize":1237930,"kind":"TraceStartLeadershipCheck","slot":78831181,"utxoSize":9404848},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"1429","sev":"Info","thread":"214"}
TraceNodeIsLeader {"app":[],"at":"2022-12-07T07:17:52.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"}
TraceForgedBlock {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"credentials":"Cardano","val":{"block":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","blockNo":8107959,"blockPrev":"9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac","kind":"TraceForgedBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"}
AddedToCurrentChain {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Notice","thread":"204"}
TrySwitchToAFork {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"block":{"hash":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","kind":"Point","slot":78831181},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Info","thread":"204"}
TraceDidntAdoptBlock {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"credentials":"Cardano","val":{"kind":"TraceDidntAdoptBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Error","thread":"214"}
```
I have observed the same for the first time this epoch. Will update when I learn more from the logs (BP and both relays). Here is a first snippet from the BP; did anyone see the same error in the same time frame?
```
{"app":[],"at":"2023-01-02T05:13:57.00Z","data":{"chainDensity":4.7301885e-2,"credentials":"Cardano","delegMapSize":1251425,"kind":"TraceStartLeadershipCheck","slot":81070146,"utxoSize":9553777},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"credentials":"Cardano","val":{"block":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","blockNo":8217108,"blockPrev":"9306145db376fd2aa74ea1bb24ed915f2f69a66d40dfdef2f68e82b53141a337","kind":"TraceForgedBlock","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"block":{"hash":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","kind":"Point","slot":81070146},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"833","sev":"Info","thread":"343"}
{"app":[],"at":"2023-01-02T05:13:57.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceDidntAdoptBlock","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Error","thread":"353"}
```
Please re-open this issue in https://github.com/input-output-hk/ouroboros-network
Summary: At 2022-12-07 07:17:15 GMT, from slot 78831144 and block 8107959 on, there was an extraordinary break in block production. The next block was adopted at 07:21:59 GMT for slot 78831428 and block https://cardanoscan.io/block/8107960, so 284 seconds or almost 5 minutes without a new block adopted on chain. Multiple SPOs reported logs from the producing node showing a successful `TraceNodeIsLeader` > `TraceForgedBlock` > `AddedToCurrentChain`, but then followed by an unexpected `TraceDidntAdoptBlock`.

System info (please complete the following information): different block-producing nodes running on v1.35.3 or .4 on mainnet.

Screenshots and attachments: reports from at least 3 pool operators with forged but not adopted blocks can be found at https://forum.cardano.org/t/no-blocks-during-the-epoch-for-almost-5-minutes/111280/32

Additional context: https://input-output-hk.github.io/ouroboros-network/ouroboros-consensus/Ouroboros-Consensus-NodeKernel.html > TraceForgedBlock > TraceDidntAdoptBlock

So this is not a rare but statistically expected long slot gap caused by VRF dice probability. The pools had scheduled slots; they were up, running, and actually forged the block, but some unknown cause led them to ultimately not adopt it even locally, and hence also not announce it to the network. All this happened within a span of 5 minutes, or roughly ~15 blocks. It is unclear what caused it, and what then happened to get out of this situation.