Closed gufmar closed 1 year ago
Looking at the code for my own education, I could trace the possible origin of this message to the GC process. The trace is emitted when `getIsInvalidBlock` returns `Nothing`, and this function is defined as:

```haskell
fmap (fmap (fmap invalidBlockReason) . flip Map.lookup) <$> readTVar cdbInvalid
```

And `cdbInvalid` is only modified in the `garbageCollect` function. So it's as if the newly created block had been garbage collected before it could be propagated?
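To unpack the nested `fmap`s in that definition: reading `cdbInvalid` yields a map of known-invalid blocks, and `getIsInvalidBlock` returns a lookup function from a block's hash to `Maybe` its invalidity reason. A simplified pure sketch, with hypothetical stand-in types (the real `ChainDB` types differ):

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins for the real ChainDB types.
type HeaderHash = String

newtype InvalidBlockInfo = InvalidBlockInfo { invalidBlockReason :: String }

-- Pure analogue of the STM code above: given the current contents of
-- cdbInvalid, build the lookup function that getIsInvalidBlock returns.
-- `flip Map.lookup invalid` maps a hash to `Maybe InvalidBlockInfo`,
-- and the inner `fmap` projects out the reason.
mkIsInvalidBlock :: Map.Map HeaderHash InvalidBlockInfo
                 -> (HeaderHash -> Maybe String)
mkIsInvalidBlock invalid = fmap invalidBlockReason . flip Map.lookup invalid
```

The `TraceDidntAdoptBlock` path corresponds to this function returning `Nothing` for the freshly minted block's hash, i.e. the block is not known to be invalid.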
@gufmar Do you have a `PerformedGC` trace somewhere around this `TraceDidntAdoptBlock`?
As far as I can tell, that forum thread has five blocks of interest (by ascending slot):
| hash prefix | slot | blockno | predecessor | slot as UTC time | full hash |
|---|---|---|---|---|---|
| 9c1 | 78831128 | 8107958 | don't care | don't care | 9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac |
| 14a | 78831144 | 8107959 | 9c1 | 2022-12-07T07:17:15.00Z | 14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c |
| 15c | 78831181 | 8107959 | 9c1 | 2022-12-07T07:17:52.00Z | 15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803 |
| 0a2 | 78831351 | 8107959 | 9c1 | 2022-12-07T07:20:42.00Z | 0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139 |
| bbe | 78831428 | 8107960 | 14a | 2022-12-07T07:21:59.00Z | bbe6c6301d467bb8277e7c461782c8bc34ea62eefd893ecd1b747a0522ac58af |
9c1, 14a, and bbe are on the mainnet chain. The alarm began because 14a and bbe are 4m44s apart, which is "big" for mainnet Cardano. We have two typical hypotheses for why a gap might be that big.

1. The (random!) leader schedule happened to be empty during that interval. The chance of any given gap being >= 5 minutes is (0.95)^(5*60) = 2.075e-7. But there are a lot of slots in a year. This next arithmetic naively ignores overlap, but the rough order of magnitude is: 2.075e-7 * (365*24*60*60) = approx 6.5 times per year.
2. Some ledger calculation (accidentally) took too long, interfering by starving the leadership-checking thread of its access to the CPU. These blocks are around 8% of the way through the epoch; I don't know off the top of my head whether we have any known computation spikes around that slot.
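The arithmetic in the first hypothesis can be sanity-checked with a short sketch (assuming mainnet's active slot coefficient f = 0.05, i.e. each 1-second slot is empty with probability 0.95, and, as above, naively ignoring overlap between candidate gap positions):

```haskell
-- Probability that a given run of `seconds` consecutive slots is empty.
emptyGapProb :: Int -> Double
emptyGapProb seconds = 0.95 ^ seconds

-- Naive expected number of such gaps per year (ignoring overlap):
-- one candidate gap per slot, ~31.5 million slots per year.
expectedGapsPerYear :: Int -> Double
expectedGapsPerYear seconds =
  emptyGapProb seconds * fromIntegral (365 * 24 * 60 * 60 :: Int)

-- emptyGapProb 300        ~ 2.08e-7
-- expectedGapsPerYear 300 ~ 6.5
```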
A third (bonus 😄) factor is the timeouts:

```haskell
mustReplyTimeout <- Just <$> randomElem [90, 135, 180, 224, 269]
```

If no new block arrives before that timeout (randomized per upstream connection), then we disconnect from that peer. Depending on re-connection delays etc., that might amplify the duration of a stall in chain growth.
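`randomElem` presumably just picks one element uniformly; here is a hypothetical sketch (not the node's actual implementation), with the randomness source abstracted out so the choice can be made deterministic:

```haskell
-- Hypothetical sketch of randomElem: `pickIndex n` is expected to
-- return a uniformly random index in [0, n).
randomElemWith :: Monad m => (Int -> m Int) -> [a] -> m a
randomElemWith pickIndex xs = do
  i <- pickIndex (length xs)
  pure (xs !! i)
```

With a real generator, `pickIndex` would be something like `\n -> randomRIO (0, n - 1)`; each upstream connection then gets its own timeout drawn from [90, 135, 180, 224, 269].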
Thanks to those sharing logs, we do have some more specific information in this case 👍
In particular, we have two reports of `TraceDidntAdoptBlock`:

- BBHMM gave us 15c (on 1.35.3:950c4, I think?): https://forum.cardano.org/t/no-blocks-during-the-epoch-for-almost-5-minutes/111280/29
- Ada4Good gave us 0a2 (on v1.35.4:ebc7b, I think?): https://forum.cardano.org/t/no-blocks-during-the-epoch-for-almost-5-minutes/111280/34
(There's also a third "lost block" from staking4ada.org, but no logs etc. They were on v1.35.4 and recall seeing gLiveView say "invalid"; I don't know what to soundly infer from that, though. They did include Grafana graphs that showed very low CPU usage the whole time.)
Relevant facts:

- 14a's VRF output begins with `04...`, which means it probably has a stronger tiebreaker than 15c and 0a2. I doubt the logs would include the VRF output of 15c and 0a2 to confirm this, unfortunately. (Those two leading hexdigits are only greater than four other possible hexdigit pairs out of the 255 non-equivalent pairs.)
- Both logs go from `TraceStartLeadershipCheck` to `TraceDidntAdoptBlock` in less than 70ms and 60ms respectively. That means the blocks they forged were processed by the ChainDB quicker than that. Note that processed does not necessarily mean validated: if a better block is already in there, the ChainDB won't waste time validating the new block. But it's not unusual for a block to be validated in less than 100ms.

So, it could be the case that only 15c and 0a2 were minted during the gap (ie no one else was elected). If they had extended 14a instead of 9c1, then the gap would be smaller.
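A trivial check of that last count (under the implicit assumption above that a numerically lower VRF output wins the tiebreak): a leading hexdigit pair of `04` exceeds only four of the 255 other possible pairs.

```haskell
-- Of the hexdigit pairs (byte values 0x00..0xFF), count how many 0x04
-- is strictly greater than: just 0x00, 0x01, 0x02, and 0x03.
pairsBeatenBy04 :: Int
pairsBeatenBy04 = length [b | b <- [0x00 .. 0xFF :: Int], b < 0x04]
```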
But! Why didn't 15c and 0a2 extend 14a instead of its predecessor 9c1? The `TraceDidntAdoptBlock` message suggests a better 8107959 block was already in the ChainDB when these new blocks were added to it. I am puzzled: it seems unlikely that 14a would happen to arrive at just the wrong time at two nodes, 170 seconds apart, for this to happen.
Other possibilities:

- An invalid transaction made it through our defense, but `getIsInvalidBlock` was somehow not updated, so the minting thread reported `TraceDidntAdoptBlock` instead of `TraceForgedInvalidBlock`? Could "it took that long to flush the bad transaction" explain the gap size? (Assuming there are more like BBHMM and Ada4Good who just haven't been vocal.) This would require that 14a hadn't arrived at BBHMM within 37 seconds, which is possible but not typical, and that there was an invalid transaction in the new block, which seems unlikely. Or maybe another invalid 8107959 block that isn't showing up in these logs?
- Something is causing alignment between block validations and leadership checks, such that the leadership thread chooses its new block's predecessor, then a latest-greatest block is added to the ChainDB, then the leadership thread adds its newly minted block (which happens to lose a tiebreaker against the latest-greatest).
I noticed this while writing the above. Looking more closely at Ada4Good's logs now, I actually see it select 14a between `TraceStartLeadershipCheck` and `TraceDidntAdoptBlock`!

(I remain as mystified as ever about the duplication of `AddedToCurrentChain` and the absurd elision count 😬.)
```
TraceStartLeadershipCheck {"app":[],"at":"2022-12-07T07:20:42.00Z","data":{"kind":"TraceStartLeadershipCheck","slot":78831351},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"82143","sev":"Info","thread":"375"}
TraceNodeIsLeader {"app":[],"at":"2022-12-07T07:20:42.03Z","data":{"val":{"kind":"TraceNodeIsLeader","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Info","thread":"375"}
TraceForgedBlock {"app":[],"at":"2022-12-07T07:20:42.05Z","data":{"val":{"block":"0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139","blockNo":8107959,"blockPrev":"9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac","kind":"TraceForgedBlock","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Info","thread":"375"}
AddedToCurrentChain [14a] {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
<elisions> {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"kind":"LogValue","name":"before next, messages elided","value":{"contents":427653286419604,"tag":"PureI"}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
AddedToCurrentChain [14a] {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
TrySwitchToAFork [0a2] {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"block":{"hash":"0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139","kind":"Point","slot":78831351},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Info","thread":"365"}
TraceDidntAdoptBlock {"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"val":{"kind":"TraceDidntAdoptBlock","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Error","thread":"375"}
```
So: this is exactly the anticipated reason for ever seeing `TraceDidntAdoptBlock`. The node started its leadership check, chose to extend the tip of the chain (ie 9c1), minted that block (ie 0a2), but finished validating and selected a different, better block (ie 14a) before its newly minted block had a chance to be considered.

What remains odd is that it finished validating 14a 207 seconds after 14a was minted, and happened to do so exactly during its leadership attempt, which lasted between 60 and 70 milliseconds. That seems like real bad luck if there turns out to be no systematic correlation here after all...
I don't see this in the logs that BBHMM submitted. But I also don't see `TrySwitchToAFork`. So either they `grep`-ed out intervening messages, or their node took the alternate `TryAddToCurrentChain` path (which has Debug severity as opposed to `TrySwitchToAFork`'s Info severity, since it's the typical path; although leading is so rare, perhaps we should make it Info too, just for this kind of examination), in which case they wouldn't fit this hypothesis.

But the two leadership tests were 170 seconds apart; it seems very unlikely that two nodes would be validating 14a at such different times and also exactly during their leadership checks.
Hi @nfrisby, thank you for analyzing. I found the same type of scenario in my logs. Let me know if you need any more data that could help with this.
```
TraceStartLeadershipCheck {"app":[],"at":"2022-12-07T07:17:52.00Z","data":{"chainDensity":4.78205e-2,"credentials":"Cardano","delegMapSize":1237930,"kind":"TraceStartLeadershipCheck","slot":78831181,"utxoSize":9404848},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"1429","sev":"Info","thread":"214"}
TraceNodeIsLeader {"app":[],"at":"2022-12-07T07:17:52.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"}
TraceForgedBlock {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"credentials":"Cardano","val":{"block":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","blockNo":8107959,"blockPrev":"9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac","kind":"TraceForgedBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"}
AddedToCurrentChain {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Notice","thread":"204"}
TrySwitchToAFork {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"block":{"hash":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","kind":"Point","slot":78831181},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Info","thread":"204"}
TraceDidntAdoptBlock {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"credentials":"Cardano","val":{"kind":"TraceDidntAdoptBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Error","thread":"214"}
```
I have observed the same for the first time this epoch. Will update when I learn more from the logs (BP and both relays). Here is a first snippet from the BP; did anyone see the same error in the same time frame?
```
{"app":[],"at":"2023-01-02T05:13:57.00Z","data":{"chainDensity":4.7301885e-2,"credentials":"Cardano","delegMapSize":1251425,"kind":"TraceStartLeadershipCheck","slot":81070146,"utxoSize":9553777},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"credentials":"Cardano","val":{"block":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","blockNo":8217108,"blockPrev":"9306145db376fd2aa74ea1bb24ed915f2f69a66d40dfdef2f68e82b53141a337","kind":"TraceForgedBlock","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"block":{"hash":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","kind":"Point","slot":81070146},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"833","sev":"Info","thread":"343"}
{"app":[],"at":"2023-01-02T05:13:57.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceDidntAdoptBlock","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Error","thread":"353"}
```
Please re-open this issue in https://github.com/input-output-hk/ouroboros-network
Summary: At 2022-12-07 07:17:15 GMT, from slot 78831144 and block 8107959 on, there was an extraordinary break in block production. The next block was adopted at 07:21:59 GMT for slot 78831428 and block https://cardanoscan.io/block/8107960, so 284 seconds or almost 5 minutes without a new block adopted on chain. Multiple SPOs reported logs from the producing node showing a successful `TraceNodeIsLeader` > `TraceForgedBlock` > `AddedToCurrentChain`, but then followed by an unexpected `TraceDidntAdoptBlock`.

System info (please complete the following information): different block-producing nodes running on v1.35.3 or .4 on mainnet.

Screenshots and attachments: reports from at least 3 pool operators with forged but not adopted blocks can be found at https://forum.cardano.org/t/no-blocks-during-the-epoch-for-almost-5-minutes/111280/32

Additional context: https://input-output-hk.github.io/ouroboros-network/ouroboros-consensus/Ouroboros-Consensus-NodeKernel.html > TraceForgedBlock > TraceDidntAdoptBlock

So this is not a rare but statistically expected long slot gap caused by VRF dice probability. The pools had scheduled slots; they were up, running, and actually forged the block, but some unknown cause led them to ultimately not adopt it even locally, and hence also not announce it to the network. All this happened within a span of 5 minutes, or roughly ~15 blocks. It is unclear what caused it, and what then happened to get out of this situation.