Closed: gitmachtl closed this issue 2 years ago.
Height battles are on the rise. I'll work on analyzing chained height battles too, but wanted to drop this data here as part of it.
@papacarp I note that pooltool.io/realtime is reporting a surprisingly low number of nodes for certain blocks (i.e. fewer than ten) - this seems unusual.
Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred when the superseded block had already been received by the about-to-forge node? My working hypothesis is some sort of connectivity partition; I'm looking for evidence that this assumption is incorrect.
@gitmachtl - can I ask, are all these 'V7' systems deployed using the same scripts? Just seeing if there could be a common configuration issue underlying this. Note: one of them only has one reporting node.
@njd42 As far as I can tell - and I have spoken to a bunch of those SPOs - they are all running completely different systems: different deploy scripts, different locations, different kinds of hardware (VMs, dedicated servers, ...).
The reported node counts only cover nodes that are actively reporting to pooltool, but the height battles can be confirmed by others too.
For what it's worth: we are running our own compiled node in Docker (own Dockerfile) on AWS ECS and migrated to 1.35.1 (too) early.
Haven't done the amount of investigation that @gitmachtl has, so I can only share sparse findings for now, but what I could see is that our BP would mint a block omitting the preceding block (thus getting dropped by the next leader) even though there were sometimes 10+ slots in between. Our relays never received that preceding block (checked logs), so naturally the BP didn't either. Will keep an eye on this issue and report when I've done more checks.
Example: block 7530726
Block 7530725 arrived at our BP:
{
"app": [],
"at": "2022-07-22T12:25:00.17Z",
"data": {
"chainLengthDelta": 1,
"kind": "TraceAddBlockEvent.AddedToCurrentChain",
"newtip": "4e4e385bae46bc9edb06c5203a70ccf64dabc64724aad4f4256f1f99cbaaf80d@66926409"
},
"env": "1.35.1:c7545",
"host": "ip-172-2",
"loc": null,
"msg": "",
"ns": [
"cardano.node.ChainDB"
],
"pid": "82",
"sev": "Notice",
"thread": "12135"
}
Then our block minted and reached our relays (three in total, each on 1.35.1, one with p2p enabled). Then 18 slots later the block from 4f3410f074e7363091a1cc1c21b128ae423d1cce897bd19478e534bb arrived with hash 3150a0d4502f1d0995c4feae3fcb93dd505671cf8dc84bbf0b761e2ee64d70dc and our relays switched fork.
The number of incoming connections on our relay is healthy and our block propagation seems otherwise fine.
@rene84 What if you have two relays, one on 1.34.x and one on 1.35.x? Does your BP receive that preceding block?
I think that the connectivity partition is a much more likely root cause than an issue with chain selection given the evidence at hand. However, such a partition is highly unlikely to occur through pure random chance, which suggests some sort of trigger.
Could you scan through your logs looking for, say, ErrorPolicySuspendPeer messages? What we are looking for is a rash of those occurring at around the same time, causing the 1.34.x and 1.35.x nodes to drop connections with each other.
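To make that scan concrete, here is a minimal sketch that clusters ErrorPolicySuspendPeer events by minute. The file name is a placeholder and the JSON-per-line layout with an "at" timestamp is an assumption based on the log snippets in this thread; adjust both to your own relay logs.

```shell
# Build a small sample log in the assumed JSON-per-line shape;
# point the grep at your real relay log files instead.
cat > sample-relay.json <<'EOF'
{"at":"2022-07-22T12:25:00.17Z","ns":["cardano.node.ErrorPolicy"],"sev":"Warning","data":{"kind":"ErrorPolicySuspendPeer"}}
{"at":"2022-07-22T12:25:31.02Z","ns":["cardano.node.ErrorPolicy"],"sev":"Warning","data":{"kind":"ErrorPolicySuspendPeer"}}
{"at":"2022-07-22T13:01:05.44Z","ns":["cardano.node.ChainDB"],"sev":"Notice","data":{"kind":"TraceAddBlockEvent.AddedToCurrentChain"}}
EOF

# Count ErrorPolicySuspendPeer events per minute; a rash of them clustered
# around one timestamp would support the partition hypothesis.
grep -h 'ErrorPolicySuspendPeer' sample-relay.json \
  | sed -n 's/.*"at":"\([^"]*\)".*/\1/p' \
  | cut -c1-16 | sort | uniq -c
```

On the sample data this prints a single bucket with a count of 2 for minute 2022-07-22T12:25.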
Has this behaviour been observed on non P2P 1.35.2 nodes, or is this limited to P2P only nodes?
Great work @gitmachtl I'm wondering if the issue you're reporting is somehow related to https://github.com/input-output-hk/cardano-node/issues/4226.
In my case, given the importance of the upgrade, I only bumped up the version of one of my co-located relays. What I've noticed is that, multiple times, the BP running 1.34.1 would stop talking to the relay running 1.35.1. Only way to restore was to bounce the relay on 1.35.1.
It looks like 1.35.1 nodes eventually stop talking to 1.34.1 and maybe start generating heights battles?
As far as I know the SPOs were running 1.35.1 nodes in non-p2p mode on mainnet. It is really strange that 1.35.* nodes keep sticking with the block of other 1.35.* nodes and not picking up a resolved 1.34.* block. 1.35.2 was not used, I guess; it's just too fresh out of the box.
Could be, but the same relays/BPs later on are producing normal blocks.
These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p, the third one was on p2p. Times are in UTC.
I am not sure if we're mixing up two different issues here. I would love to hear @coot's opinion on the height battle issues on the mainnet; it's really outstanding and not just a glitch.
We haven't seen such long height battles before; it certainly deserves attention.
The strange thing is that, with those double height battles, the 1.35.* node only builds on the lost block of the 1.35.* node before it. Normally it should pick up the block from the previous winner, of course. And the fact that it then happened again is really bad, creating mini forks all the time. I was on a full 1.35.1 setup at this time on mainnet, relays and BPs. I reverted back to a full 1.34.1 setup and have had no issues like that since then. I know that this is hard to simulate, because on the testnet we are now in the Babbage era and all nodes are 1.35.*, while on mainnet we have the mixed situation. But SPOs are now aware, and it could lead to SPOs only updating at the very last moment to stay clear of such bugs. That's not a good way to go into this HF, I would say.
P2P nodes use a different trace message; you'd need to search for TraceDemoteAsynchronous and Error, as here (we ought to change it to something better, Error is terrible for grepping).
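For the P2P case, a sketch of the same kind of scan. The file name is a placeholder and the exact message layout is an assumption; adjust the patterns to your own p2p relay logs.

```shell
# Sample P2P-style log lines in the assumed shape; replace with real logs.
cat > sample-p2p.json <<'EOF'
{"at":"2022-07-22T12:26:10.05Z","ns":["cardano.node.PeerSelection"],"sev":"Warning","data":{"kind":"TraceDemoteAsynchronous"},"msg":"Error ..."}
{"at":"2022-07-22T12:40:02.91Z","ns":["cardano.node.PeerSelection"],"sev":"Info","data":{"kind":"PeerSelectionCounters"}}
EOF

# Asynchronous demotions that mention an Error, grouped by minute.
grep -h 'TraceDemoteAsynchronous' sample-p2p.json | grep 'Error' \
  | sed -n 's/.*"at":"\([^"]*\)".*/\1/p' | cut -c1-16 | sort | uniq -c
```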
What is consistent with what I see in the logs is that some blocks just never make it, even after 10 or more seconds; I've seen that both incoming and outgoing. It seems more likely due to a propagation issue in the network, e.g. 1.34.x and 1.35.x not talking, rather than a bug in the fork selection mechanism in the node itself.
Can I ask for logs from a block producer which was connected to a 1.35.x node at the time when it lost a slot battle? It would help us validate our hypothesis. If you don't want to share them publicly here, you can send them to my IOHK email address (marcin.szamotulski@iohk.io) or share them through Discord.
Sure. Afk atm, so I can share that in about 5 hours from now. Note that I can only help you with a BP on 1.35.1 connected to a relay on 1.35.1.
I am not sure if this will help, but please send. It's more interesting to see a 1.34.1 BP connected to a 1.35.1 relay.
Here is another example:
And another one. In this example I know for sure that PILOT is running 1.35.0 on all mainnet nodes in normal mode (not p2p); same goes for NETA1, 1.35.0 on all mainnet nodes (not p2p).
I searched for TraceDemoteAsynchronous but did not get any results. I'm pretty confident we haven't enabled the correct trace for that log message. I do see regular PeerSelectionCounters messages and the number of hotPeers changing from time to time, but no Error log message on the entire day of the 22nd of July (UTC).
@renesecur you need to enable TracePeerSelection and TracePeerSelectionActions.
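A minimal sketch of the corresponding entries in the node's configuration file, following the same key style as the TraceChainDb/TraceChainSyncClient flags quoted later in this thread (verify the exact names against your own config):

```json
{
  "TracePeerSelection": true,
  "TracePeerSelectionActions": true
}
```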
We would like to get logs from a 1.34.1 relay connected to a 1.35.x BP node which was part of any of the slot battles.
One more favour to ask: those of you who are running a 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.
eu1.stakecool.io:4001, 1.35.2, p2p enabled
eu2.stakecool.io:4001, 1.35.2, p2p disabled
eu3.stakecool.io:4002, 1.34.1, p2p disabled
They are serving six 1.35.2 block producers. Let me know if you want me to add your relay to their topology
@papacarp Great stats. Could you make a plot that tracks slotbattles / (total block bytes in epoch) and heightbattles / (total block bytes in epoch)? The number of height battles is expected to increase with increased utilization, and looking back at earlier epochs they were more common a few months ago.
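As a sketch of that normalization (the CSV name, columns, and numbers here are hypothetical; the real figures would come from pooltool or db-sync data), height battles per gigabyte of block data per epoch could be computed like this:

```shell
# Hypothetical per-epoch data: epoch, height battles, total block bytes.
cat > battles.csv <<'EOF'
epoch,heightbattles,total_block_bytes
350,12,3000000000
351,18,3600000000
EOF

# Normalize: height battles per GB of block data in each epoch.
# prints: 350 4.000
#         351 5.000
awk -F, 'NR > 1 { printf "%s %.3f\n", $1, $2 / ($3 / 1e9) }' battles.csv
```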
@feqifei thanks. Have you observed any invalid block on the 1.34.1 relay so far?
I posted it in the SPO best practice group; let's see if we can collect some full 1.35.* setups here.
Not anymore since 2022-07-23 08:00 UTC. My BP lost several blocks due to bad/wrong propagation or rejection by 1.34.1 nodes, but only during the 24 hours starting from 2022-07-22 07:30 UTC. Unfortunately I have no logs for that timeframe.
18.202.232.130
3.249.180.234
3.249.41.89
54.74.196.6
These relays are running 1.35.0 on port 3001. All BP nodes were upgraded to 1.35.0. In the topology, 4 of our 7 relays run 1.35.0; the remaining 3 run 1.34.1.
Relay1.thescenepool.com 6000
Relay2.thescenepool.com 6000
Relay3.thescenepool.com 6000
Relay4.thescenepool.com 6000
All running 1.35.0
Note my BP will not produce blocks due to low delegation but hopefully you can see something.
It happened again.
Alright, reverting my rollback process, I'll keep running Berry on 1.35.0 until told otherwise on this thread.
relay.pasklab.com : 3001 | 3002 | 3003
One more case: MS9 on v7, LEAF on v6. We noticed that not-adopted blocks are propagated by very few nodes (between 50 and 60). We suspect that those 50-60 nodes are the ones that have already moved to a 1.35.x version, but we have no proof.
Maybe @papacarp can give us more information about those 53 nodes that report block hash 5ab7d7741fce303623790f551be495375728a61bb9febf1a2e81dcb1948b2091
My BP node's (1.35.0) log for the 5ab7d7741fce303623790f551be495375728a61bb9febf1a2e81dcb1948b2091 block:
[4eefd458:cardano.node.ChainDB:Notice:220] [2022-07-26 12:13:55.14 UTC] Chain extended, new tip: 5ab7d7741fce303623790f551be495375728a61bb9febf1a2e81dcb1948b2091 at slot 67271343
Then why were other blocks minted on v7 (e.g. 9be22a26e9a650813b37ef457ecec6b7b8ac2493d865f9f5b5ec63ffb0b5f0d7) confirmed by 290 nodes? Or does this involve only height battles?
I caught one fish but I think it's too small. One of my v7 BPs lost a height battle against a v6 BP, but the number of nodes that propagate its block is normal, so I don't think it's the situation we are trying to analyze. By the way I have logs of BP, 1.35.2 and 1.34.1 relays.
Edit: @coot , I sent you logs by email
If anyone is running a relay with a pre-1.35.x version, it would be interesting to see how that node deals with a v7 block. 68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 is a suspected v7 block that lost out to 80572fdeb5dd826c56ff4cae57ce2d0b17b0fd6aed1cfac692431a558293fe5e.
Grep for 68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 and look for errors. If you are running with TraceChainSyncClient enabled, the TraceDownloadedHeader messages will give you the addresses of the peers that presented your node with that block. Grep for those IP addresses to see what your node does with that peer.
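A sketch of that workflow. The log file name and the exact field layout of the TraceDownloadedHeader event are assumptions (modeled on the log snippets in this thread); adjust them to your node's actual output.

```shell
HASH=68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9

# Sample log line in the assumed shape; point the greps at your relay logs.
cat > sample-sync.json <<'EOF'
{"at":"2022-07-26T12:25:27.91Z","ns":["cardano.node.ChainSyncClient"],"data":{"kind":"ChainSyncClientEvent.TraceDownloadedHeader","peer":"54.74.196.6:3001","block":"68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9"}}
EOF

# 1. Every mention of the suspect block.
grep -h "$HASH" sample-sync.json

# 2. Peer addresses that presented the header, for follow-up grepping.
grep -h "$HASH" sample-sync.json | grep 'TraceDownloadedHeader' \
  | sed -n 's/.*"peer":"\([^"]*\)".*/\1/p' | sort -u
```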
If anyone is able to give me the CBOR bytes (hex would be nice) of one of these v7 blocks that loses such a height battle, I can check to see if there is any incompatibility in the serialization between v6 and v7.
I see the 68c39 block in my two 1.35.2 relay logs but not in my two 1.34.1 relay logs
Interesting. Do you have chaindb and chainsync client tracers enabled on the 1.34.1 relays?
"TraceChainDb": true,
"TraceChainSyncClient": true,
@karknu ChainDB true, ChainSyncClient false. Both 1.34.1 nodes were busy checking some DNS subscriptions; they didn't receive any block. 42 seconds later both received the adversary block 80572...
We noticed that not-adopted blocks are propagated by very few nodes (between 50 and 60). We suspect that those 50-60 nodes are those ones that already moved to 1.35.x version, but we have no proof.
Yeah, that's because it's the number of nodes running 1.35.* that report to pooltool; this signals that only the other v7 nodes are getting/reporting the block.
@karknu About your example: I think the reason why the 1.34.1 nodes didn't receive information about block 68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 is that each node had already marked its predecessor 3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437, forged by another v7 BP, as invalid.
So I suppose the 1.34.1 nodes had already discarded that fork and didn't consider it. More interesting is the error raised when 1.34.1 decided to consider 3677... invalid:
{"host":"ff01-rl0","pid":"448636","loc":null,"at":"2022-07-26T12:25:28.73Z","ns":["cardano.node.ChainDB"],"sev":"Error","env":"1.34.1:73f9a","data":{"kind":"TraceAddBlockEvent.AddBlockValidation.InvalidBlock","error":"ExtValidationErrorLedger (HardForkLedgerErrorFromEra S (S (S (S (Z (WrapLedgerErr {unwrapLedgerErr = BBodyError (BlockTransitionError [ShelleyInAlonzoPredFail (LedgersFailure (LedgerFailure (UtxowFailure (WrappedShelleyEraFailure (UtxoFailure (FeeTooSmallUTxO (Coin 1006053) (Coin 1001829))))))),ShelleyInAlonzoPredFail (LedgersFailure (LedgerFailure (UtxowFailure (WrappedShelleyEraFailure (UtxoFailure (FeeTooSmallUTxO (Coin 1011641) (Coin 1007417)))))))])}))))))","block":{"hash":"3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437","kind":"Point","slot":67272037}},"msg":"","thread":"406","app":[]}
Instead, 1.35.2 node considered this block (or at least its header) as valid {"app":[],"at":"2022-07-26T12:25:28.24Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437@67272037"},"env":"1.35.2:7612a","host":"gv01-rl0","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"2554846","sev":"Notice","thread":"400"}
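A sketch for surfacing such rejections across a set of relay logs. The file name is a placeholder, and the sample line below is abbreviated from the 1.34.1 InvalidBlock entry quoted above; point the grep at your real ChainDB logs.

```shell
# Abbreviated sample rejection event; replace with your real relay logs.
cat > sample-chaindb.json <<'EOF'
{"at":"2022-07-26T12:25:28.73Z","ns":["cardano.node.ChainDB"],"sev":"Error","data":{"kind":"TraceAddBlockEvent.AddBlockValidation.InvalidBlock","error":"...FeeTooSmallUTxO...","block":{"hash":"3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437","kind":"Point","slot":67272037}}}
EOF

# List the hashes of blocks this node rejected as invalid.
grep -h 'AddBlockValidation.InvalidBlock' sample-chaindb.json \
  | sed -n 's/.*"hash":"\([0-9a-f]*\)".*/\1/p' | sort -u
```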
@feqifei do you have a way of getting the serialized block corresponding to 3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437? That would allow me to investigate if there is a difference in the fee calculation between 1.34 and 1.35.
Hi Jared, I don't know how to satisfy your request, but consider that this is a block made by another BP (not mine) and it didn't reach the blockchain. So even if I had enough skill to answer your question (and I have not :) ), I really don't know how to collect this info.
Strange Issue
So, I will post a bunch of pics here, screenshots taken from pooltool.io. Block flow is from bottom to top. These are only a few examples; there are many more!
This is happening on mainnet right now with nodes running 1.35.*. There are many occasions with double and triple height battles where the newer v7 1.35.* nodes pick up the wrong parent block and try to build on another v7 block. So the v6 1.34.* nodes are winning those all the time.
I personally lost 10 height battles against v6 nodes within a 1-2 day window. That was 100% of all the height battles I had.
It's always the same pattern: there is a height battle that a v7 node loses against a v6 node. If two nodes are also scheduled for the next block and one of them is a v7 node, it picks up the wrong, lost block hash from the previous v7 node and builds on it. Of course it loses against the other v6 node, which is building on the correct block. But as you can see in the example below, this can span multiple slot heights/blocks ⚠️
This is a Vasil-HF blocker IMO, because it would lead to the situation that SPOs only upgrade to 1.35.* at the last possible moment before the HF, giving the ones staying on 1.34.1 an advantage. Not a good idea; it must be sorted out beforehand. QA team, please start an investigation on that asap, thx! 🙏
Here is a nightmare one, a v7 block built on top of another v7 node's block (green path):