IntersectMBO / cardano-node

The core component used to participate in the Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0

[BUG] - 1.35.* v7 nodes on Mainnet are picking up the wrong parent-block - Vasil HF Blocker? #4228

Closed. gitmachtl closed this issue 2 years ago.

gitmachtl commented 2 years ago

Strange Issue

So, I will post a bunch of pics here, screenshots taken from pooltool.io. Block flow is from bottom to top. These are only a few examples; there are many more!

This is happening on mainnet right now with nodes running 1.35.*. There are many occasions with double and triple height battles where newer v7 1.35.* nodes pick up the wrong parent block and try to build on another v7 block, so the v6 1.34.* nodes win those battles every time.

I personally lost 10 height battles against v6 nodes within a 1-2 day window. That was 100% of all the height battles I had.

It's always the same pattern: there is a height battle that a v7 node loses against a v6 node. If two nodes are scheduled for the next block and one of them is a v7 node, it picks up the hash of the losing block from the previous v7 node and builds on it. Of course it then loses against the other v6 node, which is building on the correct block. And as you can see in the example below, this can span multiple slot heights/blocks ⚠️

This is a Vasil HF blocker IMO, because it would lead to SPOs only upgrading to 1.35.* at the last possible moment before the HF, giving the ones staying on 1.34.1 an advantage. Not a good idea; it must be sorted out beforehand. QA team, please start an investigation on this asap, thx! 🙏

[two pooltool.io screenshots of the height battles described above]

Here is a nightmare one, a v7 node built on top of another v7 node (green path):

[screenshot]

[five more pooltool.io screenshots of similar height battles]

papacarp commented 2 years ago

Height battles are on the rise. I'll work on analyzing chained height battles as well, but wanted to drop this data here as part of it.

[chart: height battle data]

njd42 commented 2 years ago

@papacarp I note that pooltool.io/realtime is reporting a surprisingly low number of nodes for certain blocks (i.e. fewer than ten) - this seems unusual.

Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred when the superseded block had already been received by the about-to-forge node? My working hypothesis is some sort of connectivity partition; I'm looking for evidence that this assumption is incorrect.

njd42 commented 2 years ago

@gitmachtl - can I ask, are all these 'v7' systems deployed using the same scripts? Just checking whether a common configuration issue could underlie this. Note the attached screenshot (2022-07-24, 18:45:43): it only has one reporting node.

gitmachtl commented 2 years ago

@njd42 As far as I can tell - and I have spoken to a bunch of those SPOs - they are all running completely different systems: different deploy scripts, different locations, different kinds of hardware (VM, dedicated server, ...).

The number of reporting nodes only counts the ones that actively report to pooltool, but the height battles can be confirmed by others too.

rene84 commented 2 years ago

For what it's worth: we are running our own compiled node in Docker (own Dockerfile) on AWS ECS and migrated to 1.35.1 (too) early.

I haven't done the amount of investigation that @gitmachtl has, so I can only share sparse findings for now. What I could see is that our BP would mint a block that omitted the preceding block (and thus got dropped by the next leader), even though there were sometimes 10+ slots in between. Our relays never received that preceding block (I checked the logs), so naturally the BP didn't either. Will keep an eye on this issue and report when I've done more checks.

renesecur commented 2 years ago

Example: block 7530726

Block 7530725 arrived at our BP:

{
    "app": [],
    "at": "2022-07-22T12:25:00.17Z",
    "data": {
        "chainLengthDelta": 1,
        "kind": "TraceAddBlockEvent.AddedToCurrentChain",
        "newtip": "4e4e385bae46bc9edb06c5203a70ccf64dabc64724aad4f4256f1f99cbaaf80d@66926409"
    },
    "env": "1.35.1:c7545",
    "host": "ip-172-2",
    "loc": null,
    "msg": "",
    "ns": [
        "cardano.node.ChainDB"
    ],
    "pid": "82",
    "sev": "Notice",
    "thread": "12135"
}

Then our block was minted and reached our relays (three, each on 1.35.1, one with p2p enabled). 18 slots later the block from 4f3410f074e7363091a1cc1c21b128ae423d1cce897bd19478e534bb arrived with hash 3150a0d4502f1d0995c4feae3fcb93dd505671cf8dc84bbf0b761e2ee64d70dc and our relays switched fork.

[screenshot]

The number of incoming connections on our relay is healthy and our block propagation seems otherwise fine.

andreabravetti commented 2 years ago

@rene84 What if you have two relays, one on 1.34.x and one on 1.35.x? Does your BP receive that preceding block?

njd42 commented 2 years ago

I think that the connectivity partition is a much more likely root cause than an issue with chain selection given the evidence at hand. However, such a partition is highly unlikely to occur through pure random chance, which suggests some sort of trigger.

Could you scan through your logs looking for, say, ErrorPolicySuspendPeer messages? What we are looking for is a rash of those occurring at around the same time, causing the 1.34.x and 1.35.x nodes to drop connections with each other.
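For anyone who wants to do that scan, a minimal sketch, assuming the node writes one JSON object per line (as in the log excerpts further down in this thread); the file name node.json is a placeholder for your own log path:

import json
from collections import Counter

# Count ErrorPolicySuspendPeer occurrences per minute in a JSON-per-line node log.
# "node.json" is a placeholder path; the "at"/"data" field names follow the log
# excerpts quoted elsewhere in this thread.
per_minute = Counter()
with open("node.json") as fh:
    for line in fh:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not single JSON objects
        if "ErrorPolicySuspendPeer" in json.dumps(entry.get("data", {})):
            per_minute[entry.get("at", "")[:16]] += 1  # bucket like "2022-07-22T12:25"

for minute, count in sorted(per_minute.items()):
    print(minute, count)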

coot commented 2 years ago

Has this behaviour been observed on non-P2P 1.35.2 nodes, or is it limited to P2P-only nodes?

nemo83 commented 2 years ago

Great work @gitmachtl! I'm wondering if the issue you're reporting is somehow related to https://github.com/input-output-hk/cardano-node/issues/4226.

In my case, given the importance of the upgrade, I only bumped the version of one of my co-located relays. What I've noticed is that, multiple times, the BP running 1.34.1 would stop talking to the relay running 1.35.1. The only way to restore it was to bounce the relay on 1.35.1.

It looks like 1.35.1 nodes eventually stop talking to 1.34.1 nodes and maybe start generating height battles?

gitmachtl commented 2 years ago

Has this behaviour been observed on non-P2P 1.35.2 nodes, or is it limited to P2P-only nodes?

As far as I know the SPOs were running 1.35.1 nodes in non-p2p mode on mainnet. It is really strange that 1.35.* nodes keep sticking with the block of other 1.35.* nodes instead of picking up the resolved 1.34.* block. 1.35.2 was not used, I guess; it's just too fresh out of the box.

gitmachtl commented 2 years ago

Great work @gitmachtl! I'm wondering if the issue you're reporting is somehow related to #4226.

In my case, given the importance of the upgrade, I only bumped the version of one of my co-located relays. What I've noticed is that, multiple times, the BP running 1.34.1 would stop talking to the relay running 1.35.1. The only way to restore it was to bounce the relay on 1.35.1.

It looks like 1.35.1 nodes eventually stop talking to 1.34.1 nodes and maybe start generating height battles?

Could be, but the same relays/BPs later on are producing normal blocks.

renesecur commented 2 years ago

These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p; the third one was running p2p.

[three screenshots]

In UTC

gitmachtl commented 2 years ago

I am not sure if we're mixing up two different issues here. I would love to hear @coot's opinion on the height battle issue on mainnet; it's really outstanding and not just a glitch.

coot commented 2 years ago

We haven't seen such long height battles before; it certainly deserves attention.

gitmachtl commented 2 years ago

The strange thing is that with those double height battles, the 1.35.* node only builds on the lost block of the 1.35.* node before it. Normally it should of course pick up the block from the previous winner. And the fact that it then happened again is really bad, creating mini forks all the time. I was on a full 1.35.1 setup on mainnet at the time, relays and BPs. I reverted back to a full 1.34.1 setup and have had no issues like that since then. I know this is hard to simulate, because on the testnet we are now in the Babbage era and all nodes are 1.35.*, while on mainnet we have the mixed situation. But SPOs are now aware, and it could lead to SPOs only updating at the very last moment to stay clear of such bugs. That's not a good way to go into this HF, I would say.

coot commented 2 years ago

These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p; the third one was running p2p.

P2P nodes use a different trace message; you'd need to search for TraceDemoteAsynchronous together with Error, as here (we ought to change that to something better, Error is terrible for grepping).
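In script form, the equivalent search (same caveats as the sketch earlier: JSON-per-line logs, placeholder file name) is simply:

# Print P2P demotion events that carry an error; "node.json" is a placeholder path.
with open("node.json") as fh:
    for line in fh:
        if "TraceDemoteAsynchronous" in line and "Error" in line:
            print(line.rstrip())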

rene84 commented 2 years ago

What is consistent with what I see in the logs is that some blocks just never make it, even after 10 or more seconds. I've seen that both incoming and outgoing. It seems more likely to be a propagation issue in the network, e.g. 1.34.x and 1.35.x not talking, rather than a bug in the fork selection mechanism in the node itself.

coot commented 2 years ago

Can I ask for logs from a block producer which was connected to a 1.35.x node, from the time when it lost a slot battle? It would help us validate our hypothesis. If you don't want to share them publicly here, you can send them to my IOHK email address (marcin.szamotulski@iohk.io) or share them through Discord.

rene84 commented 2 years ago

Sure. AFK atm, so I can share that in about 5 hours from now. Note that I can only help you with a BP on 1.35.1 connected to a relay on 1.35.1.

coot commented 2 years ago

Sure. AFK atm, so I can share that in about 5 hours from now. Note that I can only help you with a BP on 1.35.1 connected to a relay on 1.35.1.

I am not sure if this will help, but please send it. It would be more interesting to see a 1.34.1 BP connected to a 1.35.1 relay.

gitmachtl commented 2 years ago

Here is another example:

[screenshot]

gitmachtl commented 2 years ago

And another one. In this example I know for sure that PILOT is running 1.35.0 on all mainnet nodes in normal mode (not p2p). The same goes for NETA1: 1.35.0 on all mainnet nodes (not p2p).

[screenshot]

renesecur commented 2 years ago

These are the occurrences of ErrorPolicySuspendPeer on the three relays on the 22nd of July. The first two were on 1.35.1 without p2p; the third one was running p2p.

P2P nodes use a different trace message; you'd need to search for TraceDemoteAsynchronous together with Error, as here (we ought to change that to something better, Error is terrible for grepping).

I searched for TraceDemoteAsynchronous but did not get any results. I'm pretty confident we haven't enabled the correct tracer for that log message. I do see regular PeerSelectionCounters and the number of hotPeers changing from time to time, but no Error log messages on the entire day of the 22nd of July (UTC).

coot commented 2 years ago

@renesecur you need to enable TracePeerSelection and TracePeerSelectionActions.
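For what it's worth, that should mean flipping the corresponding flags in the node's configuration file; a minimal fragment in the same style as the tracer snippet quoted further down (only these two keys change, the rest of the config stays as it is):

  "TracePeerSelection": true,
  "TracePeerSelectionActions": true,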

coot commented 2 years ago

We would like to get logs from a 1.34.1 relay connected to a 1.35.x BP node which was part of any of the slot battles.

coot commented 2 years ago

One more favour to ask, those of you who are running 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.

feqifei commented 2 years ago

One more favour to ask, those of you who are running 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.

eu1.stakecool.io:4001, 1.35.2, p2p enabled
eu2.stakecool.io:4001, 1.35.2, p2p disabled
eu3.stakecool.io:4002, 1.34.1, p2p disabled

They are serving six 1.35.2 block producers. Let me know if you want me to add your relay to their topology

karknu commented 2 years ago

Height battles are on the rise. I'll work on analyzing chained height battles as well, but wanted to drop this data here as part of it.

[chart: height battle data]

@papacarp Great stats. Could you make a plot that tracks slot battles / (total block bytes in epoch) and height battles / (total block bytes in epoch)? The number of height battles is expected to increase with increased utilization, and looking back at earlier epochs they were more common a few months ago.
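To make the normalisation concrete, a small sketch of the metric being asked for, assuming a hypothetical CSV export with columns epoch, slot_battles, height_battles and total_block_bytes (battles.csv and the column names are made up for illustration):

import csv

# Battles per megabyte of block data per epoch, as suggested above.
# "battles.csv" and its column names are assumptions about the export format.
with open("battles.csv") as fh:
    for row in csv.DictReader(fh):
        mb = float(row["total_block_bytes"]) / 1e6
        print(
            row["epoch"],
            round(float(row["slot_battles"]) / mb, 3),
            round(float(row["height_battles"]) / mb, 3),
        )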

coot commented 2 years ago

eu1.stakecool.io:4001, 1.35.2, p2p enabled
eu2.stakecool.io:4001, 1.35.2, p2p disabled
eu3.stakecool.io:4002, 1.34.1, p2p disabled

@feqifei thanks. Have you observed any invalid blocks on the 1.34.1 relay so far?

gitmachtl commented 2 years ago

One more favour to ask, those of you who are running 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.

I posted it in the SPO best-practice group; let's see if we can collect some full 1.35.* setups here.

feqifei commented 2 years ago

eu1.stakecool.io:4001, 1.35.2, p2p enabled
eu2.stakecool.io:4001, 1.35.2, p2p disabled
eu3.stakecool.io:4002, 1.34.1, p2p disabled

@feqifei thanks. Have you observed any invalid blocks on the 1.34.1 relay so far?

Not any more since 2022-07-23 8:00 UTC. My BP lost several blocks due to bad propagation or rejection by 1.34.1 nodes only during the 24 hours starting from 2022-07-22 7:30 UTC. Unfortunately I have no logs from that timeframe.

stefiix92 commented 2 years ago

One more favour to ask, those of you who are running 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.

I posted it in the SPO best-practice group; let's see if we can collect some full 1.35.* setups here.

18.202.232.130
3.249.180.234
3.249.41.89
54.74.196.6

These relays are running 1.35.0 on port 3001. All BP nodes were upgraded to 1.35.0. In the topology we have 4/7 relays running 1.35.0; the remaining 3 are running 1.34.1.

TheScenePool commented 2 years ago

One more favour to ask, those of you who are running 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.

Relay1.thescenepool.com 6000
Relay2.thescenepool.com 6000
Relay3.thescenepool.com 6000
Relay4.thescenepool.com 6000

All running 1.35.0

Note my BP will not produce blocks due to low delegation but hopefully you can see something.

gitmachtl commented 2 years ago

It happened again:

[screenshot]

pascallapointe commented 2 years ago

Alright, I'm reverting my rollback process. I'll keep running Berry on 1.35.0 until told otherwise in this thread.

relay.pasklab.com : 3001 | 3002 | 3003

feqifei commented 2 years ago

One more case: MS9 on v7, LEAF on v6.

[screenshot]

We noticed that the not-adopted blocks are propagated by very few nodes (between 50 and 60). We suspect that those 50-60 nodes are the ones that have already moved to 1.35.x, but we have no proof.

Maybe @papacarp can give us more information about those 53 nodes that report block hash 5ab7d7741fce303623790f551be495375728a61bb9febf1a2e81dcb1948b2091

feqifei commented 2 years ago

[screenshot]

feqifei commented 2 years ago

[screenshot]

stefiix92 commented 2 years ago

Maybe @papacarp can give us more information about those 53 nodes that report block hash 5ab7d7741fce303623790f551be495375728a61bb9febf1a2e81dcb1948b2091

My BP node's (1.35.0) log for block 5ab7d7741fce303623790f551be495375728a61bb9febf1a2e81dcb1948b2091:

[4eefd458:cardano.node.ChainDB:Notice:220] [2022-07-26 12:13:55.14 UTC] Chain extended, new tip: 5ab7d7741fce303623790f551be495375728a61bb9febf1a2e81dcb1948b2091 at slot 67271343

Then why were other blocks (e.g. 9be22a26e9a650813b37ef457ecec6b7b8ac2493d865f9f5b5ec63ffb0b5f0d7) minted on v7 and confirmed by 290 nodes? Or does this only involve height battles?

feqifei commented 2 years ago

I caught one fish, but I think it's too small. One of my v7 BPs lost a height battle against a v6 BP, but the number of nodes that propagated its block is normal, so I don't think it's the situation we are trying to analyze. By the way, I have logs from the BP and from the 1.35.2 and 1.34.1 relays.

[screenshot]

Edit: @coot, I sent you the logs by email.

karknu commented 2 years ago

If anyone is running a relay with a pre-1.35.x version, it would be interesting to see how that node deals with a v7 block.

68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 is a suspected v7 block that lost out to 80572fdeb5dd826c56ff4cae57ce2d0b17b0fd6aed1cfac692431a558293fe5e.
Grep for 68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 and look for errors. If you are running with TraceChainSyncClient enabled, the TraceDownloadedHeader messages will give you the addresses of the peers that presented your node with that block. Grep for those IP addresses to see what your node does with those peers.
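A sketch of that two-step search over a JSON-per-line log; pulling the peer address out of TraceDownloadedHeader with a regex is an assumption about the message layout, so treat it as a starting point only:

import re

SUSPECT = "68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9"

# "node.json" is a placeholder for your own log path.
with open("node.json") as fh:
    lines = fh.readlines()

# Step 1: collect peer addresses from TraceDownloadedHeader entries for the block.
peers = set()
for line in lines:
    if SUSPECT in line and "TraceDownloadedHeader" in line:
        peers.update(re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", line))

# Step 2: print everything the node logged about those peers.
for line in lines:
    if any(peer in line for peer in peers):
        print(line.rstrip())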

JaredCorduan commented 2 years ago

If anyone is able to give me the CBOR bytes (hex would be nice) of one of these v7 blocks that loses such a height battle, I can check whether there is any incompatibility in the serialization between v6 and v7.

feqifei commented 2 years ago

If anyone is running a relay with a pre-1.35.x version, it would be interesting to see how that node deals with a v7 block.

68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 is a suspected v7 block that lost out to 80572fdeb5dd826c56ff4cae57ce2d0b17b0fd6aed1cfac692431a558293fe5e. Grep for 68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 and look for errors. If you are running with TraceChainSyncClient enabled, the TraceDownloadedHeader messages will give you the addresses of the peers that presented your node with that block. Grep for those IP addresses to see what your node does with those peers.

I see the 68c39 block in my two 1.35.2 relay logs but not in my two 1.34.1 relay logs.

karknu commented 2 years ago

If anyone is running a relay with a pre-1.35.x version, it would be interesting to see how that node deals with a v7 block. 68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 is a suspected v7 block that lost out to 80572fdeb5dd826c56ff4cae57ce2d0b17b0fd6aed1cfac692431a558293fe5e. Grep for 68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 and look for errors. If you are running with TraceChainSyncClient enabled, the TraceDownloadedHeader messages will give you the addresses of the peers that presented your node with that block. Grep for those IP addresses to see what your node does with those peers.

I see the 68c39 block in my two 1.35.2 relay logs but not in my two 1.34.1 relay logs.

Interesting. Do you have the ChainDB and ChainSync client tracers enabled on the 1.34.1 relays?

  "TraceChainDb": true,
  "TraceChainSyncClient": true,
feqifei commented 2 years ago

@karknu ChainDB true, ChainSyncClient false. Both 1.34.1 nodes were busy checking some DNS subscriptions; they didn't receive the block. 42 seconds later both received the competing block 80572...

gitmachtl commented 2 years ago

We noticed that the not-adopted blocks are propagated by very few nodes (between 50 and 60). We suspect that those 50-60 nodes are the ones that have already moved to 1.35.x, but we have no proof.

Yeah, that's because it's the number of nodes reporting to pooltool that are running 1.35.*, so this signals that only the other v7 nodes are getting/reporting the block.

feqifei commented 2 years ago

@karknu About your example: I think the reason why the 1.34.1 nodes didn't receive information about block 68c39e855361e637ee76862d59dc04eb190398ae33cae7cbcd1b6f6cd921fac9 is that they had already marked its predecessor 3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437, forged by another v7 BP, as invalid.

[screenshot]

So I suppose the 1.34.1 node had already discarded that fork and didn't consider it. More interesting is the error when 1.34.1 decided to consider 3677... invalid:

{"host":"ff01-rl0","pid":"448636","loc":null,"at":"2022-07-26T12:25:28.73Z","ns":["cardano.node.ChainDB"],"sev":"Error","env":"1.34.1:73f9a","data":{"kind":"TraceAddBlockEvent.AddBlockValidation.InvalidBlock","error":"ExtValidationErrorLedger (HardForkLedgerErrorFromEra S (S (S (S (Z (WrapLedgerErr {unwrapLedgerErr = BBodyError (BlockTransitionError [ShelleyInAlonzoPredFail (LedgersFailure (LedgerFailure (UtxowFailure (WrappedShelleyEraFailure (UtxoFailure (FeeTooSmallUTxO (Coin 1006053) (Coin 1001829))))))),ShelleyInAlonzoPredFail (LedgersFailure (LedgerFailure (UtxowFailure (WrappedShelleyEraFailure (UtxoFailure (FeeTooSmallUTxO (Coin 1011641) (Coin 1007417)))))))])}))))))","block":{"hash":"3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437","kind":"Point","slot":67272037}},"msg":"","thread":"406","app":[]}

Instead, the 1.35.2 node considered this block (or at least its header) valid:

{"app":[],"at":"2022-07-26T12:25:28.24Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437@67272037"},"env":"1.35.2:7612a","host":"gv01-rl0","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"2554846","sev":"Notice","thread":"400"}
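One detail worth noting from the InvalidBlock trace above: in both failing transactions the two fee values reported by FeeTooSmallUTxO differ by exactly the same amount, which looks more like a systematic difference in how the fee is computed than like random corruption:

# Fee values (in lovelace) taken from the FeeTooSmallUTxO errors quoted above.
print(1006053 - 1001829)  # 4224
print(1011641 - 1007417)  # 4224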

JaredCorduan commented 2 years ago

@feqifei do you have a way of getting the serialized block corresponding to 3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437? That would allow me to investigate whether there is a difference in the fee calculation between 1.34 and 1.35.

feqifei commented 2 years ago

@feqifei do you have a way of getting the serialized block corresponding to 3677ae1bb3697402f443a737a18eab5194c66f2efaec1ff2c2bc0122413d9437? That would allow me to investigate whether there is a difference in the fee calculation between 1.34 and 1.35.

Hi Jared, I don't know how to satisfy your request, but consider that this block was made by another BP (not mine) and it didn't make it onto the chain. So even if I had the skills to answer your question (and I do not :) ), I really don't know how to collect this info.