IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.

[BUG] - 1.35.* v7 nodes on Mainnet are picking up the wrong parent-block - Vasil HF Blocker? #4228

Closed · gitmachtl closed this 2 years ago

gitmachtl commented 2 years ago

Strange Issue

So, I will post a bunch of pics here, screenshots taken from pooltool.io. Block flow is from bottom to top. These are only a few examples; there are many more!

This is happening on mainnet right now with nodes running 1.35.*. There are many occasions with double and triple height battles where newer v7 1.35.* nodes pick up the wrong parent block and try to build on another v7 block. So v6 1.34.* nodes are winning those all the time.

I personally had the issue of losing 10 height battles within a 1-2 day window against v6 nodes. That was 100% of all the height battles I had.

It's always the same pattern: there is a height battle that a v7 node loses against a v6 node. If two nodes are also scheduled for the next block and one of them is a v7 node, it picks up the hash of the losing block from the previous v7 node and builds on it. Of course it loses against the other v6 node, which is building on the correct block. But as you can see in the example below, this can span multiple slot heights/blocks āš ļø

This is a Vasil HF blocker IMO, because it would lead to a situation where SPOs only upgrade to 1.35.* at the last possible moment before the HF, giving those staying on 1.34.1 an advantage. Not a good idea; it must be sorted out beforehand. QA team, please start an investigation on this asap, thx! šŸ™

[Several pooltool.io screenshots illustrating these height battles]

Here is a nightmare one, v7 built on top of another v7 node (green path):

[Screenshot]

JaredCorduan commented 2 years ago

Ok, no problem @feqifei, your report above is already very helpful; it gives me a new lead!

papacarp commented 2 years ago

I have now correlated the reporter versions with the battles. This is not on the public website yet, but I captured one during development that I think tells us what we need to know. The reporters of the orphaned blocks are generally on 1.35. Note that eventually all nodes will report the winning block, because they have to. But if you watch this in real time you'll see that initially no 1.34 nodes report the 1.35 block.

Next step is to get the block protocol versions into this, now that cncli has been upgraded to report that info as well, so we will know whether the orphaned block (as well as the chained block) was produced by a 1.35 node.

[Screenshot: reporter node versions correlated with a height battle]

gitmachtl commented 2 years ago

FeeTooSmall issue!? Hmm... so maybe a complex thing with smart contracts involved? And the v7 blocks that are getting accepted normally by v6 nodes don't have smart contracts in them, or happen to have high enough fees set?

JaredCorduan commented 2 years ago

> FeeTooSmall issue!? Hmm... so maybe a complex thing with smart contracts involved? And the v7 blocks that are getting accepted normally by v6 nodes don't have smart contracts in them, or happen to have high enough fees set?

I strongly suspect the bug is something like that :point_up:. I'm investigating this now, just comparing the code. A reproducible case would be the most helpful thing right now, i.e. a serialized v7 block that lost one of these height battles.

gitmachtl commented 2 years ago

Yep, but that's hard to capture and store for a "normal SPO", I guess.

gitmachtl commented 2 years ago

@JaredCorduan could it be the utxoCostPerByte conversion? Because it's now 4310 lovelaces per byte, that's 34480 lovelaces per word. But the current parameter on mainnet is set to 34482 lovelaces. So 1.35.* nodes are fine with 34480, but 1.34.* nodes are not?
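
For reference, a back-of-the-envelope sketch of that arithmetic (a word is 8 bytes; the names below are just illustrative, not node code):

```haskell
-- Back-of-the-envelope check of the utxoCostPerByte / utxoCostPerWord hypothesis.
utxoCostPerByte, impliedCostPerWord, mainnetCostPerWord :: Integer
utxoCostPerByte    = 4310                 -- per-byte value mentioned above
impliedCostPerWord = 8 * utxoCostPerByte  -- 34480, since one word is 8 bytes
mainnetCostPerWord = 34482                -- current mainnet protocol parameter

main :: IO ()
main = print (impliedCostPerWord, mainnetCostPerWord)  -- (34480,34482): off by 2
```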

JaredCorduan commented 2 years ago

@gitmachtl I don't think so, that only kicks in with the Babbage era, and we are seeing this on mainnet (Alonzo).

gitmachtl commented 2 years ago

The calculation is based on bytes in Babbage, but maybe 1.35.0/1.35.1 nodes are using it internally with the *8 conversion to get to the utxoCostPerWord value in the Alonzo era as well? That would be an easy test to do: IOG could simply do a parameter update on mainnet to a utxoCostPerWord value of 34480 right now.

Eztero commented 2 years ago

> One more favour to ask: those of you who are running a 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.

I have both the producer and the relay on version 1.35.0.

relay: 51.222.156.238:3001

karknu commented 2 years ago

@JaredCorduan https://gist.github.com/karknu/3ba79778b83f35d43789ff68436b114f is a CBOR of the losing block for the height battle at block number 7547634. It is manually stitched together from packets in a pcap, so it may not be 100% correct.

gitmachtl commented 2 years ago

From Conrad:

relay01.bladepool.com:3001
relay02.bladepool.com:3001

Current setup: BP on 1.35.2, relays on 1.34.1 and 1.35.2 (mixed environment)

gitmachtl commented 2 years ago

> @gitmachtl I don't think so, that only kicks in with the Babbage era, and we are seeing this on mainnet (Alonzo).

Maybe some strange UTF encoding/decoding issue in metadata and the like, causing differences in the fee calculation?

papacarp commented 2 years ago

> @papacarp I note that pooltool.io/realtime is showing a surprisingly small number of nodes for certain blocks (i.e. less than ten) - this seems unusual.
>
> Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred when the superseded block had been received by the about-to-forge node? My working hypothesis would be some sort of connectivity partition; I'm looking for evidence that that assumption is incorrect.

I watched a block get 50 reports and then, about 10 seconds later, a new block came in and orphaned the first block. It seems like the 1.34 nodes never got the 1.35 block, so we only got reports from the 1.35 nodes. Then once a new block came in, the 1.34 nodes moved the chain forward.

I think a network partition is the right angle to explore. 1.34 nodes stop talking to 1.35 nodes.

gitmachtl commented 2 years ago

So the scenario is: a 1.35.* node produces a block; for some reason a 1.34.* (or below) node sees this block as invalid, rejects it, and in addition terminates the connection to the 1.35.* node. Only other 1.35.* nodes see the produced block as valid and continue to build on it. The 1.35.* nodes get those blocks so fast that they reject the height-battle-winning block from a 1.34.* (or below) node. Then they build their own block and the scenario repeats.

nemo83 commented 2 years ago

> @papacarp I note that pooltool.io/realtime is showing a surprisingly small number of nodes for certain blocks (i.e. less than ten) - this seems unusual. Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred when the superseded block had been received by the about-to-forge node? My working hypothesis would be some sort of connectivity partition; I'm looking for evidence that that assumption is incorrect.
>
> I watched a block get 50 reports and then, about 10 seconds later, a new block came in and orphaned the first block. It seems like the 1.34 nodes never got the 1.35 block, so we only got reports from the 1.35 nodes. Then once a new block came in, the 1.34 nodes moved the chain forward.
>
> I think a network partition is the right angle to explore. 1.34 nodes stop talking to 1.35 nodes.

Hi, I've opened a bug exactly for this reason: https://github.com/input-output-hk/cardano-node/issues/4226

My BP is on 1.34.1 and I've got one relay on 1.35.1. I've seen the BP periodically losing the connection to 1.35.1 and failing to restore it. You can check the logs provided in that ticket for more info. I really hope this helps!

Great work everyone!

I did update to 1.35.2, and also in this case I have seen the BP drop the connection to 1.35.2 and never reinstate it (unless I bump the relay).

JaredCorduan commented 2 years ago

I think I have found the problem. If I'm correct, it is a difference between how 1.34 and 1.35 compute the minimum fee.

There are two functions named minfee with nearly the same signature:

https://github.com/input-output-hk/cardano-ledger/blob/14e1bcc89e275600efb8b66c7cefeebfb1764204/eras/shelley/impl/src/Cardano/Ledger/Shelley/Tx.hs#L586-L597

https://github.com/input-output-hk/cardano-ledger/blob/14e1bcc89e275600efb8b66c7cefeebfb1764204/eras/alonzo/impl/src/Cardano/Ledger/Alonzo/Tx.hs#L322-L340

We refactored our rules post node 1.34, attempting to reduce code duplication, but ended up placing the Shelley calculation in the Alonzo rules:

https://github.com/input-output-hk/cardano-ledger/blob/14e1bcc89e275600efb8b66c7cefeebfb1764204/eras/alonzo/impl/src/Cardano/Ledger/Alonzo/Rules/Utxo.hs#L288

The difference between the two calculations is that the Alonzo one takes the script execution into account. This means that 1.34 nodes accept a (strict) subset of the blocks that 1.35 deems valid. So what we have is a soft fork where the "conservative nodes" (i.e. 1.34) are winning the longest chain.
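
To make the difference concrete, here is a simplified sketch of the two calculations (these are not the actual cardano-ledger definitions, and the parameter values in the example are only illustrative): the Shelley formula prices only transaction size, while the Alonzo one additionally prices the script execution units.

```haskell
-- Simplified sketch; the real definitions are in the cardano-ledger links above.
shelleyMinFee :: Integer -> Integer -> Integer -> Integer
shelleyMinFee feeA feeB txSizeBytes = feeA * txSizeBytes + feeB

alonzoMinFee
  :: Integer -> Integer -> Integer  -- feeA, feeB, tx size in bytes
  -> Rational -> Rational           -- price per memory unit, price per step
  -> Integer -> Integer             -- total ExUnits (memory, steps) used by scripts
  -> Integer
alonzoMinFee feeA feeB txSizeBytes priceMem priceSteps exMem exSteps =
  shelleyMinFee feeA feeB txSizeBytes
    + ceiling (priceMem * fromIntegral exMem + priceSteps * fromIntegral exSteps)

-- A script-bearing transaction whose fee only covers shelleyMinFee is accepted
-- by the buggy 1.35 rules (Shelley formula applied in the Alonzo era) but
-- rejected by 1.34, which also charges for the ExUnits.
main :: IO ()
main = print ( shelleyMinFee 44 155381 3000
             , alonzoMinFee 44 155381 3000 0.0577 0.0000721 500000 200000000 )
```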


In hindsight, we should have had golden tests for the fee calculation in each era (we have them for some eras, but it seems we missed the Alonzo ones). I do not understand yet why our property tests did not catch this, as they seem to be using the correct minfee:

https://github.com/input-output-hk/cardano-ledger/blob/14e1bcc89e275600efb8b66c7cefeebfb1764204/eras/alonzo/test-suite/src/Test/Cardano/Ledger/Alonzo/AlonzoEraGen.hs#L34-L39
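
Purely to illustrate the shape such a golden test could take (this is not the cardano-ledger test suite; the fee function below is a stand-in and the numbers are made up), a minimal tasty/tasty-hunit example would just pin the fee of a fixed, script-bearing transaction to a value recorded once from a known-good run:

```haskell
import Test.Tasty       (defaultMain, testGroup)
import Test.Tasty.HUnit (testCase, (@?=))

-- Stand-in for "the era's minfee applied to a fixed fixture transaction";
-- in a real golden test this would be the ledger's Alonzo minfee.
feeOfFixtureTx :: Integer
feeOfFixtureTx = 44 * 3000 + 155381 + 43270  -- size term + constant + script ExUnits term

main :: IO ()
main = defaultMain $ testGroup "Alonzo fee golden tests"
  [ testCase "minfee of fixture transaction is pinned" $
      feeOfFixtureTx @?= 330651  -- expected value recorded once; any formula change fails loudly
  ]
```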

JaredCorduan commented 2 years ago

~If I am able to push out a quick fix on top of node 1.35.2, would y'all feel confident in your ability to see if the problem is fixed by my patch on the testnet? In the meantime I will think about how best to test this properly.~

Never mind, Babbage is actually using the correct minfee calculation: https://github.com/input-output-hk/cardano-ledger/blob/14e1bcc89e275600efb8b66c7cefeebfb1764204/eras/babbage/impl/src/Cardano/Ledger/Babbage/Rules/Utxo.hs#L32

gitmachtl commented 2 years ago

Thx @JaredCorduan, this sounds reasonable. Glad that I am not crazy, as some said to me... šŸ˜†

nemo83 commented 2 years ago

> ~If I am able to push out a quick fix on top of node 1.35.2, would y'all feel confident in your ability to see if the problem is fixed by my patch on the testnet? In the meantime I will think about how best to test this properly.~
>
> Never mind, Babbage is actually using the correct minfee calculation: https://github.com/input-output-hk/cardano-ledger/blob/14e1bcc89e275600efb8b66c7cefeebfb1764204/eras/babbage/impl/src/Cardano/Ledger/Babbage/Rules/Utxo.hs#L32

What about Alonzo? I know I'm stating the obvious, but 1.35.x is showing problems in Alonzo.

EDIT: Sorry, not sure how I ended up writing "Shelley"; I obviously meant Alonzo.

JaredCorduan commented 2 years ago

> What about shelley? I know I'm stating the obvious, but 1.35.x is showing problems in shelley.

The bug that I've found does not affect Shelley. What makes you think Shelley has problems? I'm not aware of a network still running Shelley...

reqlez commented 2 years ago

If you need more relays to test before I downgrade, PSB: adarelay04.psilobyte.io:3004 (Uruguay), adarelay01.psilobyte.io:3001 (Japan)

I had perfect epochs all month though, and have not seen this issue at all in my case with 1.35.x.

feqifei commented 2 years ago

Great, downgrading to 1.34.1 and waiting for 1.35.3 :)

gitmachtl commented 2 years ago

Let's wait for 1.36.0 šŸ˜† But seriously, IOG should consider upgrading their BPs to a release candidate on mainnet this time and doing some testing before a new release version is announced. Also, we should remove 1.35.0 as a release here on GitHub IMO.

JaredCorduan commented 2 years ago

Thank you @reqlez, that is very kind of you, but please don't wait to downgrade on my/our account. Downgrading to 1.34.1 on mainnet is indeed the best thing for everyone to do at the moment.

reqlez commented 2 years ago

> Thank you @reqlez, that is very kind of you, but please don't wait to downgrade on my/our account. Downgrading to 1.34.1 on mainnet is indeed the best thing for everyone to do at the moment.

I have a day job as well... sorry... I wish I could go 100% Cardano ;-) This is clearly a "just in case" vs. an "emergency" downgrade, and I've been running it for a month now, but I will get to it eventually.

nemo83 commented 2 years ago

> What about shelley? I know I'm stating the obvious, but 1.35.x is showing problems in shelley.
>
> The bug that I've found does not affect Shelley. What makes you think Shelley has problems? I'm not aware of a network still running Shelley...

lool sorry, Alonzo! I meant Alonzo :disappear:

karknu commented 2 years ago

> @papacarp I note that pooltool.io/realtime is showing a surprisingly small number of nodes for certain blocks (i.e. less than ten) - this seems unusual. Does anyone have evidence (i.e. log entries) showing that any of these "height battles" above occurred when the superseded block had been received by the about-to-forge node? My working hypothesis would be some sort of connectivity partition; I'm looking for evidence that that assumption is incorrect.
>
> I watched a block get 50 reports and then, about 10 seconds later, a new block came in and orphaned the first block. It seems like the 1.34 nodes never got the 1.35 block, so we only got reports from the 1.35 nodes. Then once a new block came in, the 1.34 nodes moved the chain forward.
>
> I think a network partition is the right angle to explore. 1.34 nodes stop talking to 1.35 nodes.

That should be easy to test by configuring a 1.34 relay to only have a single 1.35 node as its upstream peer.

JaredCorduan commented 2 years ago

I did write "resolves #2936" in the notes of the ledger PR that I think fixes the bug, but I did not realize that those magic words would close issues cross-repo. I think it's best to wait for more testing before this node issue is closed. Sorry.

JaredCorduan commented 2 years ago

> So what we have is a soft fork where the "conservative nodes" (i.e. 1.34) are winning the longest chain.

I want to clarify that what I said here :point_up: is incorrect. This was actually an accidental hard fork. Blocks produced by nodes with the bug would potentially not be valid according to a 1.34 node. If the bug had instead been that a higher fee was demanded, then it would have been a soft fork, since 1.34 nodes would still validate all the blocks from nodes with the bug.
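
A toy way to see the direction of the fork (purely illustrative, made-up numbers, not ledger code): the buggy 1.35 demanded a lower minimum fee than 1.34, so 1.35 could forge blocks that 1.34 rejects, which is the hard-fork direction.

```haskell
type Lovelace = Integer

-- 1.34 applies the full Alonzo minimum fee (size term plus script execution):
acceptedBy134 :: Lovelace -> Lovelace -> Bool
acceptedBy134 paid alonzoMin = paid >= alonzoMin

-- the buggy 1.35 only applied the Shelley minimum fee (size term only):
forgedBy135Buggy :: Lovelace -> Lovelace -> Bool
forgedBy135Buggy paid shelleyMin = paid >= shelleyMin

main :: IO ()
main = do
  let paid = 300000; shelleyMin = 287381; alonzoMin = 330651
  print (forgedBy135Buggy paid shelleyMin)  -- True:  buggy 1.35 forges a block containing this tx
  print (acceptedBy134    paid alonzoMin)   -- False: 1.34 rejects that block, hence a hard fork
```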

dorin100 commented 2 years ago

@gitmachtl is it ok to close this issue now?

gitmachtl commented 2 years ago

Yes, was resolved with 1.35.3. Thx.