Closed: gitmachtl closed this issue 2 years ago
ok, no problem @feqifei , your report above is already very helpful, it gives me a new lead!
I have now correlated the reporter versions along with the battles. This is not on the public website yet, but I captured one during development that I think tells us what we need to know. The reporters of the orphaned blocks are generally on 1.35. Note that eventually all nodes will report the winning block because they have to. But if you watch this realtime you'll see initially no 1.34 nodes report the 1.35 block.
Next step is to get the block protocol versions into this now that cncli was upgraded to report that info as well. so we will know if the orphan block (as well as the chained block) are produced by a 1.35 node
FeeTooSmall issue!? Hmm... so maybe a complex thing with smart contracts involved? And those v7 blocks that are getting accepted normally by v6 nodes don't have smart contracts in them, or happen to have high enough fees set?
I strongly suspect the bug is something like that :point_up: . I'm investigating this now, just comparing the code. A reproducible example would be the most helpful thing right now, i.e. a serialized v7 block that lost one of these height battles.
Yep, but hard to capture and store for a "normal SPO", I guess.
@JaredCorduan could it be the utxoCostPerByte conversion? Because it's now 4310 lovelaces per byte, that's 34480 lovelaces per word. But the current parameter on mainnet is set to 34482 lovelaces. So 1.35 nodes are fine with 34480, but 1.34 nodes are not?
@gitmachtl I don't think so, that only kicks in with the Babbage era, and we are seeing this on mainnet (alonzo).
The calculation is based on bytes in Babbage, but maybe 1.35.0/1.35.1 nodes are internally using the *8 conversion to get to the utxoCostPerWord value for the Alonzo era too? That would be an easy thing to test: IOG could simply do a parameter update on mainnet to a utxoCostPerWord value of 34480 right now.
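For reference, the conversion in question is just the 8-bytes-per-word arithmetic; a quick sanity check with the values quoted above:

```python
# Sanity check of the utxoCostPerByte / utxoCostPerWord conversion
# discussed above (values as quoted in this thread).

BYTES_PER_WORD = 8

utxo_cost_per_byte = 4310            # 1.35-style per-byte value
utxo_cost_per_word_mainnet = 34482   # current mainnet parameter

derived_per_word = utxo_cost_per_byte * BYTES_PER_WORD
print(derived_per_word)  # 34480, i.e. 2 lovelaces short of the mainnet value
```

So 34482 is not an exact multiple of 8, which is why the two directions of the conversion cannot agree.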
One more favour to ask: those of you who are running a 1.35.x BP, could you share your 1.35.x relay? We'd like to connect a 1.34.1 relay and observe what's going on on our end.
I have the producer and the relay on version 1.35.0
relay: 51.222.156.238 : 3001
@JaredCorduan https://gist.github.com/karknu/3ba79778b83f35d43789ff68436b114f is a CBOR of the losing block from the height battle for block number 7547634. It is manually stitched together from packets in a pcap, so it may not be 100% correct.
From Conrad:
relay01.bladepool.com:3001
relay02.bladepool.com:3001
Current setup: BP: 1.35.2, Relays: 1.34.1 and 1.35.2 (mixed environment)
Maybe some strange UTF-8 encoding/decoding issues for metadata and the like, causing differences in the fee calculation?
@papacarp I note that pooltool.io/realtime is showing a surprisingly low number of reporting nodes for certain blocks (i.e. fewer than ten) - this seems unusual.
Does anyone have evidence (i.e log entries) that show that any of these "height battles" above occurred when the superseded block had been received by the about-to-forge node? My working hypothesis would be some sort of connectivity partition, I'm looking for evidence that assumption is incorrect.
I watched a block get 50 reports and then like 10 seconds later a new block came in and it orphaned the first block. Seems like the 1.34 nodes never got the 1.35 block so we only got reports from the 1.35 nodes. Then once a new block came in the 1.34 nodes moved the chain forward.
I think a network partition is the right angle to explore. 1.34 nodes stop talking to 1.35 nodes.
So the scenario is: a 1.35 node produces a block; for some reason a 1.34 (or below) node sees this block as invalid, rejects it, and in addition terminates the connection to the 1.35 node. Only other 1.35 nodes see the produced block as valid and continue to build on it. The 1.35 nodes receive these blocks so fast that they reject the height-battle-winning block from a 1.34 (or below) node. Then they build their own block and the scenario repeats.
Hi, I've opened a bug exactly for this reason: https://github.com/input-output-hk/cardano-node/issues/4226
My BP is 1.34.1 and I've got 1 relay on 1.35.1. I've seen the BP periodically losing the connection to 1.35.1 and failing to restore it. You can check the logs provided in the ticket for more info. I really hope this helps!
Great work everyone!
I did update to 1.35.2, and also in this case I have experienced the BP dropping the connection to 1.35.2 and never reinstating it (unless I restart the relay).
I think I have found the problem. If I'm correct, it is a difference between how 1.34 and 1.35 compute the minimum fee.
There are two functions named `minfee` with nearly the same signature:
We refactored our rules post node 1.34, attempting to reduce code duplication, but ended up placing the Shelley calculation in the Alonzo rules:
The difference between the two calculations is that the Alonzo one takes the script execution into account. This means that 1.34 nodes accept a (strict) subset of blocks that 1.35 deems valid. So what we have is a soft fork where the "conservative nodes" (ie 1.34) are winning the longest chain.
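To make the difference concrete, here is a rough Python sketch of the two calculations (illustrative only; the real code is the Haskell in cardano-ledger, and all names and numbers below are just for demonstration): the Alonzo minimum fee adds a price for the transaction's script execution units on top of the size-based Shelley fee.

```python
import math

# Illustrative sketch (NOT the cardano-ledger Haskell) of the two
# minimum-fee calculations whose mix-up is described above.

def shelley_minfee(tx_size_bytes: int, min_fee_a: int, min_fee_b: int) -> int:
    # Shelley-era minimum fee: linear in the transaction's size.
    return min_fee_a * tx_size_bytes + min_fee_b

def alonzo_minfee(tx_size_bytes: int, min_fee_a: int, min_fee_b: int,
                  exec_mem: int, exec_steps: int,
                  price_mem: float, price_steps: float) -> int:
    # Alonzo-era minimum fee: the Shelley fee plus the priced cost of the
    # script execution units the transaction claims.
    script_fee = math.ceil(price_mem * exec_mem + price_steps * exec_steps)
    return shelley_minfee(tx_size_bytes, min_fee_a, min_fee_b) + script_fee

# For any transaction that actually runs scripts, the Alonzo minimum is
# strictly higher, so a fee that only satisfies the Shelley formula is
# rejected (FeeTooSmall) by a node applying the Alonzo rules.
shelley_min = shelley_minfee(2000, min_fee_a=44, min_fee_b=155381)
alonzo_min = alonzo_minfee(2000, 44, 155381,
                           exec_mem=1_000_000, exec_steps=500_000_000,
                           price_mem=0.0577, price_steps=0.0000721)
assert alonzo_min > shelley_min
```

The parameter values (44/155381 for minFeeA/minFeeB, 0.0577/0.0000721 for the execution-unit prices) are only there to make the numbers concrete; the exact rounding in the real ledger may differ.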
In hindsight, we should have had golden tests for the fee calculation in each era (we have them for some eras, but it seems we missed the Alonzo ones). I do not understand yet why our property tests did not catch this, as they seem to be using the correct `minfee`:
~If I am able to push out a quick fix on top of node 1.35.2, would y'all feel confident in your ability to see if the problem is fixed by my patch on the testnet? In the meantime I will think about how best to test this properly.~
Never mind, Babbage is actually using the correct `minfee` calculation:
https://github.com/input-output-hk/cardano-ledger/blob/14e1bcc89e275600efb8b66c7cefeebfb1764204/eras/babbage/impl/src/Cardano/Ledger/Babbage/Rules/Utxo.hs#L32
Thx @JaredCorduan, this sounds reasonable. Glad that I am not crazy, as some said to me...
What about Alonzo? I know I'm stating the obvious, but 1.35.x is showing problems in Alonzo.
EDIT: Sorry not sure how I mentioned "Shelly", I obviously meant Alonzo
What about shelley? I know I'm stating the obvious, but 1.35.x is showing problems in shelley.
The bug that I've found does not affect Shelley. What makes you think Shelley has problems? I'm not aware of a network still running Shelley...
If you need more relays to test before I downgrade, PSB: adarelay04.psilobyte.io:3004 (Uruguay), adarelay01.psilobyte.io:3001 (Japan)
I had perfect epochs all month though, and have not found this issue at all in my case with 1.35.x.
Great, downgrading to 1.34.1 and waiting for 1.35.3 :)
Let's wait for 1.36.0. But seriously, IOG should consider upgrading their BPs to a release candidate on mainnet and doing some testing before a new release version is announced. Also, we should remove 1.35.0 as a release here on GitHub, IMO.
thank you @reqlez, that is very kind of you, but please don't wait to downgrade on my/our account. downgrading to 1.34.1 on mainnet is indeed the best thing for everyone to do at the moment.
I have a day job as well... sorry... I wish I could go 100% Cardano ;-) This is clearly a "just in case" rather than an "emergency" downgrade, and I've been running it for a month now, but I will get to it eventually.
lol sorry, Alonzo! I meant Alonzo :disappear:
I think a network partition is the right angle to explore. 1.34 nodes stop talking to 1.35 nodes.
That should be easy to test by configuring a 1.34 relay to only have a single 1.35 node as its upstream peer.
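For anyone wanting to try that, a minimal (pre-P2P) topology.json for the 1.34 relay, pointing at a single 1.35.x upstream, might look like this (hostname taken from the relays shared above; adjust to whichever 1.35.x relay you connect to):

```json
{
  "Producers": [
    {
      "addr": "relay01.bladepool.com",
      "port": 3001,
      "valency": 1
    }
  ]
}
```

With only one upstream peer, any connection drop or refusal to sync from the 1.35.x node becomes immediately visible in the 1.34 relay's logs.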
I did write "resolves #2936" in the ledger PR notes, which I think fixes the bug, but I did not realize that those magic words would close issues cross-repo. I think it's best to wait for more testing before this node issue is closed. Sorry.
So what we have is a soft fork where the "conservative nodes" (ie 1.34) are winning the longest chain.
I want to clarify that what I said here :point_up: is incorrect. This was actually an accidental hard fork. Blocks produced by nodes with the bug would potentially not be valid according to a 1.34 node. If the bug had instead been that a higher fee was demanded, then it would have been a soft fork, since 1.34 nodes would still validate all the blocks from nodes with the bug.
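The distinction can be made concrete by thinking of each node version as accepting a set of blocks. A toy Python sketch (purely illustrative; validity is reduced to a single fee check, and all numbers are invented):

```python
# Toy model: a node version accepts a block iff the block's fee meets
# that version's minimum. Which direction the bug goes determines
# whether the split is a soft fork or a hard fork.

def accepts(min_fee_required: int, block_fee: int) -> bool:
    return block_fee >= min_fee_required

CORRECT_MIN = 100  # what 1.34 (correct Alonzo rules) demands
BUGGY_MIN = 60     # what the buggy 1.35 demands (too low)

# A block forged by a buggy node may pay only its own, lower, minimum:
block_fee = 80
assert accepts(BUGGY_MIN, block_fee)        # 1.35 accepts its own block
assert not accepts(CORRECT_MIN, block_fee)  # 1.34 rejects it: hard fork

# Had the bug demanded a HIGHER fee instead, every block from buggy
# nodes would still satisfy the correct rules: a soft fork.
STRICTER_MIN = 140
stricter_block_fee = 150
assert accepts(STRICTER_MIN, stricter_block_fee)
assert accepts(CORRECT_MIN, stricter_block_fee)  # 1.34 still validates it
```

In other words: a soft fork means the buggy nodes' blocks remain a subset of what correct nodes accept; here the buggy nodes produced blocks outside that set, which is a hard fork.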
@gitmachtl is it ok to close this issue now?
Yes, was resolved with 1.35.3. Thx.
Strange Issue
So, I will post a bunch of pics here, screenshots taken from pooltool.io. Block flow is from bottom to top. These are only a few examples; there are many more!
This is happening on mainnet right now running 1.35.* nodes. There are many occasions with double & triple height battles where newer v7 1.35.* nodes are picking up the wrong parent block and trying to build on another v7 block. So v6 1.34.* nodes are winning those all the time.
I personally had the issue of losing 10 height battles within a 1-2 day window against v6 nodes. That was 100% of all the height battles I had.
It's always the same pattern: there is a height battle that a v7 node loses against a v6 node. If there are also two nodes scheduled for the next block and one of them is a v7 node, it picks up the wrong (lost) block hash from the previous v7 node and builds on it. Of course it loses against the other v6 node, which is building on the correct block. But as you can see in the example below, this can span multiple slot heights/blocks ⚠️
This is a Vasil-HF blocker IMO, because it would lead to the situation where SPOs only upgrade to 1.35.* at the last possible moment before the HF, giving the ones staying on 1.34.1 an advantage. Not a good idea; it must be sorted out before. QA team, please start an investigation on this asap, thx!
Here is a nightmare one, v7 built on top of another v7 node (green path):