cardano-community / cncli

Apache License 2.0
53 stars 14 forks source link

Incorrect leader_vrf_0 values in chain table? #19

Closed TerminadaPool closed 1 year ago

TerminadaPool commented 1 year ago

Describe the bug Chain table format for leader_vrf_0 values has changed in recent versions and now these values appear to be incorrect as they no longer correlate with the slot battle winner.

To Reproduce Query the chain table for slot battles where the winner had a higher leader_vrf_0 value Eg:

sqlite3 /var/lib/cardano/cncli.db 'select winner.block_number, winner.slot_number, winner.pool_id, winner.hash, winner.leader_vrf_0, loser.leader_vrf_0 from chain winner inner join chain loser on winner.slot_number = loser.slot_number and winner.block_number = loser.block_number and winner.orphaned=0 and loser.orphaned=1 and winner.leader_vrf_0 > loser.leader_vrf_0 order by winner.block_number;'

Note that the winner should have a lower leader_vrf_0 value. The above query should therefore retrieve cases where the slot battle subsequently became a height battle. This happens rarely when another node builds the next block on top of the block that had the higher leader_vrf_0 value because it hadn't received the other block with the lower leader_vrf_0 value yet. Thus what should have been a slot battle becomes a height battle and is subsequently resolved by the longest chain rule. This should happen rarely and can be confirmed by review of log files on a node.

I noticed that the above query returns a lot more instances in more recent times. However, upon reviewing log files all these instances are simple slot battles which should have been resolved in favour of the block with the lower leader_vrf_0 score.

Since around slot 72236516 the leader_vrf_0 value hex string is now much shorter. In recent versions of cncli this shorter hex leader_vrf_0 value no longer appears to predict the slot battle winner.

Expected behavior According to Duncan Coutts' statement at https://github.com/input-output-hk/ouroboros-network/issues/2913#issuecomment-816953407 slot battles should be won by the block that has the lower leader vrf score.

Additional context It looks like the format of the chain table leader_vrf_0 values changed in recent versions of cncli. Are these values still correct?

TerminadaPool commented 1 year ago

@AndrewWestberg would you please describe how the leader vrf value is calculated for a block or how one can independently verify it?

AndrewWestberg commented 1 year ago

@TerminadaPool I'll check into this next week, but I've done a video on the babbage changes here: https://www.youtube.com/watch?v=SGNAsfsVr6Q

TerminadaPool commented 1 year ago

@AndrewWestberg Thanks. I watch all of your excellent videos and I just watched that one again. I wish I could understand Rust and Haskell better to be more helpful.

But maybe I can help with some investigative work. Here is a query to retrieve the last 5 slot battles where the winner unexpectedly had a higher leader_vrf_0 value:

sqlite3 /var/lib/cardano/cncli.db 'select winner.block_number, winner.slot_number, winner.pool_id, winner.hash, winner.leader_vrf_0, loser.leader_vrf_0 from chain winner inner join chain loser on winner.slot_number = loser.slot_number and winner.block_number = loser.block_number and winner.orphaned=0 and loser.orphaned=1 and winner.leader_vrf_0 > loser.leader_vrf_0 order by winner.block_number desc limit 5;'

Here is what my logs show for the latest slot battle:

Sep 30 07:33:03 relay1 cardano-node[995341]: [relay1:cardano.node.ChainDB:Notice:169] [2022-09-29 21:33:03.70 UTC] Chain extended, new tip: 7e509ed514b0a5ca12cbc2efa7b9b2a3c995f3fecd9cd77a3577b3b6edcc89d9 at slot 72920892
Sep 30 07:33:03 relay1 cardano-node[995341]: [relay1:cardano.node.ChainDB:Info:169] [2022-09-29 21:33:03.79 UTC] Block fits onto some fork: 9ae2d5b41718c4f132e59d2450a12455c89ab7b3ece15cd28bd6ac364cd7b742 at slot 72920892

Sep 30 07:33:03 relay1 cardano-node[995341]: [relay1:cardano.node.ChainDB:Notice:169] [2022-09-29 21:33:03.82 UTC] Switched to a fork, new tip: 9ae2d5b41718c4f132e59d2450a12455c89ab7b3ece15cd28bd6ac364cd7b742 at slot 72920892

And the second latest slot battle:

Sep 30 07:26:52 relay1 cardano-node[995341]: [relay1:cardano.node.ChainDB:Notice:169] [2022-09-29 21:26:52.22 UTC] Chain extended, new tip: 42a021dddc771a4b210840e0bf8e4270f1c4fda769a0813e0ea3f72a44fe805c at slot 72920520
Sep 30 07:26:52 relay1 cardano-node[995341]: [relay1:cardano.node.ChainDB:Info:169] [2022-09-29 21:26:52.44 UTC] Block fits onto some fork: 5e636930f202839fcddf0447e4df28a65d590439fb8ea76c2a27652a8e6b3861 at slot 72920520

Sep 30 07:26:52 relay1 cardano-node[995341]: [relay1:cardano.node.ChainDB:Notice:169] [2022-09-29 21:26:52.53 UTC] Switched to a fork, new tip: 5e636930f202839fcddf0447e4df28a65d590439fb8ea76c2a27652a8e6b3861 at slot 72920520

In both cases the second block was received only a few milliseconds (9,22) later and it was immediately preferred despite this second block having a higher leader_vrf_0 value in the cncli database.

Possible explanations:

  1. The leader_vrf_0 values stored in cncli database are not the same as the leader vrf values the node is using for comparison.
  2. The node is not doing a leader vrf comparison and instead preferring the second block received for some other reason. Which would be weird and possibly bad!

Sorry to give you an extra concern to look at when you are busy.

TerminadaPool commented 1 year ago

I looked at logs on my relay3, which is on the other side of the world. It happened to receive the latest slot battle blocks in the opposite order to my relay1. And, it didn't extend its tip with the second block. Notice the difference between the logs below and the first output in my previous comment which logs the same blocks.

Sep 30 07:33:03 relay3 cardano-node[1012155]: [relay3:cardano.node.ChainDB:Notice:306] [2022-09-29 21:33:03.33 UTC] Chain extended, new tip: 9ae2d5b41718c4f132e59d2450a12455c89ab7b3ece15cd28bd6ac364cd7b742 at slot 72920892

Sep 30 07:33:03 relay3 cardano-node[1012155]: [relay3:cardano.node.ChainDB:Info:306] [2022-09-29 21:33:03.92 UTC] Block fits onto some fork: 7e509ed514b0a5ca12cbc2efa7b9b2a3c995f3fecd9cd77a3577b3b6edcc89d9 at slot 72920892

But the tip wasn't extended with that 7e509 block.

Then the chain moves on with next block as usual

Sep 30 07:33:05 relay3 cardano-node[1012155]: [relay3:cardano.node.BlockFetchDecision:Info:316] [2022-09-29 21:33:05.15 UTC] [TraceLabelPeer (ConnectionId {localAddress = 144.126.157.46:42419, remoteAddress = 54.93.82.235:3001}) (Right [At (Block {blockPointSlot = SlotNo 72920894, blockPointHash = a9bf4bca57148425216eed70484083c8c2e8996a5ab>
Sep 30 07:33:05 relay3 cardano-node[1012155]: [relay3:cardano.node.ChainDB:Notice:306] [2022-09-29 21:33:05.30 UTC] Chain extended, new tip: a9bf4bca57148425216eed70484083c8c2e8996a5ab6a9f9d33e4fce7ebf0bad at slot 72920894

Block 7e509... was received 59ms after 9ae2d... but relay3 did not adopt block 7e509... So the order of block receipt doesn't matter. In both relays block 9ae2d was preferred. I must presume that both these nodes preferred block 9ae2d because its leader vrf score was actually lower. Whereas the cncli database shows block 7e509 as having the lower leader_vrf_0 score:

sqlite3 /var/lib/cardano/cncli.db 'select block_number, slot_number, hash, leader_vrf_0, orphaned from chain where slot_number=72920892;'

7821110|72920892|7e509ed514b0a5ca12cbc2efa7b9b2a3c995f3fecd9cd77a3577b3b6edcc89d9|000037a19fc8c185e59e66572121bd8697c254726ccb89c44e151388dc36112e|1
7821110|72920892|9ae2d5b41718c4f132e59d2450a12455c89ab7b3ece15cd28bd6ac364cd7b742|0005141d6304f92bd4a189b79d93dced2b5dfe5daff3af2b8e38cd9f11c59c94|0

Therefore the likely explanation is:

  1. leader_vrf_0 values stored in cncli database are not the same as the correct leader vrf values nodes are using for comparison.
AndrewWestberg commented 1 year ago

@TerminadaPool There is nothing wrong in cncli. I believe you've actually found a bug in cardano-node. It appears that the node is incorrectly using the block_vrf_0 value for determining who wins the slot battle instead of the leader_vrf_0 value.

This is incorrect because it removes the slight advantage that small pools had in slot leadership and instead makes it entirely random.

I queried my cncli db in a similar manner as yours and determined a fairly random pattern of winners when I looked at leader_vrf_0. Out of the 30 most recent slot battles queried, 13 were showing the higher leader_vrf_0 value winning.

When you compare the block_vrf_0 values for all 30, they are perfectly in line with the lower block_vrf_0 winning each time.

AndrewWestberg commented 1 year ago

@TerminadaPool I'll reach out to IOG to see what they want to do about this.

AndrewWestberg commented 1 year ago

@TerminadaPool Please create a github issue here and link this ticket in it. https://github.com/input-output-hk/ouroboros-network/issues

AndrewWestberg commented 1 year ago

Bug confirmed by Jared from IOG. In Alonzo and earlier. the leader_vrf value was being used: https://github.com/input-output-hk/ouroboros-network/blob/9249a70ed9e2365f3963e47cb31b4b1589bca8f6/ouroboros-consensus-shelley/src/Ouroboros/Consensus/Shelley/Protocol/TPraos.hs#L111

In Babbage, the raw vrf value is being passed instead of calculating the leader_vrf from it. https://github.com/input-output-hk/ouroboros-network/blob/9249a70ed9e2365f3963e47cb31b4b1589bca8f6/ouroboros-consensus-shelley/src/Ouroboros/Consensus/Shelley/Protocol/Praos.hs#L141

papacarp commented 1 year ago

watch it in action: https://www.youtube.com/shorts/pXGQQmCneM4

JaredCorduan commented 1 year ago

Now that I've looked into this, I know enough to be able to make an issue myself.

TerminadaPool commented 1 year ago

Apologies to @AndrewWestberg for thinking the bug was in cncli.

AndrewWestberg commented 1 year ago

@TerminadaPool no need to apologize whatsoever. You found a bug. This will forever be a contribution to improving cardano that you have made.

TerminadaPool commented 1 year ago

In case this "bug" was instead a design decision: I pointed out that leader VRF value is not possible to game whereas block VRF can be manipulated by custom software doing transaction selecting.

https://github.com/input-output-hk/ouroboros-network/issues/4051#issuecomment-1264211740

TerminadaPool commented 1 year ago

I have been corrected by @JaredCorduan

the value of the VRF in the block header depends on:

  • the epoch nonce
  • the slot
  • the private VRF key
JaredCorduan commented 1 year ago

I hope that "corrected" reads like "helped" to most folks! None of this is obvious, there is a lot of technical detail here which is not easy to discover, and my hope is that I can empower y'all.

AndrewWestberg commented 1 year ago

Closing this issue on the cncli side for now. Thanks for reporting @TerminadaPool and thanks @JaredCorduan for taking the ball.