anza-xyz / agave

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://www.anza.xyz/

Vote txs per block are less than 1 vote per validator #1851

Open · AshwinSekar opened 2 months ago

AshwinSekar commented 2 months ago

Problem

Given a cluster of fully performant validators, we would expect there to be 1 vote per validator in each block. This is not what we observe in practice. We also observe fewer votes in current epochs than in epochs prior to 578. Similarly, there is a large discrepancy in landed vote transactions across each leader slot.

Analysis

A sample of 10k slots from epochs 577 and 628 shows that there are fewer vote txs per block:

|                       | Epoch 577 (249303455 to 249313455) | Epoch 628 (271289183 to 271299183) |
|-----------------------|------------------------------------|------------------------------------|
| Total vote txs        | 9325696                            | 8703855                            |
| Avg vote txs per slot | 932.5696                           | 870.3855                           |

Interestingly, we see that in 628 there are more vote transactions landing in the first leader slot, while there are fewer for the second and third leader slots. There is a sharp decline in the 4th leader slot that is consistent for both 577 and 628:

[chart]

Breaking this down by latency, we see that there are fewer latency 1 votes (votes for the immediately previous slot) in general. The 4th leader slot has very few latency 1 votes in comparison:

[charts]

One possible explanation for the 4th leader slot having so few latency 1 votes could be the vote TPU sending logic, which selects the leader at a 2-slot offset to forward to: https://github.com/anza-xyz/agave/blob/5263c9d61f3af060ac995956120bef11c1bbf182/sdk/program/src/clock.rs#L141 This implies that a vote for the 3rd leader slot can only land in the 4th leader slot through gossip.
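To make the offset arithmetic concrete, here is a minimal sketch (the constant names mirror the linked clock.rs, but the helper and the example slot numbering are purely illustrative):

```rust
// Constants as defined in sdk/program/src/clock.rs.
const FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET: u64 = 2;
const NUM_CONSECUTIVE_LEADER_SLOTS: u64 = 4;

/// Hypothetical helper: the slot whose leader receives the vote for `vote_slot` over TPU.
fn tpu_target_slot(vote_slot: u64) -> u64 {
    vote_slot + FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET
}

fn main() {
    // Suppose the current leader owns slots 0..=3 and the next leader owns 4..=7.
    // A vote for the 3rd leader slot (slot 2) targets the leader of slot 4,
    // i.e. the *next* leader, so it can only appear in the 4th leader slot
    // (slot 3) if it arrives there via gossip.
    let vote_slot = 2;
    assert_eq!(tpu_target_slot(vote_slot), NUM_CONSECUTIVE_LEADER_SLOTS);
}
```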

Another explanation for overall poor tx inclusion is forks. If the next leader chooses to build off a different parent, votes for the previous leader's slots will fail the slot hashes check. Since we've modified banking stage to hold only the latest vote per validator, an earlier vote for a parent that has not yet been included in this fork has no chance to land. Also, since we only send the vote tx to the TPU of the slot + 2 leader, these votes can only land through gossip for the next leader that decides to build off of the main fork.
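A simplified model of that slot hashes check (an illustrative sketch, not the actual vote program logic):

```rust
// Stand-in types for illustration; the real code uses the SlotHashes sysvar.
type Slot = u64;
type Hash = [u8; 32];

/// A vote only lands on a fork whose slot-hashes history contains the voted
/// (slot, hash) pair, i.e. the voted slot is an ancestor of this fork.
fn vote_lands_on_fork(vote_slot: Slot, vote_hash: Hash, slot_hashes: &[(Slot, Hash)]) -> bool {
    slot_hashes
        .iter()
        .any(|(slot, hash)| *slot == vote_slot && *hash == vote_hash)
}

fn main() {
    let fork_hashes: Vec<(Slot, Hash)> = vec![(100, [1; 32]), (101, [2; 32])];
    // A vote for slot 100 with the hash this fork knows about lands...
    assert!(vote_lands_on_fork(100, [1; 32], &fork_hashes));
    // ...but a vote for the abandoned fork's version of slot 100 fails the check.
    assert!(!vote_lands_on_fork(100, [9; 32], &fork_hashes));
}
```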

We can see this in action here: slots 271409560-271409563 were a minor fork, meaning any votes for slots 271409558-271409559 could only have landed through gossip in slot 271409564.

[chart]

Solutions

Another possibility is that replay is not keeping up and vote transactions are not being sent in time for inclusion. I will follow up with some more individual vote timing metrics.

AshwinSekar commented 2 months ago

Restricting the sample to only rooted blocks improves the vote numbers slightly, but not by much:

[screenshot]

AshwinSekar commented 2 months ago

To analyze replay, we can look at the ~2050 validators that report metrics to see when the vote tx for slot S was created. For the purpose of this example, we consider that the vote tx for slot S will land in S + 1 if it was created before the my_leader_slot metric for S + 2, minus 100 ms to account for latency. Here are results from the previous small sample range in epoch 628:

[screenshot]

This does not line up with the # of latency 1 votes scraped from the ledger for the same range:

[screenshot]

Some slot ranges, such as 53-55, show a drastic difference between the votes that were expected to land and those that actually landed.

NOTE: the replay results for the 4th leader slot will be skewed due to the TPU issue mentioned previously.
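A minimal sketch of the heuristic above (names and the flat timestamps are hypothetical; the real data comes from the reported metrics):

```rust
// Latency allowance between vote creation and arrival at the leader.
const NETWORK_LATENCY_MS: u64 = 100;

/// `vote_created_ms`: when the vote tx for slot S was created.
/// `leader_slot_s_plus_2_ms`: the `my_leader_slot` timestamp for S + 2.
/// The vote is counted as able to land in S + 1 if it was created at least
/// 100 ms before the leader of S + 2 began its slot.
fn can_land_with_latency_1(vote_created_ms: u64, leader_slot_s_plus_2_ms: u64) -> bool {
    vote_created_ms + NETWORK_LATENCY_MS <= leader_slot_s_plus_2_ms
}

fn main() {
    // Created 150 ms before the S + 2 leader slot began: counted as landing in S + 1.
    assert!(can_land_with_latency_1(1_000, 1_150));
    // Created only 50 ms before: too late once latency is accounted for.
    assert!(!can_land_with_latency_1(1_100, 1_150));
}
```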

AshwinSekar commented 2 months ago

Not too much vote deduplication during this range:

[screenshot]

AshwinSekar commented 2 months ago

One possible explanation for the 4th leader slot having so few latency 1 votes could be the vote TPU sending logic, which selects the leader at a 2-slot offset to forward to:

https://github.com/anza-xyz/agave/blob/5263c9d61f3af060ac995956120bef11c1bbf182/sdk/program/src/clock.rs#L141

This implies that a vote for the 3rd leader slot can only land in the 4th leader slot through gossip.

When testing this out in practice, it seems that the vote can be sent to the leader at a 3-slot offset or more. next_leader uses the poh_recorder, which is based on the last reset bank and is not necessarily in sync with the vote_bank. Adding TPU logging to local_cluster::test_spend_and_verify_all_nodes_3 (a 3-node cluster with no intentional forking), we see this is the case:

| Leader Slot | Total votes | # of votes sent to leader + 2 | # of votes sent to leader + 3 | # of votes sent to a leader that could not land it in latency 1 |
|---|---|---|---|---|
| 1 | 58 | 38 | 20 | |
| 2 | 56 | 40 | 16 | 16 |
| 3 | 58 | 41 | 17 | 58 |
| 4 | 58 | 36 | 22 | |

In the presence of forks, the reset bank could be on a completely different fork than the vote bank, causing even poorer inclusion.
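A sketch of where the variable offsets come from, assuming the target slot is derived from the PoH/reset slot rather than the vote slot (hypothetical helper, not the actual sending code):

```rust
/// Offset between the slot whose leader we target and the slot we voted on.
/// Because the target is poh_slot + 2, the effective offset relative to the
/// vote slot is only 2 when PoH and replay agree.
fn effective_offset(vote_slot: u64, poh_slot: u64) -> i64 {
    const OFFSET: i64 = 2; // FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET
    poh_slot as i64 + OFFSET - vote_slot as i64
}

fn main() {
    // PoH already reset onto the slot after the vote slot: the vote targets
    // the leader 3 slots past the vote slot.
    assert_eq!(effective_offset(100, 101), 3);
    // PoH lagging behind on a different fork can even target a leader
    // *behind* the vote slot.
    assert_eq!(effective_offset(100, 92), -6);
}
```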

AshwinSekar commented 2 months ago

Ran a small patch against mainnet (on a non-voting validator) to log which slot's leader we would send each vote tx to; we see a larger range of desync between the slot selected through the poh_recorder and the vote_slot. The columns are (slot whose leader we sent the vote to) - vote_slot:

| Leader Slot | -6 | -5 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | | 12 | 17 | 2 | 3 | 698 | 8595 | 2276 | 711 | 581 | 191 | 82 | 21 | 5 | 3 | 2 | 2 | 13202 |
| 2 | | 2 | 9 | 17 | 2 | 28 | 4481 | 8998 | 81 | 18 | 18 | 3 | | 1 | | | | | 13658 |
| 3 | | | 1 | 2 | | 2 | 5493 | 8153 | 32 | 3 | | | | | | | | | 13686 |
| 4 | | | | | | | 5423 | 8243 | 23 | 3 | | | | | | | | | 13692 |
| Total | 1 | 2 | 22 | 36 | 4 | 33 | 16095 | 33989 | 2412 | 735 | 599 | 194 | 82 | 22 | 5 | 3 | 2 | 2 | 54238 |

Note that there is more variability for the earlier leader slots, and it seems the reset bank converges during the final leader slot. This also gives us a smaller # of vote txs that could land with latency 1:

| Leader Slot | # vote txs not sent to the leader of the next slot | # vote txs sent to the leader of the next slot | % of votes sent to the wrong leader |
|---|---|---|---|
| 1 | 3904 | 9298 | 29.57% |
| 2 | 9130 | 4528 | 66.84% |
| 3 | 13681 | 5 | 99.96% |
| 4 | 3 | 13689 | 0.02% |
| Total | 26718 | 27520 | 49.26% |

This means that of the ~55k slots voted on, 49% of the votes were sent to a leader such that the vote could not land in the next slot without the assistance of forwarding or gossip.

This could just mean that replay is not able to keep up half of the time. Will follow up with more replay metrics.

AshwinSekar commented 3 weeks ago

Linking https://github.com/anza-xyz/agave/pull/2607 (send to poh_slot + 1 and poh_slot + 2)

Also, https://github.com/anza-xyz/agave/pull/2605 fixes a bug in retryable vote packets, which will improve inclusion.

Edit: the efforts in https://github.com/anza-xyz/agave/issues/2183 should also slightly improve inclusion during forks.

StaRkeSolanaValidator commented 3 weeks ago

Hi @AshwinSekar. Is there any plan to backfill vote txs after we fail over to the heaviest fork? I understand that would increase vote inclusion as well, even though I'm not sure if that would help consensus. Thanks!

AshwinSekar commented 3 weeks ago

It doesn't increase vote inclusion in this context, as you can't retroactively add votes to blocks that have already been produced. It is risky to backfill, as you are artificially increasing your lockout on whatever fork you choose to backfill. If for whatever reason you need to switch off this fork, you will have to wait longer.
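For intuition on why backfilling is risky, here is a sketch of the standard Tower BFT lockout doubling (simplified; mirrors the INITIAL_LOCKOUT rule in the vote program):

```rust
// Base lockout applied to a fresh vote.
const INITIAL_LOCKOUT: u64 = 2;

/// Number of slots a vote with `confirmation_count` confirmations locks the
/// validator out from voting on a conflicting fork.
fn lockout_slots(confirmation_count: u32) -> u64 {
    INITIAL_LOCKOUT.pow(confirmation_count)
}

fn main() {
    // A fresh vote locks out 2 slots; each vote stacked on top doubles it.
    // Backfilling three extra votes turns a 2-slot lockout into 16 slots
    // during which the validator cannot switch forks.
    assert_eq!(lockout_slots(1), 2);
    assert_eq!(lockout_slots(4), 16);
}
```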

AshwinSekar commented 1 week ago

#2607 and #2605 are present in v2.0.7, which has 63% adoption on testnet in epoch 683. Here's a comparison with prior epochs 671 & 672 (v2.0.4).

Note: These numbers are for testnet and should not be compared to the mainnet graphs above. They also include vote transactions from Firedancer, which has approximately 27% of stake:

[screenshots]

Epoch 671

| Leader Slot | Avg Votes | Avg Latency 1 | Avg Latency 2 | Avg Latency 3 | Avg > Latency 3 |
|---|---|---|---|---|---|
| 1 | 1,248.66 | 601.20 | 543.42 | 4.51 | 99.52 |
| 2 | 632.89 | 550.98 | 42.31 | 16.32 | 23.29 |
| 3 | 682.71 | 519.40 | 147.90 | 2.91 | 12.50 |
| 4 | 209.01 | 31.33 | 145.77 | 8.77 | 23.14 |

Epoch 672

| Leader Slot | Avg Votes | Avg Latency 1 | Avg Latency 2 | Avg Latency 3 | Avg > Latency 3 |
|---|---|---|---|---|---|
| 1 | 1,245.18 | 603.35 | 541.97 | 3.95 | 95.91 |
| 2 | 615.44 | 552.21 | 38.76 | 5.72 | 18.75 |
| 3 | 680.63 | 520.67 | 147.24 | 2.47 | 10.24 |
| 4 | 207.94 | 30.78 | 145.57 | 8.71 | 22.89 |

Epoch 683

| Leader Slot | Avg Votes | Avg Latency 1 | Avg Latency 2 | Avg Latency 3 | Avg > Latency 3 |
|---|---|---|---|---|---|
| 1 | 959.76 | 596.19 | 260.96 | 3.68 | 98.92 |
| 2 | 556.30 | 525.29 | 18.80 | 3.13 | 9.08 |
| 3 | 679.25 | 503.99 | 165.15 | 2.90 | 7.21 |
| 4 | 596.80 | 382.95 | 191.43 | 6.75 | 15.67 |

We have a huge increase in votes (and latency 1 votes specifically) for the 4th leader slot. I believe this can be attributed to #2607 sending to poh_slot + 1 🎉.