anza-xyz / agave

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://www.anza.xyz/

solBanknStgTx00 and solBanknStgTx01 threads peg the CPU #636

Open bji opened 6 months ago

bji commented 6 months ago

While investigating performance issues during high packet ingestion rates, I noticed that on all validators I have access to (including two on mainnet and one on testnet; versions 1.17.28, 1.17.28, and 1.18.9 respectively), the threads solBanknStgTx00 and solBanknStgTx01 each appear to use very nearly 100% CPU continuously. Here's a simple process listing:

$ ps -eL | grep BanknS
2363290 2364914 ?        11-23:49:41 solBanknStgTx00
2363290 2364915 ?        11-22:12:47 solBanknStgTx01
2363290 2364916 ?        07:15:09 solBanknStgTx02
2363290 2364917 ?        07:26:36 solBanknStgTx03
2363290 2364918 ?        07:15:41 solBanknStgTx04
2363290 2364919 ?        07:25:15 solBanknStgTx05

By my calculation, the other banking threads use approximately 2.5% of the CPU used by those first two threads.

solBanknStgTx00 is the gossip vote ingestion thread, and solBanknStgTx01 is the TPU vote ingestion thread. It is not entirely clear what causes this high CPU usage, although code analysis suggests that there is nothing preventing these threads from looping continuously as long as there are unprocessed packets in their packet storage queues.
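To make that concern concrete, here is a minimal Rust sketch of the spin pattern being described. The names (Packet, banking_loop, process_buffered_packets) and the timeout values are illustrative stand-ins, not the actual agave identifiers:

use std::collections::VecDeque;
use std::sync::mpsc::Receiver;
use std::time::Duration;

// Hypothetical packet type; the real agave code differs.
struct Packet;

fn banking_loop(receiver: Receiver<Packet>, mut buffer: VecDeque<Packet>) {
    loop {
        // When the buffer is non-empty, the receive timeout is effectively zero,
        // so recv_timeout returns almost immediately whether or not new packets
        // have arrived.
        let timeout = if buffer.is_empty() {
            Duration::from_millis(100)
        } else {
            Duration::from_millis(0)
        };
        if let Ok(packet) = receiver.recv_timeout(timeout) {
            buffer.push_back(packet);
        }

        // Outside the leader window this mostly re-examines the same held packets,
        // so the thread keeps looping at ~100% CPU while anything stays buffered.
        process_buffered_packets(&mut buffer);
    }
}

fn process_buffered_packets(_buffer: &mut VecDeque<Packet>) {
    // Placeholder for the consume/forward/hold decision.
}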

In addition, on reading the code, it looks to me like any packet that arrives in the 20 slots leading up to the validator's leader slots will invoke the ForwardAndHold logic, and at that point I cannot see any way for the packet to leave the queue until the Consume phase is reached 20 slots later. In the meantime, it looks to me like the vote TPU thread will continuously re-evaluate the packets it has Forwarded and Held, repeating the forwarding process over and over. I cannot see how this doesn't lead to a spew of duplicate forwards out of the leader for all packets received in the 20 slots leading up to its leader slots.

I do not fully understand the code and may be wrong, but the high CPU usage of these threads, combined with the observations above, suggests that some deeper analysis of the code involved is warranted.

bji commented 6 months ago

I see now where the re-forwarding of packets is prevented. In LatestUnprocessedVotes::get_and_insert_forwardable_packets, only votes for which is_forwarded() returns false are included.

However, the flow of the loop is definitely wasteful: it does an awful lot of work re-evaluating already-forwarded packets in a tight loop. I do not know to what degree this contributes to the high CPU usage.
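For illustration, here is a rough sketch of that guard under assumed names (VotePacket, LatestVotes, and get_forwardable_packets are hypothetical; only the is_forwarded check mirrors the real method). Already-forwarded votes are filtered out, so duplicate forwards are prevented, but every held vote is still walked on each pass of the loop, which is where the repeated work comes from:

use std::sync::{Arc, RwLock};

// Hypothetical sketch; LatestUnprocessedVotes and its internals differ in the real code.
struct VotePacket {
    forwarded: bool,
}

impl VotePacket {
    fn is_forwarded(&self) -> bool {
        self.forwarded
    }
}

struct LatestVotes {
    votes: Vec<Arc<RwLock<VotePacket>>>,
}

impl LatestVotes {
    // Already-forwarded votes are filtered out, so no duplicate forwards are emitted,
    // but the whole container is re-examined on every iteration of the banking loop.
    fn get_forwardable_packets(&self) -> Vec<Arc<RwLock<VotePacket>>> {
        self.votes
            .iter()
            .filter(|vote| !vote.read().unwrap().is_forwarded())
            .cloned()
            .collect()
    }
}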

apfitzge commented 6 months ago

Yeah, looking at the vote threads, they do seem wasteful in their usage. I think we could probably make a few simple changes here to reduce CPU usage in the short term:

  1. The gossip thread never does any actual work; it only receives/buffers packets. However, its receive timeout is still chosen the same way as for the other threads, in PacketReceiver::get_receive_timeout. This should be adjusted so the gossip thread sleeps for a much longer period.
  2. The TpuVote thread could keep track of the unforwarded packet count and only loop/process again if there are actually unforwarded packets. If there are no unforwarded packets, it should also try receiving for a longer period of time (a rough sketch follows this list).
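A rough sketch of what those two changes could look like, using hypothetical names (VoteSource, get_receive_timeout, and the specific durations here are illustrative, not the actual PacketReceiver API):

use std::time::Duration;

// Hypothetical helper; the real PacketReceiver::get_receive_timeout takes different inputs.
enum VoteSource {
    Gossip,
    Tpu,
}

fn get_receive_timeout(source: &VoteSource, buffered: usize, unforwarded: usize) -> Duration {
    match source {
        // (1) The gossip vote thread only receives and buffers, so it can sleep much
        //     longer between receive attempts regardless of what it has buffered.
        VoteSource::Gossip => Duration::from_millis(1000),
        // (2) The TPU vote thread only needs a near-zero timeout while it actually has
        //     unforwarded packets to work on; otherwise it can also wait longer.
        VoteSource::Tpu => {
            if buffered > 0 && unforwarded > 0 {
                Duration::from_millis(0)
            } else {
                Duration::from_millis(100)
            }
        }
    }
}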

Long-term, I think we may be able to consolidate the two threads into a single thread, and also remove the holding period before gossip votes are released to the thread.

steviez commented 6 months ago

FWIW - I think this issue is tied to the old scheduler (default in v1.17 and v1.18); the new scheduler (default on master) looks to have much lower usage when the validator is not near its leader slots.

Above statement is incorrect - see https://github.com/anza-xyz/agave/issues/636#issuecomment-2048195818

apfitzge commented 6 months ago

New scheduler doesn't change the vote threads though, so I'd not expect new scheduler to have any effect here.

steviez commented 6 months ago

New scheduler doesn't change the vote threads though, so I'd not expect new scheduler to have any effect here.

I went back and revisited the data that led me to make this claim, and it looks like the time window I sampled was "abnormal" in that it had lower utilization on these threads. The output below shows thread name (comm) and cumulative scheduled time divided by cumulative process time (pcpu).

Tip-of-master with default (new scheduler)

$ ps -p 3753527 -T -o comm,pcpu | grep BanknStg
solBanknStgTx00 99.6
solBanknStgTx01 99.3

Tip-of-master with old scheduler (--block-production-method thread-local-multi-iterator):

$ ps -p 953856 -T -o comm,pcpu | grep Bankn
solBanknStgTx00 99.6
solBanknStgTx01 99.4
solBanknStgTx02  0.4
solBanknStgTx03  0.4
solBanknStgTx04  0.4
solBanknStgTx05  0.5

So yeah, my mistake, and agreed with you there.

apfitzge commented 6 months ago

#743 is related to this; it kept skipping on a length check.