confluentinc / librdkafka

The Apache Kafka C/C++ library

Producer CPU usage #3692

Open larry-cdn77 opened 2 years ago

larry-cdn77 commented 2 years ago

A cluster that is unavailable (e.g. due to a network partition) for 15 minutes or so can put an idempotent producer into a state of excessive CPU usage. It can take several minutes after the cluster restarts for CPU usage to climb, and it tends to come and go in bursts of a few minutes at a time, as the example graph demonstrates.

[Attached graph: CPU usage]

What it needs:

A log up to debug level 6 is attached. I have level 7 logs for searching (200 GB). A graph of debug-message frequency by facility is attached and might be useful. I have also made a gprof profile, which is attached too.

Also attached is a minimal C producer snippet that I use to demonstrate the problem, albeit at a smaller scale (the production setup has 64 producers). It also contains the specific configuration lines. This snippet runs at 1-2% CPU usage on a Silver Xeon core, but at 30-40% after a cluster outage.
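For orientation, a trimmed-down sketch of roughly what such a reproducer looks like (the attached worker.c.txt is the authoritative version; the broker address, topic name and settings below are placeholders rather than my actual configuration):

    /*
     * Sketch of an idempotent-producer reproducer; worker.c.txt is the
     * real one. Broker address and topic name are placeholders.
     */
    #include <stdio.h>
    #include <string.h>
    #include <librdkafka/rdkafka.h>

    int main(void) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();

            rd_kafka_conf_set(conf, "bootstrap.servers", "broker1:9092",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "enable.idempotence", "true",
                              errstr, sizeof(errstr));

            rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                          errstr, sizeof(errstr));
            if (!rk) {
                    fprintf(stderr, "%% producer creation failed: %s\n",
                            errstr);
                    return 1;
            }

            const char *payload = "test";
            for (;;) {
                    rd_kafka_resp_err_t err = rd_kafka_producev(
                            rk,
                            RD_KAFKA_V_TOPIC("test-topic"),
                            RD_KAFKA_V_VALUE((void *)payload,
                                             strlen(payload)),
                            RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
                            RD_KAFKA_V_END);
                    if (err)
                            fprintf(stderr, "%% produce failed: %s\n",
                                    rd_kafka_err2str(err));
                    rd_kafka_poll(rk, 100); /* serve delivery reports */
            }
    }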


Any thoughts would be greatly appreciated

Attachments: worker.c.txt, debug.log.gz, controller.log.gz, gmon.out.gz

[Attached graph: newplot-5]

larry-cdn77 commented 2 years ago

As far as I have been able to tell, the excessive CPU usage is of two kinds.

1. Fast leader queries. Something happens with the metadata for one broker, or with the broker itself, and produce requests continuously fail with NOT_LEADER_FOR_PARTITION. With 320 partitions, quite a few failed produce requests accumulate; I have seen hundreds per second. A failed request checks whether a fast leader query should be made and, presumably due to #3690, always makes one, which turns into a metadata request. Partition leader queries override the mechanism for skipping repeated metadata requests (so that they can respond to cluster events quicker), and therefore a large number of them end up being sent, with a sizeable chunk of CPU time spent processing the replies (each reply covers all partitions).

As a workaround, I lower the rate of fast leader queries by raising the interval, to as much as:

topic.metadata.refresh.fast.interval.ms 15000
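Applied programmatically, the workaround amounts to setting the property on the configuration object before creating the producer; a minimal sketch, with error handling kept to a single message:

    #include <stdio.h>
    #include <librdkafka/rdkafka.h>

    /* Sketch: raise the fast leader-query interval to 15 s on the
     * producer's configuration object before rd_kafka_new(). */
    rd_kafka_conf_t *make_conf(void) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();

            if (rd_kafka_conf_set(conf,
                                  "topic.metadata.refresh.fast.interval.ms",
                                  "15000", errstr, sizeof(errstr)) !=
                RD_KAFKA_CONF_OK)
                    fprintf(stderr, "%% %s\n", errstr);

            return conf;
    }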

I do not yet understand how the broker error sustains or repeats itself, or what it is that seems to happen on the one broker. It is not a transient event, however: in the one case I have detailed logs for, it went on for half an hour until I restarted the producer.

2. Broker wake-ups. Somehow, the same broker thread that creates the metadata storm in the previous point also repeatedly enters timeout scanning. The message rate is non-trivial and each scan finds a timed-out message. With idempotence, the broker serve loop is then supposed to drain in-flight produce requests and issue a wake-up. There end up being so many wake-ups that as much as 30% of CPU is consumed sort-inserting them into the ops queue. I am seeing close to 5 x 320 per serve period (is that 1 second?), where 5 is my Kafka cluster size and 320 the partition count. Later, as if caused by the wake-ups themselves, the other broker threads enter cycles of timeout scanning as well, so multiply the wake-up rate by 5 once more.

As a naive mitigation, I enqueue the broker wake-up op out of order, directly at the head of the queue, as if it had a priority higher than FLASH (and in reverse order):

  void rd_kafka_broker_wakeup (rd_kafka_broker_t *rkb) {
          rd_kafka_op_t *rko = rd_kafka_op_new(RD_KAFKA_OP_WAKEUP);
 -        rd_kafka_op_set_prio(rko, RD_KAFKA_PRIO_FLASH);
 -        rd_kafka_q_enq(rkb->rkb_ops, rko);
 +        /* workaround: enqueue directly at the head of the ops queue
 +         * instead of the priority-sorted insert */
 +        rd_kafka_q_enq1(rkb->rkb_ops, rko, rkb->rkb_ops, 1, 1);
          rd_rkb_dbg(rkb, QUEUE, "WAKEUP", "Wake-up");
  }

I have not tested this change thoroughly, particularly not with a consumer. Would separate queues or a binary heap be a good alternative implementation of #1088, removing the need for the sorted insert?
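To illustrate the heap alternative: a sorted-list insert scans O(n) queued ops per enqueue, while a binary heap needs only O(log n) comparisons, so the cost of a burst of wake-ups grows roughly linearly instead of quadratically. A rough sketch of such a heap, illustrative only and not librdkafka's rd_kafka_q_t:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
            int *prio;      /* priorities only; a real queue would hold ops */
            size_t len, cap;
    } prio_heap_t;

    static void heap_push(prio_heap_t *h, int prio) {
            size_t i;
            if (h->len == h->cap) {
                    h->cap = h->cap ? h->cap * 2 : 64;
                    h->prio = realloc(h->prio, h->cap * sizeof(*h->prio));
            }
            i = h->len++;
            h->prio[i] = prio;
            while (i > 0 && h->prio[(i - 1) / 2] > h->prio[i]) { /* sift up */
                    int tmp = h->prio[i];
                    h->prio[i] = h->prio[(i - 1) / 2];
                    h->prio[(i - 1) / 2] = tmp;
                    i = (i - 1) / 2;
            }
    }

    /* Pops the smallest priority value; caller ensures h->len > 0. */
    static int heap_pop(prio_heap_t *h) {
            int top = h->prio[0];
            size_t i = 0;
            h->prio[0] = h->prio[--h->len];
            for (;;) {                                           /* sift down */
                    size_t l = 2 * i + 1, r = l + 1, m = i;
                    if (l < h->len && h->prio[l] < h->prio[m]) m = l;
                    if (r < h->len && h->prio[r] < h->prio[m]) m = r;
                    if (m == i) break;
                    int tmp = h->prio[i];
                    h->prio[i] = h->prio[m];
                    h->prio[m] = tmp;
                    i = m;
            }
            return top;
    }

    int main(void) {
            prio_heap_t h = { NULL, 0, 0 };
            heap_push(&h, 3);
            heap_push(&h, 1);
            heap_push(&h, 2);
            printf("%d\n", heap_pop(&h)); /* prints 1 */
            free(h.prio);
            return 0;
    }

A separate wake-up queue would achieve much the same effect by letting wake-ups bypass the sorted insert entirely.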

Despite raising more questions than answers, I chose to submit this update as the two workarounds mean I can re-deploy my producer in production. I no longer see dramatic CPU spikes.

larry-cdn77 commented 2 years ago

On further investigation and code reading, the broker wake-ups from point 2 above only seem to spike CPU for a few minutes following a cluster shutdown, until the metadata expires and broker-assigned toppars stop seeing the message timeouts that make them generate drain-bump-wakeup events. Let me elaborate on the sequence of events. We are not restarting the cluster yet; we just stop it and wait.

The attached graph shows brokers 1-4 eating through their wake-ups on each connection timeout, while broker 5 just accumulates wake-ups. The cluster stop is at 15:00.

The trouble with 100,000 wake-ups, as I alluded to earlier, is that their sorted insert has effectively quadratic complexity. I appreciate that inserting wake-ups out of order is hacky, and I wonder whether, apart from using separate priority queues or a priority heap, it would work to only issue a wake-up after the entire topic scan in rd_kafka_broker_produce_toppars, rather than for each partition in rd_kafka_toppar_producer_serve (see the sketch below). It would certainly mitigate the extra CPU usage. I suppose one dramatic scenario is a flappy network and sporadically refreshed metadata, each time spiking CPU before expiring.
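A rough sketch of that deferral, purely illustrative (broker_t, partition_serve and broker_wakeup are placeholder names, not librdkafka identifiers):

    /*
     * Illustrative sketch only: issue the wake-up at most once per
     * broker producer-serve pass instead of once per timed-out
     * partition. None of these names are librdkafka identifiers.
     */
    #include <stdio.h>

    typedef struct {
            const char *name;
    } broker_t;

    /* Placeholder: returns 1 if this partition had a message time out
     * and therefore wants the broker woken up. */
    static int partition_serve(broker_t *rkb, int partition) {
            (void)rkb;
            return partition % 64 == 0;   /* arbitrary stand-in condition */
    }

    static void broker_wakeup(broker_t *rkb) {
            printf("wake up broker %s\n", rkb->name);
    }

    static void broker_produce_serve(broker_t *rkb, int partition_cnt) {
            int need_wakeup = 0;
            int i;

            for (i = 0; i < partition_cnt; i++)
                    need_wakeup |= partition_serve(rkb, i); /* no per-partition wake-up */

            if (need_wakeup)
                    broker_wakeup(rkb);   /* single wake-up after the full scan */
    }

    int main(void) {
            broker_t b = { "broker5" };
            broker_produce_serve(&b, 320);
            return 0;
    }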

[Attached graph: wake-ups per broker]

Furthermore, I must be missing something when I imagine that the all-broker wake-up introduced in 54711c3 is unnecessary, because the broker thread state machine is serviced at least every 1000 ms. A word of education would be wholeheartedly welcome 😄

larry-cdn77 commented 2 years ago

Further on broker wake-ups, I attach data from my production machine, where 64 producers run simultaneously. The graph points are second-by-second aggregated counts of the 'timed out' debug messages (only broker number 5, the last one to stop, has those) and the 'Wake-up' debug messages (all 5 brokers). The graphs illustrate how message timeouts lead to excessive broker wake-ups. The cluster stop is at 18:45; the cluster start is only later and thus not visible.

[Attached graph: 'timed out' messages per second]

[Attached graph: 'Wake-up' messages per second]

edenhill commented 2 years ago

Don't know how I missed this issue, but it is pure gold! Great troubleshooting and analysis @larry-cdn77 !

edenhill commented 2 years ago

I just merged a bunch of producer latency fixes to master. Would it be possible for you to try to reproduce this on master?

larry-cdn77 commented 2 years ago

I noticed the recent changes and would like to try them, although I cannot be sure when the next opportunity will be.

Will certainly post any updates here, thanks!

pranavrth commented 3 months ago

I think the issue is resolved with Magnus's changes, but I will wait for you to confirm.