confluentinc / librdkafka

The Apache Kafka C/C++ library

Throughput decreases badly as broker count increases. #3626

Open eelyaj opened 2 years ago

eelyaj commented 2 years ago

Description

I have a topic with 100 partitions in my system. I found that with 3 Kafka brokers, librdkafka can send 1,500,000 packets per second to Kafka. But when I increase the broker count from 3 to 20, librdkafka can only send 680,000 packets per second.

I use the rd_kafka_produce_batch API in my producer, with parameters partition=RD_KAFKA_PARTITION_UA, msgflags=RD_KAFKA_MSG_F_COPY, message_cnt=1000.
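For illustration, a minimal sketch of that call pattern. The topic handle setup, payload contents, and error handling here are assumptions for the example, not taken from the reporter's code:

```c
#include <stdio.h>
#include <string.h>
#include <librdkafka/rdkafka.h>

#define BATCH_SIZE 1000   /* message_cnt used in the report above */

/* Enqueue one batch as described: unassigned partition, payloads copied
 * by librdkafka. Returns the number of messages accepted into the queue. */
static int send_batch(rd_kafka_topic_t *rkt, char *payloads[], size_t lens[]) {
        rd_kafka_message_t msgs[BATCH_SIZE];
        memset(msgs, 0, sizeof(msgs));

        for (int i = 0; i < BATCH_SIZE; i++) {
                msgs[i].payload = payloads[i];
                msgs[i].len     = lens[i];
        }

        /* partition = RD_KAFKA_PARTITION_UA lets the configured partitioner
         * pick a partition per message; RD_KAFKA_MSG_F_COPY makes librdkafka
         * copy each payload. */
        int accepted = rd_kafka_produce_batch(rkt, RD_KAFKA_PARTITION_UA,
                                              RD_KAFKA_MSG_F_COPY,
                                              msgs, BATCH_SIZE);

        if (accepted < BATCH_SIZE) {
                for (int i = 0; i < BATCH_SIZE; i++)
                        if (msgs[i].err)
                                fprintf(stderr, "msg %d failed: %s\n", i,
                                        rd_kafka_err2str(msgs[i].err));
        }
        return accepted;
}
```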

The 'top' CPU output looks like this:

PID   USER  PR  NI  VIRT     RES   SHR   S  %CPU  %MEM  TIME+    COMMAND
7435  root  21   1  1880768  1.2g  7744  R  59.7   3.7  9:58.51  KafkaProducer
7453  root  22   2  1880768  1.2g  7744  R  33.0   3.7  5:17.44  Serializer2
7452  root  22   2  1880768  1.2g  7744  R  32.7   3.7  5:16.51  Serializer1
7451  root  22   2  1880768  1.2g  7744  R  31.7   3.7  5:12.27  Serializer0
7018  root  20   0  1880768  1.2g  7744  S  13.2   3.7  1:56.81  rdk:broker15000
7021  root  20   0  1880768  1.2g  7744  S  10.2   3.7  1:39.75  rdk:broker15000
7027  root  20   0  1880768  1.2g  7744  S   9.2   3.7  1:36.97  rdk:broker15000
7454  root  23   3  1880768  1.2g  7744  R   8.6   3.7  1:20.87  UdpDispatch
7025  root  20   0  1880768  1.2g  7744  S   8.3   3.7  1:29.82  rdk:broker15000
7033  root  20   0  1880768  1.2g  7744  S   8.3   3.7  1:36.80  rdk:broker15000
7019  root  20   0  1880768  1.2g  7744  S   7.9   3.7  1:07.05  rdk:broker15000
7030  root  20   0  1880768  1.2g  7744  S   7.9   3.7  1:18.41  rdk:broker15000
7020  root  20   0  1880768  1.2g  7744  R   7.6   3.7  1:13.25  rdk:broker15000
7036  root  20   0  1880768  1.2g  7744  R   7.6   3.7  0:55.21  rdk:broker15000
7024  root  20   0  1880768  1.2g  7744  S   6.9   3.7  0:55.41  rdk:broker15000
7035  root  20   0  1880768  1.2g  7744  S   6.9   3.7  1:02.24  rdk:broker15000
7028  root  20   0  1880768  1.2g  7744  S   6.3   3.7  0:55.50  rdk:broker15000
7034  root  20   0  1880768  1.2g  7744  S   5.6   3.7  0:54.06  rdk:broker15000
7026  root  20   0  1880768  1.2g  7744  S   5.3   3.7  0:42.09  rdk:broker15000

The 'KafkaProducer' thread is where I call the rd_kafka_produce_batch API to send messages to Kafka. I also used 'perf top' to look at the CPU profile of 'KafkaProducer':

Samples: 17K of event 'cycles', Event count (approx.): 2675618158 lost: 0/0
Children  Self  Shared Object  Symbol

The 'write' system calls are at the top of the list.

How to reproduce

Deploy a large number of Kafka brokers; it reproduces every time in my system. I've tried versions 1.1.0 and 1.7.0.

Checklist

Please provide the following information:

eelyaj commented 2 years ago

There are lots of IO event writes in the 'KafkaProducer' thread. I don't know if this is normal.

Thread 16 "KafkaProducer" hit Breakpoint 1, 0x00007ffff6553fb0 in write () from /usr/lib64/libpthread.so.0 (gdb) bt

0 0x00007ffff6553fb0 in write () from /usr/lib64/libpthread.so.0

1 0x00007ffff7da8f3b in rd_kafka_q_io_event (rkq=0x14646e0) at rdkafka_queue.h:324

2 rd_kafka_q_yield (rkq=0x14646e0) at rdkafka_queue.h:363

3 rd_kafka_toppar_enq_msg (rktp=, rkm=) at rdkafka_partition.c:712

4 0x00007ffff7d5038b in rd_kafka_msg_partitioner (rkt=rkt@entry=0x1441800, rkm=rkm@entry=0x2e1f2700, do_lock=do_lock@entry=RD_DONT_LOCK) at rdkafka_msg.c:1282

5 0x00007ffff7d51d3d in rd_kafka_produce_batch (app_rkt=, partition=-1, msgflags=, rkmessages=, message_cnt=) at rdkafka_msg.c:781

edenhill commented 2 years ago

I think this might be a dup of https://github.com/edenhill/librdkafka/issues/3538

I'm working on an improved wakeup mechanism for 1.9.

eelyaj commented 2 years ago

Thanks. Is there any workaround I can use to avoid this issue, such as a config or parameter change? Or maybe I can roll back the librdkafka version in my system; which version should I use?
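Not an answer from the maintainers, but for anyone who wants knobs to experiment with while waiting for the 1.9 wakeup rework: the properties below are standard librdkafka producer settings commonly tuned for throughput. Whether any of them mitigates the wakeup overhead shown in the backtrace above is an assumption, not something confirmed in this thread, and the values are examples only:

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Build a producer config with common batching/queueing knobs.
 * These values are illustrative and NOT a confirmed workaround. */
static rd_kafka_conf_t *make_conf(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        const char *props[][2] = {
                { "linger.ms",                  "50"      }, /* wait longer to fill batches */
                { "batch.num.messages",         "10000"   }, /* larger ProduceRequest batches */
                { "queue.buffering.max.kbytes", "1048576" }, /* bigger local producer queue */
        };

        for (size_t i = 0; i < sizeof(props) / sizeof(props[0]); i++) {
                if (rd_kafka_conf_set(conf, props[i][0], props[i][1],
                                      errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
                        fprintf(stderr, "config %s: %s\n", props[i][0], errstr);
                        rd_kafka_conf_destroy(conf);
                        return NULL;
                }
        }
        return conf;
}
```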

eelyaj commented 2 years ago

Tested with 100 partitions, 3 brokers, 4 vCPUs, 16 GB memory, 60-byte packets.

version  throughput (packets/second)
1.8.2 800,000
1.7.0 800,000
1.6.1 1,000,000
1.5.3 790,000
1.4.4 790,000
1.3.0 760,000
1.2.2 790,000
1.1.0 790,000
0.11.6 1,020,000
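The numbers above come from the reporter's own setup. For context only, a rough sketch of how such an enqueue-rate figure could be measured; the broker address, topic name, and 60-byte payload are placeholders, and it uses rd_kafka_producev() rather than the reporter's rd_kafka_produce_batch() for brevity:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <librdkafka/rdkafka.h>

/* Very rough throughput probe: enqueue 60-byte messages for ~10 seconds
 * and report the local enqueue rate. */
int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        if (rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
                fprintf(stderr, "%s\n", errstr);
                return 1;
        }

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "%s\n", errstr);
                return 1;
        }

        char payload[60];
        memset(payload, 'x', sizeof(payload));

        time_t start = time(NULL);
        long produced = 0;
        while (time(NULL) - start < 10) {
                rd_kafka_resp_err_t err = rd_kafka_producev(
                        rk,
                        RD_KAFKA_V_TOPIC("test-topic"),
                        RD_KAFKA_V_VALUE(payload, sizeof(payload)),
                        RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
                        RD_KAFKA_V_END);
                if (err == RD_KAFKA_RESP_ERR__QUEUE_FULL)
                        rd_kafka_poll(rk, 1);   /* let delivery reports drain */
                else if (!err)
                        produced++;
                rd_kafka_poll(rk, 0);           /* serve callbacks */
        }
        rd_kafka_flush(rk, 10000);
        printf("enqueued ~%ld msgs in 10s (~%ld msgs/s)\n", produced, produced / 10);
        rd_kafka_destroy(rk);
        return 0;
}
```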
anchitj commented 1 month ago

@eelyaj Is this still an issue?

anchitj commented 1 month ago

Closing as the fix is merged already. Feel free to reopen if you still see the issue.