Closed wuqingjun closed 5 years ago
Unfortunately those stack traces don't make any sense (in the librdkafka parts); it looks like the wrong symbol file may have been used.
What exact librdkafka version were you using (v0.11.5 is not released yet so that's not a thing)? Did you use one of the librdkafka.redist packages?
@edenhill , I corrected the description by updating the call stacks. Please take a look. Sorry for confusion.
We built the library from master branch, that is why the version number shows 0.11.5.
And I believe we're using librdkafka.redist; I am able to find things like librdkafka.redist.nuspec, librdkafka.redist.props and librdkafka.redist.targets in our project.
This seems to be an intermittent issue. I cannot reproduce it so far, but in our production environment we still see it happening, and the dump file still shows almost the same thing.
What else I can check?
@edenhill , I updated the call stack. Please take a look. Thanks!
Do you set socket.blocking.max.ms? If so, to what?
Btw, if you are installing the librdkafka.redist nuget package you don't need to build librdkafka yourself, it comes prebuilt.
Are the 4 tracebacks above from the 4 threads with high CPU usage? If so, it is kind of weird, because 3 of those threads show that they are in ua_idle, which means they don't really do much.
@edenhill , socket.blocking.max.ms=1. As for the 4 thread call stacks: I don't know their CPU usage; the data I provided above is how long each thread has been running - more than 1 hour. The other threads have mostly been running for less than 1 second. Here is the data:

Thread    Time
11:3be4   0 days 1:15:16.234
12:14e8   0 days 1:06:29.593
13:1e1c   0 days 1:04:30.609
10:2cac   0 days 1:02:32.171
0:40ac    0 days 0:00:13.906
3:2378    0 days 0:00:00.500
2:22a0    0 days 0:00:00.093
7:3cc0    0 days 0:00:00.078
8:40d0    0 days 0:00:00.015
4:4634    0 days 0:00:00.015
15:e4     0 days 0:00:00.000
14:4f10   0 days 0:00:00.000
9:3ea8    0 days 0:00:00.000
6:2af0    0 days 0:00:00.000
5:2940    0 days 0:00:00.000
1:32f8    0 days 0:00:00.000

alldump.txt
Are you setting socket.blocking.max.ms=1?
Are any of the connections to the brokers down?
@edenhill , to your question "Are any of the connections to the brokers down", the answer is yes and no. We encountered this issue multiple times:
The reference #1762 you gave above sounds reasonable to me, because socket.blocking.max.ms is set to a very low value, which may cause the 30 threads to keep spinning. But this happens intermittently: we have thousands of machines, and only around 10 of them get this kind of problem.
But in the second scenario above, socket.blocking.max.ms=1000 and there are only 4 threads, yet this also causes high CPU, again intermittently. It looks like there may be another root cause.
This PR aims to improve on this scenario: https://github.com/edenhill/librdkafka/pull/1870
Hi @edenhill , it seems you are working on the CPU usage problem; let me add the problem on my end.
Please provide the following information:
librdkafka version (release number or git tag): master (downloaded in Oct 2018)
Apache Kafka version: 1.1.1
librdkafka client configuration: queue.buffering.max.ms=1, socket.blocking.max.ms=100
Operating system: openVOS (Linux-like, POSIX-supported)
As you mentioned in #2143, creating producer objects is costly. Still, I am using the same mechanism (create/destroy on each request), and I am checking whether it causes the increasing CPU usage problem.
On top of librdkafka we have our own threads which pick up requests from a queue and produce messages by calling the librdkafka API. Everything seems normal while there are no requests on the queue, but once we put requests on the queue, CPU usage gradually increases with each request, at some point reaching 100% even with a single thread.
I reproduced this with rdkafka_simple_producer.c: put the whole body in a while(1) loop and produce a particular message to the Kafka server. Note that for now we do recreate/destroy the objects on each request.
Thanks JD
openVOS is not a supported platform; none of the librdkafka maintainers have ever tried running on openVOS, and it is very likely that subtle differences in threads, libc, POSIX, et al. may have CPU effects.
Secondly, it is strongly recommended to reuse the producer object: creating a new producer for each message is typically too costly to be practical, and it really does not serve much purpose unless the interval between messages is longer than a couple of minutes.
@edenhill thank you! I know openVOS isn't supported, but I am trying anyway; I will dig into the internal implementation to see if I can resolve the problems. I am changing the code to reuse the objects. Thanks, man!
Hello, I have a similar case; here is the data:
Symptom: 100% CPU Utilization.
2.0.0
2019-03-01 05:37:31.655 INFO [9468] [main@192] Producer is configured: Topic: prd.cex.gz2.trd.ug; Config: api.version.fallback.ms: 1200000;api.version.request: true;api.version.request.timeout.ms: 10000;auto.commit.interval.ms: 5000;batch.num.messages: 1;broker.address.family: any;broker.address.ttl: 1000;broker.version.fallback: 0.9.0;builtin.features: gzip,snappy,ssl,sasl,regex,lz4,sasl_gssapi,sasl_plain,sasl_scram,plugins;check.crcs: false;client.id: rdkafka;compression.codec: none;coordinator.query.interval.ms: 600000;debug: broker,topic;default_topic_conf: 0x5565b968bfb0;delivery.report.only.error: false;enable.auto.commit: true;enable.auto.offset.store: true;enable.partition.eof: true;enabled_events: 0;fetch.error.backoff.ms: 500;fetch.max.bytes: 52428800;fetch.message.max.bytes: 1048576;fetch.min.bytes: 1;fetch.wait.max.ms: 100;group.protocol.type: consumer;heartbeat.interval.ms: 300;internal.termination.signal: 0;log.connection.close: false;log.queue: false;log.thread.name: true;log_cb: 0x7f1ac5ad0580;log_level: 6;max.in.flight.requests.per.connection: 1000000;message.copy.max.bytes: 65535;message.max.bytes: 1000000;message.send.max.retries: 2;metadata.broker.list: 192.168.10.24:9092,192.168.10.25:9092,192.168.10.26:9092;metadata.max.age.ms: -1;metadata.request.timeout.ms: 60000;offset.store.method: broker;open_cb: 0x7f1ac5af6f10;partition.assignment.strategy: range,roundrobin;queue.buffering.backpressure.threshold: 1;queue.buffering.max.kbytes: 1048576;queue.buffering.max.messages: 100000;queue.buffering.max.ms: 0;queued.max.messages.kbytes: 1048576;queued.min.messages: 100000;receive.message.max.bytes: 100000000;reconnect.backoff.jitter.ms: 500;retry.backoff.ms: 100;sasl.kerberos.kinit.cmd: kinit -S "%{sasl.kerberos.service.name}/%{broker.name}" -k -t "%{sasl.kerberos.keytab}" %{sasl.kerberos.principal};sasl.kerberos.min.time.before.relogin: 60000;sasl.kerberos.principal: kafkaclient;sasl.kerberos.service.name: kafka;sasl.mechanisms: GSSAPI;security.protocol: plaintext;session.timeout.ms: 30000;socket.blocking.max.ms: 1000;socket.keepalive.enable: true;socket.max.fails: 1;socket.nagle.disable: false;socket.receive.buffer.bytes: 0;socket.send.buffer.bytes: 0;socket.timeout.ms: 60000;socket_cb: 0x7f1ac5ae4890;statistics.interval.ms: 0;topic.metadata.refresh.fast.cnt: 10;topic.metadata.refresh.fast.interval.ms: 250;topic.metadata.refresh.interval.ms: 300000;topic.metadata.refresh.sparse: true;; Builder topic:
Consumer diffs:
[producer.common]
metadata.broker.list=192.168.10.24:9092,192.168.10.25:9092,192.168.10.26:9092
debug=broker,topic
acks=1
batch.size=1
batch.num.messages=1
linger.ms=0
max.request.size=10
enable.idempotance=1
enable.gapless.guarantee=1
partition=-1

[consumer.common]
metadata.broker.list=192.168.10.24:9092,192.168.10.25:9092,192.168.10.26:9092
debug=consumer,topic
batch.size=1
batch.num.messages=10
partition=-1
group=me2
enable.auto.commit=0
Ubuntu 18.04
%4|1551416433.504|REQTMOUT|rdkafka#producer-4| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
%7|1551416433.505|REQERR|rdkafka#producer-4| [thrd:main]: 192.168.10.26:9092/bootstrap: MetadataRequest failed: Local: Timed out in queue: actions Retry
%7|1551416434.316|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.12 partition 0 Leader 3
%7|1551416434.316|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.11 partition 0 Leader 2
%7|1551416434.316|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.13 partition 0 Leader 2
%7|1551416434.316|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.10 partition 0 Leader 2
%7|1551416434.316|METADATA|rdkafka#producer-3| [thrd:main]: prod-kafka-instance-3:9092/3: 4/4 requested topic(s) seen in metadata
%7|1551416434.316|METADATA|rdkafka#producer-1| [thrd:main]: Topic prd.cex.bz2.trd.ug.12 partition 0 Leader 2
%7|1551416434.316|METADATA|rdkafka#producer-1| [thrd:main]: Topic prd.cex.bz2.trd.ug.11 partition 0 Leader 1
%7|1551416434.316|METADATA|rdkafka#producer-1| [thrd:main]: Topic prd.cex.bz2.trd.ug.13 partition 0 Leader 3
%7|1551416434.316|METADATA|rdkafka#producer-1| [thrd:main]: Topic prd.cex.bz2.trd.ug.10 partition 0 Leader 2
%7|1551416434.316|METADATA|rdkafka#producer-1| [thrd:main]: prod-kafka-instance-1:9092/1: 4/4 requested topic(s) seen in metadata
%7|1551416434.322|BROKERFAIL|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: failed: err: Local: Broker transport failure: (errno: Connection reset by peer)
%7|1551416434.322|FAIL|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: Disconnected (after 1199973ms in state UP)
%7|1551416434.322|STATE|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: Broker changed state UP -> DOWN
%7|1551416434.322|CONNECT|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: broker in state DOWN connecting
%7|1551416434.322|CONNECT|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: Connecting to ipv4#192.168.10.25:9092 (plaintext) with socket 91
%7|1551416434.322|STATE|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: Broker changed state DOWN -> CONNECT
%7|1551416434.322|CONNECT|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: Connected to ipv4#192.168.10.25:9092
%7|1551416434.322|CONNECTED|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: Connected (#616)
%7|1551416434.322|STATE|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1551416434.322|METADATA|rdkafka#producer-4| [thrd:main]: Topic prd.cex.gz2.trd.ug.12 partition 0 Leader 1
%7|1551416434.322|METADATA|rdkafka#producer-4| [thrd:main]: Topic prd.cex.gz2.trd.ug.11 partition 0 Leader 3
%7|1551416434.322|METADATA|rdkafka#producer-4| [thrd:main]: Topic prd.cex.gz2.trd.ug.13 partition 0 Leader 3
%7|1551416434.322|METADATA|rdkafka#producer-4| [thrd:main]: Topic prd.cex.gz2.trd.ug.10 partition 0 Leader 1
%7|1551416434.322|METADATA|rdkafka#producer-4| [thrd:main]: prod-kafka-instance-2:9092/2: 4/4 requested topic(s) seen in metadata
%7|1551416434.322|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.12 partition 0 Leader 3
%7|1551416434.322|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.11 partition 0 Leader 2
%7|1551416434.322|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.13 partition 0 Leader 2
%7|1551416434.322|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.10 partition 0 Leader 2
%7|1551416434.322|METADATA|rdkafka#producer-3| [thrd:main]: prod-kafka-instance-1:9092/1: 4/4 requested topic(s) seen in metadata
%7|1551416434.322|STATE|rdkafka#producer-3| [thrd:192.168.10.25:9092/bootstrap]: 192.168.10.25:9092/bootstrap: Broker changed state APIVERSION_QUERY -> UP
%7|1551416434.323|METADATA|rdkafka#producer-2| [thrd:main]: Topic prd.cex.gz1.trd.ug.12 partition 0 Leader 2
%7|1551416434.323|METADATA|rdkafka#producer-2| [thrd:main]: Topic prd.cex.gz1.trd.ug.11 partition 0 Leader 3
%7|1551416434.323|METADATA|rdkafka#producer-2| [thrd:main]: Topic prd.cex.gz1.trd.ug.13 partition 0 Leader 2
%7|1551416434.323|METADATA|rdkafka#producer-2| [thrd:main]: Topic prd.cex.gz1.trd.ug.10 partition 0 Leader 3
%7|1551416434.323|METADATA|rdkafka#producer-2| [thrd:main]: prod-kafka-instance-1:9092/1: 4/4 requested topic(s) seen in metadata
%7|1551416434.392|BROKERFAIL|rdkafka#producer-3| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: failed: err: Local: Broker transport failure: (errno: Connection reset by peer)
%7|1551416434.392|FAIL|rdkafka#producer-3| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Disconnected (after 1200042ms in state UP)
%7|1551416434.392|STATE|rdkafka#producer-3| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Broker changed state UP -> DOWN
%7|1551416434.392|CONNECT|rdkafka#producer-3| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: broker in state DOWN connecting
%7|1551416434.392|CONNECT|rdkafka#producer-3| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Connecting to ipv4#192.168.10.24:9092 (plaintext) with socket 101
%7|1551416434.392|STATE|rdkafka#producer-3| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Broker changed state DOWN -> CONNECT
%7|1551416434.393|BROKERFAIL|rdkafka#producer-1| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: failed: err: Local: Broker transport failure: (errno: Connection reset by peer)
%7|1551416434.393|CONNECT|rdkafka#producer-3| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Connected to ipv4#192.168.10.24:9092
%7|1551416434.393|CONNECTED|rdkafka#producer-3| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Connected (#603)
%7|1551416434.393|FAIL|rdkafka#producer-1| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Disconnected (after 2400065ms in state UP)
%7|1551416434.393|BROKERFAIL|rdkafka#producer-2| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: failed: err: Local: Broker transport failure: (errno: Connection reset by peer)
%7|1551416434.393|STATE|rdkafka#producer-1| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Broker changed state UP -> DOWN
%7|1551416434.393|FAIL|rdkafka#producer-2| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: Disconnected (after 1200043ms in state UP)
%7|1551416434.393|STATE|rdkafka#producer-2| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: Broker changed state UP -> DOWN
%7|1551416434.393|CONNECT|rdkafka#producer-2| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: broker in state DOWN connecting
%7|1551416434.393|BROKERFAIL|rdkafka#producer-2| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: failed: err: Local: Broker transport failure: (errno: Connection reset by peer)
%7|1551416434.393|FAIL|rdkafka#producer-2| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Disconnected (after 600019ms in state UP)
%7|1551416434.393|STATE|rdkafka#producer-2| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Broker changed state UP -> DOWN
%7|1551416434.393|CONNECT|rdkafka#producer-2| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: Connecting to ipv4#192.168.10.26:9092 (plaintext) with socket 37
%7|1551416434.393|STATE|rdkafka#producer-2| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: Broker changed state DOWN -> CONNECT
%7|1551416434.393|STATE|rdkafka#producer-3| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1551416434.393|CONNECT|rdkafka#producer-2| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: broker in state DOWN connecting
%7|1551416434.393|CONNECT|rdkafka#producer-1| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: broker in state DOWN connecting
%7|1551416434.393|CONNECT|rdkafka#producer-2| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: Connected to ipv4#192.168.10.26:9092
%7|1551416434.393|CONNECTED|rdkafka#producer-2| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: Connected (#616)
%7|1551416434.393|STATE|rdkafka#producer-2| [thrd:192.168.10.26:9092/bootstrap]: 192.168.10.26:9092/bootstrap: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1551416434.393|CONNECT|rdkafka#producer-2| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Connecting to ipv4#192.168.10.24:9092 (plaintext) with socket 49
%7|1551416434.393|STATE|rdkafka#producer-2| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Broker changed state DOWN -> CONNECT
%7|1551416434.393|CONNECT|rdkafka#producer-1| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Connecting to ipv4#192.168.10.24:9092 (plaintext) with socket 69
%7|1551416434.393|STATE|rdkafka#producer-1| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Broker changed state DOWN -> CONNECT
%7|1551416434.393|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.12 partition 0 Leader 3
%7|1551416434.393|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.11 partition 0 Leader 2
%7|1551416434.393|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.13 partition 0 Leader 2
%7|1551416434.393|METADATA|rdkafka#producer-3| [thrd:main]: Topic prd.cex.dz1.trd.ug.10 partition 0 Leader 2
%7|1551416434.393|CONNECT|rdkafka#producer-2| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Connected to ipv4#192.168.10.24:9092
%7|1551416434.393|CONNECTED|rdkafka#producer-2| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Connected (#605)
%7|1551416434.393|STATE|rdkafka#producer-2| [thrd:192.168.10.24:9092/bootstrap]: 192.168.10.24:9092/bootstrap: Broker changed state CONNECT -> APIVERSION_QUERY
Note: I can provide an archive on request.
Working on this
gcore:
[Current thread is 1 (Thread 0x7f62fbfadf40 (LWP 23269))]
(gdb) info threads
Id Target Id Frame
* 1   Thread 0x7f62fbfadf40 (LWP 23269) 0x00007f62fbb86dbb in __pthread_cond_timedwait (cond=cond@entry=0x55cf7060e038, mutex=mutex@entry=0x55cf7060e010, abstime=abstime@entry=0x7ffc7229a640) at pthread_cond_wait.c:668
  2   Thread 0x7f62f828a700 (LWP 23342) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62f82884a0, expected=0, futex_word=0x55cf705f5940) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  3   Thread 0x7f62f7a89700 (LWP 23344) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62f7a871b0, expected=0, futex_word=0x55cf705f6580) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  4   Thread 0x7f62f7288700 (LWP 23345) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62e0003f28, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  5   Thread 0x7f62f6a87700 (LWP 23346) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f6274008c18, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  6   Thread 0x7f62f6286700 (LWP 23347) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f626c008188, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  7   Thread 0x7f62f5a85700 (LWP 23352) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62f5a834a0, expected=0, futex_word=0x55cf705fa4c0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  8   Thread 0x7f62f5284700 (LWP 23354) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62f52821b0, expected=0, futex_word=0x55cf705f9fb0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  9   Thread 0x7f62f4a83700 (LWP 23355) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62940720f8, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  10  Thread 0x7f62effff700 (LWP 23356) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f6284003d38, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  11  Thread 0x7f62ef7fe700 (LWP 23357) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f627c006148, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  12  Thread 0x7f62eeffd700 (LWP 23358) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62eeffb4a0, expected=0, futex_word=0x55cf70603e04) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  13  Thread 0x7f62ee7fc700 (LWP 23359) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62ee7fa1b0, expected=0, futex_word=0x55cf70602f10) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  14  Thread 0x7f62edffb700 (LWP 23360) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62cc0042d8, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  15  Thread 0x7f62ed7fa700 (LWP 23361) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62c4005788, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  16  Thread 0x7f62ecff9700 (LWP 23362) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62d8006f18, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  17  Thread 0x7f62ec7f8700 (LWP 23364) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62ec7f64a0, expected=0, futex_word=0x55cf706097a0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  18  Thread 0x7f62ebff7700 (LWP 23365) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62ebff51b0, expected=0, futex_word=0x55cf70609120) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  19  Thread 0x7f62eb7f6700 (LWP 23366) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62b8004428, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  20  Thread 0x7f62eaff5700 (LWP 23367) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62a40040c8, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  21  Thread 0x7f62ea7f4700 (LWP 23368) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f628c0048b8, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  22  Thread 0x7f62d2ffd700 (LWP 23374) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f6264005168, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  23  Thread 0x7f62d3fff700 (LWP 23375) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f625c004248, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  24  Thread 0x7f62d1ffb700 (LWP 23378) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f6254004088, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  25  Thread 0x7f62d17fa700 (LWP 23379) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62600062a8, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  26  Thread 0x7f62d0ff9700 (LWP 23380) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f6258003c48, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  27  Thread 0x7f62b3fff700 (LWP 23467) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f624c003b68, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  28  Thread 0x7f62b37fe700 (LWP 23468) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x55cf70d86f38, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  29  Thread 0x7f62b27fc700 (LWP 23476) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f6260006078, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  30  Thread 0x7f62b0ff9700 (LWP 23496) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62c8003c18, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  31  Thread 0x7f62a3fff700 (LWP 23502) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f6270001d98, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  32  Thread 0x7f62a37fe700 (LWP 23511) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62680038a8, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  33  Thread 0x7f62a2ffd700 (LWP 23512) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f625c006278, nfds=2, timeout=999) at ../sysdeps/unix/sysv/linux/poll.c:29
  34  Thread 0x7f62e9ff3700 (LWP 23549) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62e9ff14a0, expected=0, futex_word=0x55cf7060d724) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  35  Thread 0x7f62e97f2700 (LWP 23550) 0x00007f62fbb86f85 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f62e97f01b0, expected=0, futex_word=0x55cf7060d020) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  36  Thread 0x7f62b17fa700 (LWP 23551) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f626c0078f8, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  37  Thread 0x7f62b1ffb700 (LWP 23552) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62740082d8, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  38  Thread 0x7f62e8ff1700 (LWP 23553) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62d80071f8, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  39  Thread 0x7f62d37fe700 (LWP 23554) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f629c003648, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  40  Thread 0x7f62d27fc700 (LWP 23555) 0x00007f62fa7aebf9 in __GI___poll (fds=fds@entry=0x7f62a80016d8, nfds=2, timeout=1001) at ../sysdeps/unix/sysv/linux/poll.c:29
  41  Thread 0x7f62b2ffd700 (LWP 23556) __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
(gdb)
Hello, it happened again, twice. Here is a more detailed report; the main thread:
(gdb) where
#0  0x00007f62fbb86dbb in __pthread_cond_timedwait (cond=cond@entry=0x55cf7060e038, mutex=mutex@entry=0x55cf7060e010, abstime=abstime@entry=0x7ffc7229a640) at pthread_cond_wait.c:668
#1  0x00007f62fa408259 in cnd_timedwait (cond=cond@entry=0x55cf7060e038, mtx=mtx@entry=0x55cf7060e010, ts=ts@entry=0x7ffc7229a640) at tinycthread.c:469
#2  0x00007f62fa408663 in cnd_timedwait_abs (cnd=cnd@entry=0x55cf7060e038, mtx=mtx@entry=0x55cf7060e010, tspec=tspec@entry=0x7ffc7229a640) at tinycthread_extra.c:100
#3  0x00007f62fa3d0e1f in rd_kafka_q_serve_rkmessages (rkq=0x55cf7060e010, timeout_ms=<optimized out>, rkmessages=0x55cf70d90b20, rkmessages_size=10) at rdkafka_queue.c:532
#4  0x00007f62fa39cbac in rd_kafka_consume_batch0 (rkmessages_size=<optimized out>, rkmessages=<optimized out>, timeout_ms=<optimized out>, rkq=<optimized out>) at rdkafka.c:2075
#5  rd_kafka_consume_batch_queue (rkqu=<optimized out>, timeout_ms=<optimized out>, rkmessages=<optimized out>, rkmessages_size=<optimized out>) at rdkafka.c:2121
#6  0x00007f62fb5c5305 in cppkafka::Consumer::poll_batch(unsigned long, std::chrono::duration<long, std::ratio<1l, 1000l> >) () from /home/cexprd/lib/libcppkafka.so.0.2
#7  0x000055cf701b3089 in g_msgLoop<g_normal(const string&, cppkafka::Configuration&, ProducersCollection&)::<lambda(const string&, ProducersCollection&, int&)>, int>(cppkafka::Consumer &, uint32_t, ProducersCollection &, int &) (consumer=..., batch_sz=batch_sz@entry=10, producers=std::vector of length 4, capacity 4 = {...}, Context=<optimized out>, fn=...) at src/./common/msg_loop.h:122
#8  0x000055cf701b5ec3 in g_normal (topic_name="prd.cex.pz1.ord.tg.12", config_consumer=..., producers=std::vector of length 4, capacity 4 = {...}) at src/normal.cpp:45
#9  0x000055cf701b0707 in main (argc=<optimized out>, argv=<optimized out>) at src/main.cpp:233
It seems like this is an infinite busy loop:
#3  0x00007f62fa3d0e1f in rd_kafka_q_serve_rkmessages (rkq=0x55cf7060e010, timeout_ms=<optimized out>, rkmessages=0x55cf70d90b20, rkmessages_size=10) at rdkafka_queue.c:532

from rdkafka_queue.c:532:

530
531         while (!(rko = TAILQ_FIRST(&rkq->rkq_q)) &&
532                cnd_timedwait_abs(&rkq->rkq_cond, &rkq->rkq_lock,
533                                  &timeout_tspec) != thrd_timedout)
534                 ;
535
Printouts:
(gdb) print rkq->rkq_q
$6 = {tqh_first = 0x0, tqh_last = 0x55cf7060e070}
(gdb) print rko
$7 = (rd_kafka_op_t *) 0x0
(gdb) print rkq
$8 = (rd_kafka_q_t *) 0x55cf7060e010
(gdb) print *rkq
$9 = {rkq_lock = {__data = {__lock = 2, __count = 0, __owner = 23269, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\345Z\000\000\001", '\000' <repeats 26 times>, __align = 2}, rkq_cond = {__data = {{__wseq = 308019789, __wseq32 = {__low = 308019789, __high = 0}}, {__g1_start = 308019705, __g1_start32 = {__low = 308019705, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 4294967287}, __g1_orig_size = 132, __wrefs = 0, __g_signals = {0, 0}}, __size = "M\002\\022\000\000\000\000\371\001\\022", '\000' <repeats 16 times>, "\367\377\377\377\204", '\000' <repeats 14 times>, __align = 308019789}, rkq_fwdq = 0x0, rkq_q = {tqh_first = 0x0, tqh_last = 0x55cf7060e070}, rkq_qlen = 0, rkq_qsize = 0, rkq_refcnt = 4, rkq_flags = 3, rkq_rk = 0x55cf70945350, rkq_qio = 0x0, rkq_serve = 0x0, rkq_opaque = 0x0, rkq_name = 0x7f62fa467740 <__FUNCTION__.26527> "rd_kafka_cgrp_new"}
(gdb) print timeout_tspec
$10 = {tv_sec = 1551416087, tv_nsec = 1000000000}
(gdb) print thrd_timedout
No symbol "thrd_timedout" in current context.
(gdb) print timeout_ms
$11 = <optimized out>
I can provide a coredump on demand, please help!
#1  0x00007f62fa408259 in cnd_timedwait (cond=cond@entry=0x55cf7060e038, mtx=mtx@entry=0x55cf7060e010, ts=ts@entry=0x7ffc7229a640) at tinycthread.c:469
469         ret = pthread_cond_timedwait(cond, mtx, ts);
(gdb) print *cond
$17 = {__data = {{__wseq = 308019789, __wseq32 = {__low = 308019789, __high = 0}}, {__g1_start = 308019705, __g1_start32 = {__low = 308019705, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 4294967287}, __g1_orig_size = 132, __wrefs = 0, __g_signals = {0, 0}}, __size = "M\002\\022\000\000\000\000\371\001\\022", '\000' <repeats 16 times>, "\367\377\377\377\204", '\000' <repeats 14 times>, __align = 308019789}
(gdb) print *mtx
$18 = {__data = {__lock = 2, __count = 0, __owner = 23269, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\345Z\000\000\001", '\000' <repeats 26 times>, __align = 2}
(gdb) print *ts
$19 = {tv_sec = 1551416087, tv_nsec = 1000000000}
I believe that is the problem, the nanosecond field is >= 1s. This was fixed in #2108 by @boatfish : 672dbf42d2a0ab8694be160740dad48554178a7c
We'll have a v1.0.0 release shortly, suggest you try out the latest RC in the meantime: v1.0.0-RC7
@wuqingjun
40K consumers - just curious, is the broker able to scale to 40K consumers in a consumer group?
Description - updated
The dump file shows 15 threads, 4 of which show very long CPU time. I am posting the call stacks of those 4 threads here; if you like, you can also open the attached file.
The symptom is similar to this issue but it should have been fixed: https://github.com/edenhill/librdkafka/issues/1569
And I did see the Kafka brokers go down and come back up at the time this issue happened, but I am unable to reproduce it.
Dump file info, I got this by running ~*k command in windbg: ...
How to reproduce
Unfortunately I cannot reproduce this issue.
When this issue happened, we had about 40K consumers and 10K producers connected to 10 Kafka brokers.
IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information: