confluentinc / librdkafka

The Apache Kafka C/C++ library
Other
286 stars 3.15k forks source link

rdkafka_partition.c:165:rd_kafka_toppar_remove: assert: rktp->rktp_removed==0 #558

Closed wucanhui closed 8 years ago

wucanhui commented 8 years ago

We are using librdkafka-0.9.0.99, and it crash when keeping on consuming message for hours. The logs message: Consume failed: Local: Operation in progress Consume failed: Local: Operation in progress Consume failed: Local: Operation in progress Consume failed: Local: Operation in progress Consume failed: Local: Operation in progress * rdkafka_partition.c:165:rd_kafka_toppar_remove: assert: rktp->rktp_removed==0 * Aborted (core dumped)

The gdb info is like this: [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff40195000 Core was generated by `./etld etl.properties'. Program terminated with signal 6, Aborted.

0 0x00007ff1e34630d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) bt

0 0x00007ff1e34630d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6

1 0x00007ff1e346683b in abort () from /lib/x86_64-linux-gnu/libc.so.6

2 0x00007ff1e28402aa in rd_kafka_crash (file=, line=, function=, rk=0x0,

reason=<optimized out>) at rdkafka.c:2096

3 0x00007ff1e2870ad5 in rd_kafka_toppar_remove (rktp=0x7ff1bc003020) at rdkafka_partition.c:165

4 0x00007ff1e2870af2 in rd_kafka_toppar_destroy_final (rktp=0x7ff1bc003020) at rdkafka_partition.c:183

5 0x00007ff1e28514d8 in rd_kafka_topic_leader_update (rkb=0x7ff1d8002890, partition=1, rkt=0x7ff1bc0011d0,

leader_id=<optimized out>) at rdkafka_topic.c:351

6 rd_kafka_topic_metadata_update (rkb=0x7ff1b8003f40, mdt=0x7ff1b8001547) at rdkafka_topic.c:735

7 0x00007ff1e2860d31 in rd_kafka_parse_Metadata (rkb=0x7ff1b8003f40, rkt=0x7ff1bc0011d0, rkbuf=)

at rdkafka_request.c:1657

8 0x00007ff1e28692b8 in rd_kafka_op_handle_Metadata (rkb=0x7ff1b8003f40, err=RD_KAFKA_RESP_ERR_NO_ERROR, rkbuf=0x2325640,

request=<optimized out>, opaque=0x7ff1bc000970) at rdkafka_request.c:1721

9 0x00007ff1e2859959 in rd_kafka_buf_callback (rkb=0x7ff1b8003f40, err=RD_KAFKA_RESP_ERR_NO_ERROR, response=0x2325640,

request=0x23257f0) at rdkafka_buf.c:472

10 0x00007ff1e2838609 in rd_kafka_toppar_q_cb (rk=, rko=, cb_type=,

opaque=<optimized out>) at rdkafka.c:886

11 0x00007ff1e285bd0f in rd_kafka_q_serve (rkq=0x7ff1d8001670, timeout_ms=, max_cnt=, cb_type=0,

callback=0x7ff1e2838600 <rd_kafka_toppar_q_cb>, opaque=0x0) at rdkafka_queue.c:371

12 0x00007ff1e2841dbd in rd_kafka_toppars_q_serve (timeout_ms=, rkq=0x7ff1d8001670) at rdkafka.c:892

13 rd_kafka_thread_main (arg=0x7ff1d80015e0) at rdkafka.c:929

14 0x00007ff1e28781ff in _thrd_wrapper_function (aArg=) at tinycthread.c:599

15 0x00007ff1e3d08e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

16 0x00007ff1e352038d in clone () from /lib/x86_64-linux-gnu/libc.so.6

17 0x0000000000000000 in ?? ()

Do you know why it crash like this? Thanks.

edenhill commented 8 years ago

From the backtrace it seems that one of your topics were suddenly missing from the metadata returned from the broker, or that the partition count changed for a topic. Does that sound familiar?

librdkafka still shouldnt fail because of this and I'll try to reproduce it.

alanyin commented 8 years ago

I have the exact same issue, the coredump stacks is the same. Is this issue still open? Any idea when it will be fixed?

edenhill commented 8 years ago

Is this still happening on latest master?

alanyin commented 8 years ago

I downloaded the source on 9 Apr. Didn't try the latest one? Should I try latest master?

Alan

From: Magnus Edenhill notifications@github.com<mailto:notifications@github.com> Reply-To: edenhill/librdkafka reply@reply.github.com<mailto:reply@reply.github.com> Date: Monday, May 9, 2016 at 4:23 PM To: edenhill/librdkafka librdkafka@noreply.github.com<mailto:librdkafka@noreply.github.com> Cc: Alan Yin ayin@threatmetrix.com<mailto:ayin@threatmetrix.com>, Comment comment@noreply.github.com<mailto:comment@noreply.github.com> Subject: Re: [edenhill/librdkafka] rdkafka_partition.c:165:rd_kafka_toppar_remove: assert: rktp->rktp_removed==0 (#558)

Is this still happening on latest master?

You are receiving this because you commented. Reply to this email directly or view it on GitHubhttps://github.com/edenhill/librdkafka/issues/558#issuecomment-217784806

edenhill commented 8 years ago

Yes please, try latest master and if problem persists outline exact steps to reproduce. Thanks

edenhill commented 8 years ago

This seems to be caused by connecting to two clusters simultaneously where the topic state differs. This is a problem on the user side but we still shouldn't crash.

jleducq commented 8 years ago

This error is still happening with the latest version: librdkafka version: latest version kafka version: 0.10.0

My setup:

bin/kafka-topics.sh --zookeeper server1,server2,server3,server4,server5,server6 --describe --topic myTopic Topic:myTopic PartitionCount:3 ReplicationFactor:2 Configs: Topic: myTopic Partition: 0 Leader: 1 Replicas: 1,2 Isr: 2,1 Topic: myTopic Partition: 1 Leader: 3 Replicas: 3,4 Isr: 4,3 Topic: myTopic Partition: 2 Leader: 5 Replicas: 5,6 Isr: 5,6

My application is producing on partition 1 and consuming from partition 0 and 2.

Scenario:

  1. While the application is running (producing) I shutdown the Kafka servers 3 and 4 (both replicas for partition 1) to see the application reaction.
  2. I also had this issue without doing anything (all my partitions/brokers were up) and it failed randomly overnight while the application was not producing/consuming. I had to restart every single kafka broker to restart the application. Otherwise it would automatically fail at startup with the below error.

Application log:

INFO - DeliveryReportCallback: Ack of message delivery for (124 bytes): Success INFO - DeliveryReportCallback: Ack of message delivery for (124 bytes): Success INFO - DeliveryReportCallback: Ack of message delivery for (124 bytes): Success INFO - DeliveryReportCallback: Ack of message delivery for (124 bytes): Success INFO - KafkaEventCallback: Received a log event: server4:9092/4: Receive failed: Disconnected(3, FAIL) ERROR - KafkaEventCallback: Received an error event: server4:9092/4: Receive failed: Disconnected(Local: Broker transport failure) INFO - KafkaEventCallback: Received a log event: Topic myTopic partition count changed from 3 to 1(5, PARTCNT) INFO - KafkaEventCallback: Received a log event: Topic myTopic: partition [1] is unknown (partition_cnt 1)(4, LEADER) * rdkafka_partition.c:207:rd_kafka_toppar_remove: assert: rktp->rktp_removed==0 * Aborted (core dumped)

GDB back trace:

0 0x0000003c24432625 in raise () from /lib64/libc.so.6

Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 cyrus-sasl-plain-2.1.23-15.el6_6.2.x86_64 db4-4.7.25-18.el6_4.x86_64 glibc-2.12-1.149.el6_6.7.x86_64 gperftools-libs-2.0-11.el6.3.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-37.el6_6.x86_64 libcom_err-1.41.12-21.el6.x86_64 libcurl-7.19.7-40.el6_6.4.x86_64 libgcc-4.4.7-11.el6.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libssh2-1.4.2-1.el6_6.1.x86_64 libstdc++-4.4.7-11.el6.x86_64 libunwind-1.1-2.el6.x86_64 libuuid-2.17.2-12.18.el6.x86_64 nspr-4.10.6-1.el6_5.x86_64 nss-3.16.2.3-3.el6_6.x86_64 nss-softokn-freebl-3.14.3-22.el6_6.x86_64 nss-util-3.16.2.3-2.el6_6.x86_64 openssl-1.0.1e-30.el6.8.x86_64 xerces-c-3.0.1-20.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) bt

0 0x0000003c24432625 in raise () from /lib64/libc.so.6

1 0x0000003c24433e05 in abort () from /lib64/libc.so.6

2 0x00007f4cc612ad33 in rd_kafka_crash (file=file@entry=0x7f4cc617f6a5 "rdkafka_partition.c", line=line@entry=207, function=function@entry=0x7f4cc6180460 <FUNCTION.22025> "rd_kafka_toppar_remove", rk=rk@entry=0x0,

reason=reason@entry=0x7f4cc617f70a "assert: rktp->rktp_removed==0") at rdkafka.c:2395

3 0x00007f4cc6169b31 in rd_kafka_toppar_remove (rktp=rktp@entry=0x7f4c68000b40) at rdkafka_partition.c:207

4 0x00007f4cc6169b52 in rd_kafka_toppar_destroy_final (rktp=0x7f4c68000b40) at rdkafka_partition.c:225

5 0x00007f4cc614dd8c in rd_kafka_op_destroy (rko=0x7f4c680238b0) at rdkafka_op.c:106

6 0x00007f4cc613b827 in rd_kafka_broker_op_serve (rkb=rkb@entry=0x13275f0, rko=0x7f4c680238b0) at rdkafka_broker.c:3016

7 0x00007f4cc613c32b in rd_kafka_broker_serve (rkb=rkb@entry=0x13275f0, timeout_ms=timeout_ms@entry=10) at rdkafka_broker.c:3035

8 0x00007f4cc613c689 in rd_kafka_broker_ua_idle (rkb=rkb@entry=0x13275f0, timeout_ms=timeout_ms@entry=0) at rdkafka_broker.c:3095

9 0x00007f4cc613cf29 in rd_kafka_broker_thread_main (arg=arg@entry=0x13275f0) at rdkafka_broker.c:4407

10 0x00007f4cc6171107 in _thrd_wrapper_function (aArg=) at tinycthread.c:611

11 0x0000003c24c079d1 in start_thread () from /lib64/libpthread.so.0

12 0x0000003c244e88fd in clone () from /lib64/libc.so.6

Note that I am using the KafkaConsumer, not the legacy Consumer. I also do not have a custom rebalance callback.

Thanks.

edenhill commented 8 years ago

It looks like this happens when different brokers report different partition counts for the same topic.

jleducq commented 8 years ago

Just saw that this bug was part of 0.9.2 milestone. Do you have any date in mind for the release? Thanks.

edenhill commented 8 years ago

Ive reproduced it and will fix it soon.

edenhill commented 8 years ago

This is fixed on the partition_changes branch

edenhill commented 8 years ago

This issue is now fixed on master, please try to verify the fix in your environment. Thank you.