confluentinc / confluent-kafka-go

Confluent's Apache Kafka Golang client
Apache License 2.0
4.64k stars 658 forks source link

Getting PAUSE in Kafka Consumer while consuming the records. #535

Open vasundhara785 opened 4 years ago

vasundhara785 commented 4 years ago

Description

When do we get PAUSE from librdkafka without explicitly calling pause ?The application seems to be getting PAUSE and rebalancing is occurring. And the application processing is not taking much time .How to avoid such scenarios?This is making the application to consume very slowly each msg for 10mins or so.

ev := Consumer.Poll(500) //Polls used 500ms

Configurations used

    "queued.max.messages.kbytes.int": "20480",
    "fetch.message.max.bytes.int": "20480",
    "socket.keepalive.enable.boolean": "true",
    "metadata.max.age.ms.int": "900000",
    "session.timeout.ms.int": "30000",
    "enable.auto.commit.boolean": "true",
    "metadata.request.timeout.ms.int": "30000",
    "socket.timeout.ms.int": "30000",
    "auto.commit.interval.ms.int": "30000",
    "heartbeat.interval.ms.int": "3000",
    "fetch.wait.max.ms.int": "100",
    "auto.offset.reset.string": "latest",
    "debug.string": "All"

gopkg.in/confluentinc/confluent-kafka-go.v1 v1.4.2

Debug logs when PAUSE occurred.

%7|1601001842.871|FETCH|rdkafka#consumer-1| [thrd:sasl_ssl://CI-egress-Tel-AIR700071.servicebus.windows.net:9093/]: sasl_ssl://CI-egress-Tel-AIR700071.servicebus.windows.net:9093/0: Fetch topic egress-1-1 [18] at offset 11489 (v10) %7|1601001842.871|CONSUME|rdkafka#consumer-1| [thrd:sasl_ssl://CI-egress-Tel-AIR700071.servicebus.windows.net:9093/]: sasl_ssl://CI-egress-Tel-AIR700071.servicebus.windows.net:9093/0: Enqueue 1 message(s) (2901 bytes, 1 ops) on egress-1-1 [26] fetch queue (qlen 1, v10, last_offset 11529, 0 ctrl msgs, uncompressed) %7|1601001842.871|FETCH|rdkafka#consumer-1| [thrd:sasl_ssl://CI-egress-Tel-AIR700071.servicebus.windows.net:9093/]: sasl_ssl://CI-egress-Tel-AIR700071.servicebus.windows.net:9093/0: Topic egress-1-1 [23] MessageSet size 0, error "Success", MaxOffset 11564, LSO 11564, Ver 10/10 %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [8] (v11) %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [15]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v11 %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [10]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v11 %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [9]: at offset 55904 (state active, v11) %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [8]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v11 %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [10] (v11) %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [13] (v11) %4|1601001842.873|SESSTMOUT|rdkafka#consumer-1| [thrd:main]: Consumer group session timed out (in join-state started) after 30387 ms without a successful response from the group coordinator (broker 0, last error was Local: Broker transport failure): revoking assignment and rejoining group %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [15] (v11) %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [9]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v11 %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [12]: at offset 55418 (state active, v11) %7|1601001842.873|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [10] received op PAUSE (v11) in fetch-state active (opv10) %7|1601001842.873|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [9] received op PAUSE (v11) in fetch-state active (opv10) %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Library pausing 8 partition(s) %7|1601001842.873|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [8] received op PAUSE (v11) in fetch-state active (opv10) %7|1601001842.873|ASSIGN|rdkafka#consumer-1| [thrd:main]: Group "smppclient": delegating revoke of 8 partition(s) to application rebalance callback on queue rd_kafka_cgrp_new: Consumer group session timed out (in join-state started) after 30387 ms without a successful response from the group coordinator (broker 0, last error was Local: Broker transport failure) %7|1601001842.873|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [12] received op PAUSE (v11) in fetch-state active (opv10) %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [12]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v11 %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [11]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v11 %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [9] (v11) %7|1601001842.873|MEMBERID|rdkafka#consumer-1| [thrd:main]: Group "smppclient": updating member id "si-egress-tel-air700071.servicebus.windows.net:c:smppclient:I:rdkafka-dfa0b867f4fc4354bc9015ccb8bfdfc4" -> "" %7|1601001842.873|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [11] received op PAUSE (v11) in fetch-state active (opv10) %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [14]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v11 %7|1601001842.873|CGRPJOINSTATE|rdkafka#consumer-1| [thrd:main]: Group "smppclient" changed join state started -> wait-revoke-rebalance_cb (v14, state up) %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [10]: at offset 56470 (state active, v11) %7|1601001842.873|REBALANCE|rdkafka#consumer-1| [thrd:main]: Group "smppclient" is rebalancing in state up (join-state started) with assignment: Consumer group session timed out (in join-state started) after 30387 ms without a successful response from the group coordinator (broker 0, last error was Local: Broker transport failure) %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [11]: at offset 56265 (state active, v11) %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [14] (v11) %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [8]: at offset 55955 (state active, v11) %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [13]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v11 %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [11] (v11) %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [12] (v11) %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [12]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v13 %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [14] received op PAUSE (v13) in fetch-state stopped (opv12) %7|1601001842.873|CGRPJOINSTATE|rdkafka#consumer-1| [thrd:main]: Group "smppclient" changed join state wait-revoke-rebalance_cb -> wait-unassign (v15, state up) %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Not resuming stopped egress-1-1 [11]: at offset 56265 (state stopped, v13) %7|1601001842.874|FETCH|rdkafka#consumer-1| [thrd:main]: Stopping fetch for egress-1-1 [15] in state active (v12) %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v16) for egress-1-1 [13] %7|1601001842.873|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Group "smppclient": unassigning 8 partition(s) (v16) %7|1601001842.874|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Unassign not done yet (1 wait_unassign, 1 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [15] received op FETCH_STOP (v12) in fetch-state active (opv11) %7|1601001842.874|STORETERM|rdkafka#consumer-1| [thrd:main]: egress-1-1 [15]: offset store terminating %7|1601001842.874|OFFSET|rdkafka#consumer-1| [thrd:main]: GroupCoordinator/0: Enqueue OffsetCommitRequest(v3, 8/8 partition(s))): unassign %7|1601001842.874|STORETERM|rdkafka#consumer-1| [thrd:main]: egress-1-1 [11]: offset store terminating %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Not resuming stopped egress-1-1 [9]: at offset 55904 (state stopped, v13) %7|1601001842.874|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Unassign not done yet (7 wait_unassign, 7 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [10] changed fetch state active -> stopping %7|1601001842.874|ASSIGN|rdkafka#consumer-1| [thrd:main]: Group "smppclient": assigning 0 partition(s) in join state wait-unassign %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v16) for egress-1-1 [14] %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [13] received op PAUSE (v13) in fetch-state stopped (opv12) %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [8]: rd_kafka_toppar_op_fetch_stop:2334: new version barrier v12 %7|1601001842.874|CONSUMER|rdkafka#consumer-1| [thrd:main]: Stop consuming egress-1-1 [12] (v12) %7|1601001842.874|CONSUMER|rdkafka#consumer-1| [thrd:main]: Stop consuming egress-1-1 [14] (v12) %7|1601001842.873|OFFSET|rdkafka#consumer-1| [thrd:main]: Topic egress-1-1 [10]: stored offset 56470, committed offset 56460: setting stored offset 56470 for commit %7|1601001842.874|DESP|rdkafka#consumer-1| [thrd:main]: Removing (un)desired topic egress-1-1 [14] %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [11] received op FETCH_STOP (v12) in fetch-state active (opv11) %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: Group "smppclient": rd_kafka_cgrp_unassign:2476: new version barrier v16 %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Not resuming stopped egress-1-1 [8]: at offset 55955 (state stopped, v13) %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [10] changed fetch state stopping -> stopped %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op PARTITION_LEAVE in state up (join state wait-unassign, v16) for egress-1-1 [9] %7|1601001842.873|OFFSET|rdkafka#consumer-1| [thrd:main]: Topic egress-1-1 [13]: stored offset 56014, committed offset 56004: setting stored offset 56014 for commit %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Not resuming stopped egress-1-1 [10]: at offset 56470 (state stopped, v13) %7|1601001842.874|STORETERM|rdkafka#consumer-1| [thrd:main]: egress-1-1 [10]: offset store terminating %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [9]: rd_kafka_toppar_op_fetch_stop:2334: new version barrier v12 %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op PARTITION_LEAVE in state up (join state wait-unassign, v16) for egress-1-1 [13] %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [15]: rd_kafka_toppar_op_fetch_stop:2334: new version barrier v12 %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [14]: rd_kafka_toppar_op_fetch_stop:2334: new version barrier v12 %7|1601001842.873|OFFSET|rdkafka#consumer-1| [thrd:main]: Topic egress-1-1 [15]: stored offset 56088, committed offset 56074: setting stored offset 56088 for commit %7|1601001842.874|DESP|rdkafka#consumer-1| [thrd:main]: Removing (un)desired topic egress-1-1 [9] %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op PARTITION_LEAVE in state up (join state wait-unassign, v16) for egress-1-1 [10] %7|1601001842.874|FETCH|rdkafka#consumer-1| [thrd:main]: Stopping fetch for egress-1-1 [8] in state active (v12) %7|1601001842.874|PARTDEL|rdkafka#consumer-1| [thrd:main]: Group "smppclient": delete egress-1-1 [8] %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v16) for egress-1-1 [8] %7|1601001842.874|FETCH|rdkafka#consumer-1| [thrd:main]: Stopping fetch for egress-1-1 [9] in state active (v12) %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [10]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v13 %7|1601001842.874|FETCH|rdkafka#consumer-1| [thrd:main]: Stopping fetch for egress-1-1 [14] in state active (v12) %7|1601001842.873|OFFSET|rdkafka#consumer-1| [thrd:main]: Topic egress-1-1 [12]: stored offset 55418, committed offset 55409: setting stored offset 55418 for commit %7|1601001842.874|STORETERM|rdkafka#consumer-1| [thrd:main]: egress-1-1 [9]: offset store terminating %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [8]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v13 %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [13] changed fetch state stopping -> stopped %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Resume egress-1-1 [9] (v13) %7|1601001842.874|DESP|rdkafka#consumer-1| [thrd:main]: Removing (un)desired topic egress-1-1 [10] %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Resume egress-1-1 [14] (v13) %7|1601001842.873|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [14] received op PAUSE (v11) in fetch-state active (opv10) %7|1601001842.874|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Unassign not done yet (8 wait_unassign, 8 assigned, 1 wait commit, join state wait-unassign): unassign %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [14] received op FETCH_STOP (v12) in fetch-state active (opv11) %7|1601001842.874|CONSUMER|rdkafka#consumer-1| [thrd:main]: Stop consuming egress-1-1 [15] (v12) %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [9]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v13 %7|1601001842.874|FETCH|rdkafka#consumer-1| [thrd:main]: Stopping fetch for egress-1-1 [13] in state active (v12) %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [8] received op FETCH_STOP (v12) in fetch-state active (opv11) %7|1601001842.874|FETCH|rdkafka#consumer-1| [thrd:main]: Stopping fetch for egress-1-1 [11] in state active (v12) %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Resume egress-1-1 [12] (v13) %7|1601001842.874|CONSUMER|rdkafka#consumer-1| [thrd:main]: Stop consuming egress-1-1 [11] (v12) %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [10] received op FETCH_STOP (v12) in fetch-state active (opv11) %7|1601001842.874|CONSUMER|rdkafka#consumer-1| [thrd:main]: Stop consuming egress-1-1 [9] (v12) %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v16) for egress-1-1 [9] %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [9] received op PAUSE (v13) in fetch-state stopped (opv12) %7|1601001842.874|DESP|rdkafka#consumer-1| [thrd:main]: Removing (un)desired topic egress-1-1 [13] %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [9] changed fetch state stopping -> stopped %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [8] changed fetch state active -> stopping %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v16) for egress-1-1 [11] %7|1601001842.874|FETCH|rdkafka#consumer-1| [thrd:main]: Stopping fetch for egress-1-1 [12] in state active (v12) %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op PARTITION_LEAVE in state up (join state wait-unassign, v16) for egress-1-1 [12] %7|1601001842.874|DESP|rdkafka#consumer-1| [thrd:main]: Removing (un)desired topic egress-1-1 [11] %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Resume egress-1-1 [8] (v13) %7|1601001842.874|CONSUMER|rdkafka#consumer-1| [thrd:main]: Stop consuming egress-1-1 [10] (v12) %7|1601001842.873|OFFSET|rdkafka#consumer-1| [thrd:main]: Topic egress-1-1 [11]: stored offset 56265, committed offset 56258: setting stored offset 56265 for commit %7|1601001842.873|OFFSET|rdkafka#consumer-1| [thrd:main]: Topic egress-1-1 [9]: stored offset 55904, committed offset 55896: setting stored offset 55904 for commit %7|1601001842.874|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Unassign not done yet (2 wait_unassign, 2 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done %7|1601001842.874|PARTDEL|rdkafka#consumer-1| [thrd:main]: Group "smppclient": delete egress-1-1 [10] %7|1601001842.873|OFFSET|rdkafka#consumer-1| [thrd:main]: Topic egress-1-1 [14]: stored offset 56024, committed offset 56015: setting stored offset 56024 for commit %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Library resuming 8 partition(s) %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [13] changed fetch state active -> stopping %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Not resuming stopped egress-1-1 [13]: at offset 56014 (state stopped, v13) %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [10]: rd_kafka_toppar_op_fetch_stop:2334: new version barrier v12 %7|1601001842.873|ASSIGN|rdkafka#consumer-1| [thrd:main]: Group "smppclient": new assignment of 0 partition(s) in join state wait-revoke-rebalance_cb %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Resume egress-1-1 [11] (v13) %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [13]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v13 %7|1601001842.874|PARTDEL|rdkafka#consumer-1| [thrd:main]: Group "smppclient": delete egress-1-1 [14] %7|1601001842.874|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Unassign not done yet (6 wait_unassign, 6 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [14] changed fetch state stopping -> stopped %7|1601001842.874|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Unassign not done yet (5 wait_unassign, 5 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Not resuming stopped egress-1-1 [15]: at offset 56088 (state stopped, v13) %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [12] received op FETCH_STOP (v12) in fetch-state active (opv11) %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Resume egress-1-1 [13] (v13) %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [12]: rd_kafka_toppar_op_fetch_stop:2334: new version barrier v12 %7|1601001842.874|STORETERM|rdkafka#consumer-1| [thrd:main]: egress-1-1 [13]: offset store terminating %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [11] changed fetch state active -> stopping %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [11]: rd_kafka_toppar_op_fetch_stop:2334: new version barrier v12 %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Resume egress-1-1 [15] (v13) %7|1601001842.874|DESP|rdkafka#consumer-1| [thrd:main]: Removing (un)desired topic egress-1-1 [15] %7|1601001842.874|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Unassign not done yet (3 wait_unassign, 3 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done %7|1601001842.873|BARRIER|rdkafka#consumer-1| [thrd:main]: Group "smppclient": rd_kafka_cgrp_assign:2572: new version barrier v15 %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [10] received op PAUSE (v13) in fetch-state stopped (opv12) %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op PARTITION_LEAVE in state up (join state wait-unassign, v16) for egress-1-1 [14] %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [12] changed fetch state active -> stopping %7|1601001842.874|STORETERM|rdkafka#consumer-1| [thrd:main]: egress-1-1 [12]: offset store terminating %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [15]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v13 %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [12] received op PAUSE (v13) in fetch-state stopped (opv12) %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [15]: at offset 56088 (state active, v11) %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [11]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v13 %7|1601001842.874|FETCH|rdkafka#consumer-1| [thrd:main]: Stopping fetch for egress-1-1 [10] in state active (v12) %7|1601001842.874|STORETERM|rdkafka#consumer-1| [thrd:main]: egress-1-1 [14]: offset store terminating %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [9] changed fetch state active -> stopping %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Resume egress-1-1 [10] (v13) %7|1601001842.874|PARTDEL|rdkafka#consumer-1| [thrd:main]: Group "smppclient": delete egress-1-1 [13] %7|1601001842.874|PARTDEL|rdkafka#consumer-1| [thrd:main]: Group "smppclient": delete egress-1-1 [11] %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [9] received op FETCH_STOP (v12) in fetch-state active (opv11) %7|1601001842.874|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Unassign not done yet (0 wait_unassign, 0 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done %7|1601001842.873|OFFSET|rdkafka#consumer-1| [thrd:main]: Topic egress-1-1 [8]: stored offset 55955, committed offset 55945: setting stored offset 55955 for commit %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v16) for egress-1-1 [12] %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [14]: at offset 56024 (state active, v11) %7|1601001842.874|UNASSIGN|rdkafka#consumer-1| [thrd:main]: Unassign not done yet (4 wait_unassign, 4 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done %7|1601001842.874|PARTDEL|rdkafka#consumer-1| [thrd:main]: Group "smppclient": delete egress-1-1 [15] %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Not resuming stopped egress-1-1 [14]: at offset 56024 (state stopped, v13) %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [8] received op PAUSE (v13) in fetch-state stopped (opv12) %7|1601001842.874|CONSUMER|rdkafka#consumer-1| [thrd:main]: Stop consuming egress-1-1 [8] (v12) %7|1601001842.874|STORETERM|rdkafka#consumer-1| [thrd:main]: egress-1-1 [8]: offset store terminating %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [12] changed fetch state stopping -> stopped %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op PARTITION_LEAVE in state up (join state wait-unassign, v16) for egress-1-1 [8] %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [13]: rd_kafka_toppar_op_fetch_stop:2334: new version barrier v12 %7|1601001842.874|RESUME|rdkafka#consumer-1| [thrd:main]: Not resuming stopped egress-1-1 [12]: at offset 55418 (state stopped, v13) %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [13] received op FETCH_STOP (v12) in fetch-state active (opv11) %7|1601001842.874|PARTDEL|rdkafka#consumer-1| [thrd:main]: Group "smppclient": delete egress-1-1 [9] %7|1601001842.873|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op ASSIGN (v0) in state up (join state wait-revoke-rebalance_cb, v14 vs 0) %7|1601001842.874|BARRIER|rdkafka#consumer-1| [thrd:main]: egress-1-1 [14]: rd_kafka_toppar_op_pause_resume:2393: new version barrier v13 %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v16) for egress-1-1 [15] %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op PARTITION_LEAVE in state up (join state wait-unassign, v16) for egress-1-1 [15] %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [15] received op PAUSE (v13) in fetch-state stopped (opv12) %7|1601001842.874|SEND|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/0: Sent OffsetCommitRequest (v3, 179 bytes @ 0, CorrId 20980) %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [8] changed fetch state stopping -> stopped %7|1601001842.874|DESP|rdkafka#consumer-1| [thrd:main]: Removing (un)desired topic egress-1-1 [12] %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op PARTITION_LEAVE in state up (join state wait-unassign, v16) for egress-1-1 [11] %7|1601001842.873|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [15] received op PAUSE (v11) in fetch-state active (opv10) %7|1601001842.873|PAUSE|rdkafka#consumer-1| [thrd:main]: Pause egress-1-1 [13]: at offset 56014 (state active, v11) %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [15] changed fetch state active -> stopping %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [15] changed fetch state stopping -> stopped %7|1601001842.874|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [11] received op PAUSE (v13) in fetch-state stopped (opv12) %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [14] changed fetch state active -> stopping %7|1601001842.874|DESP|rdkafka#consumer-1| [thrd:main]: Removing (un)desired topic egress-1-1 [8] %7|1601001842.874|CGRPOP|rdkafka#consumer-1| [thrd:main]: Group "smppclient" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v16) for egress-1-1 [10] %7|1601001842.874|CONSUMER|rdkafka#consumer-1| [thrd:main]: Stop consuming egress-1-1 [13] (v12) %7|1601001842.873|OP|rdkafka#consumer-1| [thrd:main]: egress-1-1 [13] received op PAUSE (v11) in fetch-state active (opv10) %7|1601001842.874|COMMIT|rdkafka#consumer-1| [thrd:main]: GroupCoordinator/0: Committing offsets for 8 partition(s): unassign %7|1601001842.874|PARTSTATE|rdkafka#consumer-1| [thrd:main]: Partition egress-1-1 [11] changed fetch state stopping -> stopped %7|1601001842.874|PARTDEL|rdkafka#consumer-1| [thrd:main]: Group "smppclient": delete egress-1-1 [12] %7|1601001842.876|RECV|rdkafka#consumer-1| [thrd:sasl_ssl://TBS-TEL-CI-AKS-AIRTEL-01.servicebus.windows.net:9093]: sasl_ssl://TBS-TEL-CI-AKS-AIRTEL-01.servicebus.windows.net:9093/0: Received FetchResponse (v4, 3582 bytes, CorrId 432981, rtt 151.99ms)

edenhill commented 4 years ago

The PAUSE is just an internal means to stop providing messages to the application when a rebalance happens, the rebalance itself seems to be caused by the GroupCoordinator connection/requests timing out: %4|1601001842.873|SESSTMOUT|rdkafka#consumer-1| [thrd:main]: Consumer group session timed out (in join-state started) after 30387 ms without a successful response from the group coordinator (broker 0, last error was Local: Broker transport failure): revoking assignment and rejoining group

This is typically caused by networking issues. Look in the debug logs for the HeartbeatResponses and what their rtt (latency) is.

vasundhara785 commented 4 years ago

Ok .This is what the hearbeatresponse rtt is...

%7|1601001842.921|RECV|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/0: Received HeartbeatResponse (v1, 6 bytes, CorrId 20981, rtt 25.78ms)

How could we avoid getting such PAUSE or what is the solution to be handled?

edenhill commented 4 years ago

Don't focus on the PAUSE, it is not relevant, it is the rebalance reason that is relevant and in this case it seems to be network related. Can you provide the full debug logs with debug=cgrp,protocol so we can observe the heartbeats?

vasundhara785 commented 4 years ago

rebalance vil 2d3cd pod.xlsx

Can you please look into the attached xls file and i have done debug=all.

edenhill commented 4 years ago

The log for consumer-1 is incomplete, it begins when the consumer is already fetching and only two seconds before the session timeout. Please provide all logs for consumer-1 only up to timestamp 1601000770.562

vasundhara785 commented 4 years ago

rebalance vil 02 26.xlsx

All logs for consumer-1 before the mentioned timestamp.

One thing to know What is the time normally taken to rebalance ?Is there any configuration to reduce the time taken to rebalance ?