confluentinc / librdkafka

The Apache Kafka C/C++ library

assignor_state is NULL initially and it's only set after assignment but its value isn't checked before calling #4312

Closed mensfeld closed 1 year ago

mensfeld commented 1 year ago

Hey,

This is an expansion of the report made here: https://github.com/confluentinc/librdkafka/issues/4308 - I created a separate issue because I don't have edit rights to the original one, and I find that this info relates not only to the shutdown of the Ruby process but also to any attempt to close the consumer during a cooperative-sticky rebalance.

If you consider it part of https://github.com/confluentinc/librdkafka/issues/4308 please merge them :pray:

How to reproduce

  1. Initialize a few consumers in the same process (multiple threads) or many processes.
  2. Let them rebalance, and during a cooperative-sticky rebalance try to close them
  3. Profit (crash)

Here is the simplified code I used to reproduce this. It reproduces in 100% of cases:

loop do
  # Spin up several consumers in parallel threads so they keep triggering
  # cooperative-sticky rebalances against each other.
  [1, 2, 3, 4, 5].map do
    Thread.new do
      # setup_rdkafka_consumer is a test-suite helper that builds an
      # Rdkafka consumer with the given config overrides.
      consumer = setup_rdkafka_consumer(
        'partition.assignment.strategy': 'cooperative-sticky'
      )
      consumer.subscribe('topic')
      consumer.poll(500)
      # Closing while the group may still be rebalancing triggers the crash.
      consumer.close
    end
  end.each(&:join)
end

The close code we use in Ruby follows the expected flow: we first close the consumer and then destroy it.
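
For reference, a minimal C sketch of that order against the public librdkafka API (error handling trimmed; the Ruby bindings go through these same two calls):

#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Close the group membership first, destroy the handle last. */
static void shutdown_consumer(rd_kafka_t *rk) {
        rd_kafka_resp_err_t err = rd_kafka_consumer_close(rk);

        if (err)
                fprintf(stderr, "consumer_close failed: %s\n",
                        rd_kafka_err2str(err));

        rd_kafka_destroy(rk); /* only after close has returned */
}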

Every single execution ends up with this:

ruby: rdkafka_sticky_assignor.c:1898: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.
Aborted (core dumped)

I confirmed this behavior exists in the following librdkafka versions:

Now let's dive deeper:

  1. As a temporary mitigation, it is possible to call rd_kafka_unsubscribe and then wait roughly the duration of a rebalance (see the sketch after this list). This works fairly well, but it is not reliable and drastically increases the time needed to shut down the consumer, because there is no direct way (aside from maybe using metrics via poll) to establish that the consumer is not in a rebalance exactly at the moment rd_kafka_destroy runs. The wait is needed even though rd_kafka_assignment returns no assignments: after revocation but before re-assignment the TPL is empty, which gives the misleading impression that there are (and will be) no TPLs assigned.
  2. Guarding the rebalance callback with a mutex seems to solve the problem; however, my integration suite under load still showed crashes (though far fewer).
  3. Waiting after rd_kafka_consumer_close also partially mitigates this, because rd_kafka_consumer_close unsubscribes automatically. This may help for long-living consumers (ref: https://github.com/appsignal/rdkafka-ruby/issues/254), but it does not solve the problem for short-lived consumers (I don't know why) that are in the middle of getting their first assignment.
  4. This issue does not occur with the standard assignment strategy, only with cooperative-sticky.
  5. Using a custom rebalance callback with a mutex also partially mitigates the issue, especially when combined with process-global state indicating that a rebalance is in progress. However, there is a window where the rebalance callback has not yet been triggered but the rdkafka_sticky_assignor state has already been created; if we attempt to close and destroy the consumer at that moment, the crash happens as well. This is less likely, because the time between the initialization of the rdkafka_sticky_assignor and the hand-off to the rebalance callback is short, but the issue persists.
  6. It affects more complex consumer frameworks like karafka less, because they do a lot of work around the consumer before running the close, so the probability of being mid-rebalance is lower (though it can happen).
  7. I did not find similar assertions in rdkafka_assignor.c or in rdkafka_roundrobin_assignor.
  8. Skipping rd_kafka_destroy in a process that is going to exit anyway (long-running processes under shutdown) also partially mitigates this (9/10 times).
  9. A really long initial poll on the first rebalance of a consumer instance (more than 2 seconds) also seems to mitigate this at least partially. This may be because such a poll outlasts the whole rebalance, but it is hard for me to give a definitive answer here.
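
To make the timing problem in point 1 concrete, here is a rough C sketch of that mitigation; cautious_shutdown and the fixed grace period are my own invention, since there is no reliable signal that the rebalance has settled:

#include <unistd.h>
#include <librdkafka/rdkafka.h>

/* Mitigation (1): unsubscribe, then wait out the rebalance before
 * destroying the handle. The sleep is a blunt instrument because
 * rd_kafka_assignment() may report an empty list while a rebalance is
 * still in flight. */
static void cautious_shutdown(rd_kafka_t *rk, unsigned int grace_seconds) {
        rd_kafka_topic_partition_list_t *assignment = NULL;

        rd_kafka_unsubscribe(rk);

        /* Empty here does NOT mean the group has finished rebalancing. */
        if (!rd_kafka_assignment(rk, &assignment) && assignment)
                rd_kafka_topic_partition_list_destroy(assignment);

        sleep(grace_seconds);            /* wait roughly one rebalance */
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}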

Suggested fix

The consumer should probably wait for the rebalance to finish before fully closing itself, although this may introduce closing lag during a long-running rebalance. The alternative would be to drop out of the consumer group and just let the rebalance proceed, but I have no idea what the effect of this would be on the consumer group.
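
A hedged, application-side sketch of the first option (waiting for a rebalance before closing), using a custom rebalance callback and a process-level flag of our own; first_rebalance_done and close_after_first_rebalance are illustrative names, not librdkafka APIs:

#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>
#include <librdkafka/rdkafka.h>

/* Our own flag: set once the rebalance callback has delivered the
 * first (possibly empty) assignment; close is deferred until then. */
static atomic_bool first_rebalance_done;

static void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                         rd_kafka_topic_partition_list_t *parts,
                         void *opaque) {
        int cooperative =
            !strcmp(rd_kafka_rebalance_protocol(rk), "COOPERATIVE");

        /* Return values of the (un)assign calls ignored for brevity. */
        if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS) {
                if (cooperative)
                        rd_kafka_incremental_assign(rk, parts);
                else
                        rd_kafka_assign(rk, parts);
                atomic_store(&first_rebalance_done, true);
        } else if (err == RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS) {
                if (cooperative)
                        rd_kafka_incremental_unassign(rk, parts);
                else
                        rd_kafka_assign(rk, NULL);
        }
}

/* Keep polling (the callback only fires from poll) until the first
 * assignment has been handed over, then close and destroy. */
static void close_after_first_rebalance(rd_kafka_t *rk) {
        while (!atomic_load(&first_rebalance_done)) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}

Per point 5 above, this narrows the window but does not fully close it, since the sticky assignor state can exist before the callback fires.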

Logs

Here is the tail of the debug all log (if you need more, just ping me; I can generate it on the spot):

D, [2023-06-09T09:35:10.457239 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: Broadcasting state change
D, [2023-06-09T09:35:10.457246 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": resetting group leader info: JoinGroup response clean-up
D, [2023-06-09T09:35:10.457254 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-join -> wait-metadata (state up)
D, [2023-06-09T09:35:10.457262 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: Request metadata for 1 topic(s): partition assignor
D, [2023-06-09T09:35:10.457271 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Sent MetadataRequest (v9, 41 bytes @ 0, CorrId 5)
D, [2023-06-09T09:35:10.457278 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Received MetadataResponse (v9, 341 bytes, CorrId 5, rtt 0.51ms)
D, [2023-06-09T09:35:10.457286 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: ===== Received metadata (for 1 requested topics): partition assignor =====
D, [2023-06-09T09:35:10.457294 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: ClusterId: o9n0mScnRY-4mvOHIvPnZQ, ControllerId: 1001
D, [2023-06-09T09:35:10.457302 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: 1 brokers, 1 topics
D, [2023-06-09T09:35:10.457309 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001:   Topic 6bad105f430f with 10 partitions
D, [2023-06-09T09:35:10.457317 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001:   Broker #0/1: localhost:9092 NodeId 1001
D, [2023-06-09T09:35:10.457325 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: 1/1 requested topic(s) seen in metadata
D, [2023-06-09T09:35:10.457332 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" running cooperative-sticky assignor for 1 member(s) and 1 eligible subscribed topic(s):
D, [2023-06-09T09:35:10.457340 #2162967] DEBUG -- : rdkafka: [thrd:main]:  Member "rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69" (me) with 0 owned partition(s) and 1 subscribed topic(s):
D, [2023-06-09T09:35:10.457349 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [-1]
D, [2023-06-09T09:35:10.457356 #2162967] DEBUG -- : rdkafka: [thrd:main]: Sort 10 partitions in existing assignment
D, [2023-06-09T09:35:10.457364 #2162967] DEBUG -- : rdkafka: [thrd:main]: Prepared balanced reassignment for 1 consumers, 0 available partition(s) where of 0 are unassigned (initializing=true, revocationRequired=false, 1 fixed assignments)
D, [2023-06-09T09:35:10.457372 #2162967] DEBUG -- : rdkafka: [thrd:main]: Reassignment not performed after 1 iteration(s) of 0 reassignable partition(s)
D, [2023-06-09T09:35:10.457380 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" cooperative-sticky assignment for 1 member(s) finished in 0.037ms:
D, [2023-06-09T09:35:10.457387 #2162967] DEBUG -- : rdkafka: [thrd:main]:  Member "rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69" (me) assigned 10 partition(s):
D, [2023-06-09T09:35:10.457395 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [0]
D, [2023-06-09T09:35:10.457403 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [1]
D, [2023-06-09T09:35:10.457410 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [2]
D, [2023-06-09T09:35:10.457418 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [3]
D, [2023-06-09T09:35:10.457425 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [4]
D, [2023-06-09T09:35:10.457432 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [5]
D, [2023-06-09T09:35:10.457440 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [6]
D, [2023-06-09T09:35:10.457460 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [7]
D, [2023-06-09T09:35:10.457470 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [8]
D, [2023-06-09T09:35:10.457477 #2162967] DEBUG -- : rdkafka: [thrd:main]:   6bad105f430f [9]
D, [2023-06-09T09:35:10.457485 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": "cooperative-sticky" assignor run for 1 member(s)
D, [2023-06-09T09:35:10.457493 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": Partitions owned by members: 0, partitions assigned by assignor: 10
D, [2023-06-09T09:35:10.457501 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": COOPERATIVE protocol collection sizes: maybe revoking: 0, ready to migrate: 10, unknown but owned: 0
D, [2023-06-09T09:35:10.457508 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": 10 partitions assigned to consumers
D, [2023-06-09T09:35:10.457518 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-metadata -> wait-sync (state up)
D, [2023-06-09T09:35:10.457526 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Sent SyncGroupRequest (v3, 221 bytes @ 0, CorrId 6)
D, [2023-06-09T09:35:10.457533 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: Broadcasting state change
D, [2023-06-09T09:35:10.457541 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Received SyncGroupResponse (v3, 10 bytes, CorrId 6, rtt 0.21ms)
D, [2023-06-09T09:35:10.457549 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: SyncGroupRequest failed: Broker: Group rebalance in progress: actions Permanent
D, [2023-06-09T09:35:10.457556 #2162967] DEBUG -- : rdkafka: [thrd:main]: SyncGroup response: Broker: Group rebalance in progress (0 bytes of MemberState data)
D, [2023-06-09T09:35:10.457564 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": synchronization failed: Broker: Group rebalance in progress: rejoining
D, [2023-06-09T09:35:10.457572 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": Rejoining group without an assignment: SyncGroup error: Broker: Group rebalance in progress
D, [2023-06-09T09:35:10.457580 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-sync -> init (state up)
I, [2023-06-09T09:35:10.458469 #2162967]  INFO -- : [691ad1f2b56f] Sync producing of a message to '6bad105f430f' topic took 10.185397148132324 ms
D, [2023-06-09T09:35:10.458501 #2162967] DEBUG -- : [691ad1f2b56f] {:partition=>8, :topic=>"6bad105f430f", :payload=>"9"}
I, [2023-06-09T09:35:10.468784 #2162967]  INFO -- : [691ad1f2b56f] Sync producing of a message to '6bad105f430f' topic took 10.214381694793701 ms
D, [2023-06-09T09:35:10.468830 #2162967] DEBUG -- : [691ad1f2b56f] {:partition=>9, :topic=>"6bad105f430f", :payload=>"10"}
D, [2023-06-09T09:35:10.941137 #2162967] DEBUG -- : rdkafka: [thrd:app]: Closing consumer
D, [2023-06-09T09:35:10.941271 #2162967] DEBUG -- : rdkafka: [thrd:app]: Waiting for close events
D, [2023-06-09T09:35:10.941342 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" received op TERMINATE in state up (join-state init)
D, [2023-06-09T09:35:10.941424 #2162967] DEBUG -- : rdkafka: [thrd:main]: Terminating group "node.rb-e4ef35230e0d_app" in state up with 0 partition(s)
D, [2023-06-09T09:35:10.941485 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": unsubscribe from current subscription of size 1 (leave group=true, has joined=true, rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69, join-state init)
D, [2023-06-09T09:35:10.941568 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": clearing subscribed topics list (1)
D, [2023-06-09T09:35:10.941634 #2162967] DEBUG -- : rdkafka: [thrd:app]: Closing consumer
D, [2023-06-09T09:35:10.941700 #2162967] DEBUG -- : rdkafka: [thrd:app]: Waiting for close events
D, [2023-06-09T09:35:10.941753 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": effective subscription list changed from 1 to 0 topic(s):
D, [2023-06-09T09:35:10.941841 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": resetting group leader info: unsubscribe
D, [2023-06-09T09:35:10.941904 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" initiating rebalance (COOPERATIVE) in state up (join-state init) with 0 assigned partition(s): unsubscribe
D, [2023-06-09T09:35:10.941950 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" received op TERMINATE in state up (join-state wait-join)
ruby: rdkafka_sticky_assignor.c:1898: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.

and the matching Kafka log from the same time:

karafka_21_kafka | [2023-06-09 07:35:07,160] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 1 (__consumer_offsets-31) (reason: Adding new member rdkafka-e023de05-5c28-4577-af61-bc3ac4217652 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-390196ef-cf8f-4d1a-a029-b64793413735] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 2 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-e023de05-5c28-4577-af61-bc3ac4217652] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,857] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 2 (__consumer_offsets-31) (reason: removing member rdkafka-e023de05-5c28-4577-af61-bc3ac4217652 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,857] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 3 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,375] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 3 (__consumer_offsets-31) (reason: Adding new member rdkafka-275f4343-0436-4314-899a-14bd98f1c20b with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,375] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 4 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,377] INFO [GroupCoordinator 1001]: Assignment received from leader for group node.rb-e4ef35230e0d_app for generation 4. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,378] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 4 (__consumer_offsets-31) (reason: Adding new member rdkafka-83086da6-e6b6-4744-9330-635edd095984 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,887] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-275f4343-0436-4314-899a-14bd98f1c20b] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,888] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 5 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,888] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-83086da6-e6b6-4744-9330-635edd095984] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,889] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 5 (__consumer_offsets-31) (reason: removing member rdkafka-83086da6-e6b6-4744-9330-635edd095984 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,889] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 6 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,399] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 6 (__consumer_offsets-31) (reason: Adding new member rdkafka-e237d590-7f41-49f2-9796-35cf02d0e6f1 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,400] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 7 (__consumer_offsets-31) with 2 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,404] INFO [GroupCoordinator 1001]: Assignment received from leader for group node.rb-e4ef35230e0d_app for generation 7. The group has 2 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,935] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-f177c5aa-5f0e-41b0-8044-276cbd5554bf] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 7 (__consumer_offsets-31) (reason: removing member rdkafka-f177c5aa-5f0e-41b0-8044-276cbd5554bf on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-e237d590-7f41-49f2-9796-35cf02d0e6f1] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 8 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,445] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 8 (__consumer_offsets-31) (reason: Adding new member rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,446] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 9 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,446] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 9 (__consumer_offsets-31) (reason: Adding new member rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:12,910] INFO [GroupCoordinator 1001]: Member rdkafka-d7596725-b61d-4386-a4e3-6b35e6994d75 in group node.rb-c0211bcb57d6_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:12,910] INFO [GroupCoordinator 1001]: Stabilized group node.rb-c0211bcb57d6_app generation 4 (__consumer_offsets-25) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:24,014] INFO [GroupCoordinator 1001]: Member rdkafka-d518effd-ace9-4f38-a3b2-7736948e2109 in group node.rb-e293c87d05db_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:24,015] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e293c87d05db_app generation 4 (__consumer_offsets-27) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,221] INFO [GroupCoordinator 1001]: Member rdkafka-bbfb59a4-11b8-4f61-a08e-fe34f54e25aa in group node.rb-1969f87a8fdf_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Stabilized group node.rb-1969f87a8fdf_app generation 2 (__consumer_offsets-48) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-2ecd8630-a28a-438b-ae4c-d1e094cf92bc] in group node.rb-1969f87a8fdf_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-1969f87a8fdf_app in state PreparingRebalance with old generation 2 (__consumer_offsets-48) (reason: removing member rdkafka-2ecd8630-a28a-438b-ae4c-d1e094cf92bc on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Group node.rb-1969f87a8fdf_app with generation 3 is now empty (__consumer_offsets-48) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,446] INFO [GroupCoordinator 1001]: Member rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69 in group node.rb-e4ef35230e0d_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,446] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 10 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,447] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,448] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 10 (__consumer_offsets-31) (reason: removing member rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,448] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 11 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Member rdkafka-775f3cc0-a2ac-4bd5-a23e-c47162d31486 in group node.rb-c0211bcb57d6_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-c0211bcb57d6_app in state PreparingRebalance with old generation 4 (__consumer_offsets-25) (reason: removing member rdkafka-775f3cc0-a2ac-4bd5-a23e-c47162d31486 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Group node.rb-c0211bcb57d6_app with generation 5 is now empty (__consumer_offsets-25) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Member rdkafka-83758cf9-a418-41cd-be06-adcdac81a094 in group node.rb-e293c87d05db_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e293c87d05db_app in state PreparingRebalance with old generation 4 (__consumer_offsets-27) (reason: removing member rdkafka-83758cf9-a418-41cd-be06-adcdac81a094 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Group node.rb-e293c87d05db_app with generation 5 is now empty (__consumer_offsets-27) (kafka.coordinator.group.GroupCoordinator)
mensfeld commented 1 year ago

It seems to be a duplicate of: https://github.com/confluentinc/librdkafka/pull/4252

Once 2.2.0 is released I will close this one and the rdkafka-ruby related one.

mensfeld commented 1 year ago

@emasab I can still reproduce it with 2.2.0:

rdkafka_sticky_assignor.c:2157: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.
emasab commented 1 year ago

Hello @mensfeld. This seems like a different thing: the assignor_state is NULL initially and it's only set after assignment, but its value isn't checked by the code before calling the function. I'm marking it as a bug to reproduce with a test and fix.
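
To illustrate the kind of guard this implies, here is a standalone mock-up (not the librdkafka source and not the eventual patch):

#include <stdlib.h>

/* Mock-up only: mimics the lifecycle described above. The state is
 * allocated only once an assignment has actually run, so a destroy
 * path reached before that must tolerate NULL instead of asserting. */
typedef struct {
        void *members; /* placeholder for the real state fields */
} mock_assignor_state_t;

static void mock_assignor_state_destroy(void *assignor_state) {
        mock_assignor_state_t *state = assignor_state;

        if (!state)
                return; /* consumer closed before its first assignment */

        free(state->members);
        free(state);
}

int main(void) {
        mock_assignor_state_destroy(NULL); /* must be a no-op, not an abort */
        return 0;
}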

mensfeld commented 1 year ago

@emasab thanks. So theoretically I should be able to mitigate it for now by always waiting for the first rebalance to finish on our side, right?

emasab commented 1 year ago

That should mitigate it. Thanks for the report!

mensfeld commented 1 year ago

Awesome. I will implement this strategy on our side then as a temporary measure.

mensfeld commented 1 year ago

@emasab

is NULL initially and it's only set after assignment.

Is it also set after an empty assignment? Can I assume that the moment rb_cb (the rebalance callback) kicks in, I'm good to go with the shutdown?

mensfeld commented 1 year ago

@emasab FYI I was able to repro + mitigate on my side. Thanks