It seems to be a duplicate of: https://github.com/confluentinc/librdkafka/pull/4252
Once 2.2.0 is released I will close this one and the rdkafka-ruby related one.
@emasab I can still reproduce it with 2.2.0:
rdkafka_sticky_assignor.c:2157: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.
Hello @mensfeld. This seems like a different issue: the assignor_state is NULL initially and is only set after assignment, but its value isn't checked by the code before calling the function. I'll mark it as a bug, to reproduce with a test and fix.
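For illustration only, a guard of roughly this shape would avoid destroying a state that was never set. The function name comes from the assert message; the signature and body are assumptions, not the actual librdkafka code or patch:

```c
/* Illustration only: not the actual librdkafka patch.
 * The function name is taken from the assert message; the signature
 * is assumed to match the assignor state-destroy callback shape. */
static void rd_kafka_sticky_assignor_state_destroy (void *assignor_state) {
        if (!assignor_state)
                return; /* no assignment has happened yet: nothing to free */

        /* ... free the sticky assignor state as before ... */
}
```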
@emasab thanks. So theoretically I should be able to mitigate it for now by always waiting for the first rebalance to finish on our side, right?
That should mitigate it. Thanks for the report!
Awesome. I will implement this strategy on our side then as a temporary measure.
@emasab
> is NULL initially and it's only set after assignment.

Is it also set after an empty assignment? Can I assume that the moment rb_cb kicks in, I'm good to go with the shutdown?
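For reference, a minimal C-level sketch of that workaround (the real mitigation lives in rdkafka-ruby; the flag name, the polling loop, and the `safe_shutdown` helper are illustrative):

```c
/* Sketch: only start shutdown once the rebalance callback has fired
 * at least once. Flag name and loop are illustrative. */
#include <librdkafka/rdkafka.h>

static volatile int first_rebalance_done = 0;

static void rebalance_cb (rd_kafka_t *rk, rd_kafka_resp_err_t err,
                          rd_kafka_topic_partition_list_t *parts,
                          void *opaque) {
        rd_kafka_error_t *error;
        (void)opaque;

        /* Cooperative-sticky: apply incremental (un)assignments ourselves. */
        if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS)
                error = rd_kafka_incremental_assign(rk, parts);
        else
                error = rd_kafka_incremental_unassign(rk, parts);
        if (error)
                rd_kafka_error_destroy(error);

        first_rebalance_done = 1; /* from here on, close/destroy is assumed safe */
}

/* Shutdown side: poll until the flag is set, then close and destroy. */
static void safe_shutdown (rd_kafka_t *rk) {
        while (!first_rebalance_done) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}
```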
@emasab FYI I was able to repro + mitigate on my side. Thanks
Hey,
This is an expansion of the report made here: https://github.com/confluentinc/librdkafka/issues/4308 - I created a separate issue because I don't have edit rights to the original one, and this info applies not only to the shutdown of the Ruby process but also to any close attempt on the consumer during a cooperative-sticky rebalance.
If you consider it part of https://github.com/confluentinc/librdkafka/issues/4308 please merge them :pray:
How to reproduce
Here is the simplified code I used to reproduce this. It reproduces in 100% of cases:
The close code we use in Ruby follows the expected flow: we first close the consumer and then destroy it.
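The original snippet was Ruby (rdkafka-ruby); a minimal C equivalent of the same flow, subscribing with the cooperative-sticky strategy and closing/destroying while the first rebalance is still in progress, might look like the sketch below. Broker address, group id, and topic name are placeholders:

```c
#include <librdkafka/rdkafka.h>

int main (void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Placeholder broker/group/topic values for the sketch. */
        rd_kafka_conf_set(conf, "bootstrap.servers", "127.0.0.1:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "repro-group",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "partition.assignment.strategy",
                          "cooperative-sticky", errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        rd_kafka_poll_set_consumer(rk);

        rd_kafka_topic_partition_list_t *topics =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(topics, "repro-topic",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);

        /* Close immediately: the group join / first cooperative rebalance
         * is still in flight at this point. */
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
        return 0;
}
```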
Every single execution ends up with this:
I confirmed this behavior exists in the following librdkafka versions:
- 2.1.1
- 2.0.2
- 1.9.2
- [x] librdkafka version (release number or git tag): 2.1.1 and all mentioned above
- [x] Apache Kafka version: 2.8.1 and 3.4.0 from bitnami
- [x] librdkafka client configuration: presented in the above code snippet
- [x] Operating system: Linux 5.4.0-146-generic #163-Ubuntu SMP Fri Mar 17 18:26:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- [x] Provide logs (with `debug=..` as necessary) from librdkafka
- [x] Provide broker log excerpts
- [x] Critical issue - I would consider it critical, because the race condition is not mentioned in the termination docs (https://github.com/confluentinc/librdkafka/blob/master/INTRODUCTION.md#termination) and, on the contrary, they state that "There is no need to unsubscribe".
Now let's dive deeper:

- Running `rd_kafka_unsubscribe` and waiting out the rebalance mitigates this (see the sketch after this list). This works well, however it is not reliable and drastically increases the time needed to shut down the consumer, as there is no direct way (aside from maybe using metrics via poll) to establish that the consumer is not under a rebalance exactly at the moment of running `rd_kafka_destroy`. The wait is needed despite `rd_kafka_assignment` returning no assignments: it seems that post revocation but prior to re-assignment the TPL is empty, which gives us the false impression that no TPLs are (or will be) assigned.
- `rd_kafka_consumer_close` also partially mitigates this, because `rd_kafka_consumer_close` will unsubscribe automatically. This may help long-living consumers (ref: https://github.com/appsignal/rdkafka-ruby/issues/254), however it does not solve the problem for short-lived consumers (I don't know why) that are in the middle of getting their first assignment.
- The crash also occurs once the `rdkafka_sticky_assignor` is already created: if we attempt to close and destroy the consumer at that point, the crash happens as well. This is less likely because there is only a short window between the initialization of the `rdkafka_sticky_assignor` and its handover to the rebalance callback, however the issue persists.
- Long-living consumers usually have a settled assignment by the time of `close`, hence the probability of being in the rebalance state is lower (though it can happen).
- The assertion is not present in `rdkafka_assignor.c` nor in `rdkafka_roundrobin_assignor`.
- Skipping `rd_kafka_destroy` on a process that is anyhow going to be closed (long-running processes under shutdown) can also partially mitigate this (9/10 times).
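As referenced in the first item above, a rough sketch of the unsubscribe-and-wait mitigation. The fixed polling window and the `mitigated_shutdown` helper name are illustrative, and the arbitrary wait is exactly the unreliable part:

```c
/* Rough sketch of the "unsubscribe and wait" mitigation.
 * The fixed polling window is a guess, which is why this is only a
 * partial mitigation: there is no API that says "no rebalance is in
 * flight right now". */
static void mitigated_shutdown (rd_kafka_t *rk) {
        rd_kafka_topic_partition_list_t *assignment = NULL;

        rd_kafka_unsubscribe(rk);

        /* Keep serving callbacks while an in-flight cooperative
         * rebalance settles; ~5 seconds here is arbitrary. */
        for (int i = 0; i < 50; i++) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }

        /* Note: an empty assignment here is NOT proof that no
         * (re)assignment is about to arrive. */
        if (!rd_kafka_assignment(rk, &assignment) && assignment)
                rd_kafka_topic_partition_list_destroy(assignment);

        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}
```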
Suggested fix
The consumer should probably wait for the rebalance to finish before fully closing itself, however this may introduce a potential closing lag on a long-running rebalance. The second option would be to drop out of the consumer group and just let the rebalance proceed, but I have no idea what the effect of that would be on the consumer group.
Logs
Here is the `debug` `all` info tail (if you need more, just ping me, I can generate it on the spot):

And the Kafka log matching this time: