Make sure that you have destroyed all other objects, topics, messages, etc., before calling destroy().
I have printouts that show the deletion order of librdkafka handles (I paired up every handle with its proper destroy call using std::unique_ptr deleters):
Invoked topic deleter.
Invoked producer deleter
Producer_deleter done
Invoked topic partition list deleter.
Invoked consumer deleter.
Closing consumer connections...
Rebalance triggered.
revoked:
kf_topic_partition_list: [topic: ConsumeKafkaTest, partition: 0, offset: -1001]
Attempting to destroy consumer...
(stuck here)
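For reference, the handle/deleter pairing mentioned above can be sketched roughly like this (the type names here are illustrative and simplified, not my exact code):

```cpp
// Illustrative sketch: tying each librdkafka C handle to its proper
// destroy function through std::unique_ptr deleters.
#include <librdkafka/rdkafka.h>
#include <memory>

struct ConsumerDeleter {
  void operator()(rd_kafka_t* rk) const { rd_kafka_destroy(rk); }
};
struct TopicDeleter {
  void operator()(rd_kafka_topic_t* rkt) const { rd_kafka_topic_destroy(rkt); }
};
struct TopicPartitionListDeleter {
  void operator()(rd_kafka_topic_partition_list_t* tpl) const {
    rd_kafka_topic_partition_list_destroy(tpl);
  }
};

using ConsumerPtr = std::unique_ptr<rd_kafka_t, ConsumerDeleter>;
using TopicPtr = std::unique_ptr<rd_kafka_topic_t, TopicDeleter>;
using TopicPartitionListPtr =
    std::unique_ptr<rd_kafka_topic_partition_list_t, TopicPartitionListDeleter>;
```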
Could you try the latest librdkafka master branch? We have this fix, 650799afdaf8c7c810f16df5b902b809fec17ffd, which is of interest.
I tried the latest commit and the issue is still present. Also, on the current HEAD some of my tests fail because the consumer rebalance is not triggered the first time rd_kafka_consumer_poll is called.
@edenhill, @hunyadi-dev, I am hitting a similar issue: one of the broker threads blocks during destroy on refcnt 3, but its state is INIT.
Level: Debug, Message: [thrd:25.107.195.133:9092/15]: 25.107.195.133:9092/15: Handle is terminating in state INIT: 3 refcnts (0000020077693340), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
@hunyadi-dev
The message in the "Resetting offsets manually" loop was not destroyed; destroying it (and increasing the poll timeout so a rebalance always happens) fixed the hang.
This was on master.
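In code terms, the fix amounts to destroying every polled message, roughly like this sketch (handle() stands in for application-specific processing, which is not shown here):

```cpp
#include <librdkafka/rdkafka.h>
#include <atomic>

// Roughly the shape of the fix: every message returned by
// rd_kafka_consumer_poll() must be destroyed, including on the
// offset-reset path; otherwise rd_kafka_destroy() hangs on its refcnt.
void poll_loop(rd_kafka_t* rk, std::atomic<bool>& running,
               void (*handle)(rd_kafka_message_t*)) {  // hypothetical handler
  while (running) {
    rd_kafka_message_t* msg = rd_kafka_consumer_poll(rk, 1000 /* timeout ms */);
    if (msg == nullptr)
      continue;  // poll timed out; nothing to release
    handle(msg);                    // application-specific processing
    rd_kafka_message_destroy(msg);  // release the reference on the instance
  }
}
```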
@ajbarb Please try to reproduce on the latest master, make sure that all outstanding objects (messages in particular) are destroyed before destroying the consumer, and if it is still an issue, please provide a reproducible test case.
I am also observing this problem in a rather regular manner (currently testing on rdkafka 1.9.2, but I saw it on some older versions too).
I haven't managed to create a simple example yet, unfortunately (for one reason or another I fail to reproduce it in simple examples, and the actual problem happens in a rather complex app), but some observations:
I started to see it more often once I began compiling my code with sanitizers (ASAN+LSAN), and even more often once I started setting ASAN_OPTIONS=fast_unwind_on_malloc=0. Both things are likely not directly related (especially considering rdkafka itself is not built with the sanitizer), but they have a simple effect I have also noticed in other cases: thread dynamics change, and various things (malloc/free!) are slower and much more likely to allow a context switch.
IIRC I never got it with a "pure consumer"; I get it in an app which uses (within a single process) a few producers and a single consumer.
It may or may not matter that my code uses a temporary topic which is removed earlier.
Picture of the stuck process in the debugger:
My main thread is deleting the RdKafka::Consumer object, which results in a rd_kafka_destroy_flags call that finally waits on thrd_join here: https://github.com/edenhill/librdkafka/blob/v1.9.2/src/rdkafka.c#L1105 (I suppose it waits for the next one).
The rdk:main thread is waiting in rd_kafka_destroy_internal here: https://github.com/edenhill/librdkafka/blob/v1.9.2/src/rdkafka.c#L2118, which in turn also waits on some thrd_join: https://github.com/edenhill/librdkafka/blob/v1.9.2/src/rdkafka.c#L1268 (I suppose the next one).
The rdk:broker100 thread is the only one which does something: it just repeatedly calls rd_kafka_broker_serve and logs Handle is terminating in state INIT: 3 refcnts once per millisecond. In general it repeatedly runs https://github.com/edenhill/librdkafka/blob/v1.9.2/src/rdkafka_broker.c#L5261 (I tried setting a breakpoint; it just loops here repeatedly calling that, gets no results, then below it logs here https://github.com/edenhill/librdkafka/blob/v1.9.2/src/rdkafka_broker.c#L5402 and … enters the loop from scratch).
So the process logs "Handle is terminating in state INIT" indefinitely (I once left it for 4 days accidentally) and nothing else happens.
The latter may be of some importance: rd_kafka_terminating returns true but doesn't break the loop, while rd_kafka_broker_terminating returns false as the handle is in state INIT, so the loop continues.
I will try to grab some detailed debugging.
One general idea before digging deeper: perhaps rd_kafka_broker_thread_main could detect that it remains in that state (rd_kafka_terminating returns true) for a long time and finally break in such a case, as a safety measure, even if rd_kafka_broker_terminating returns false.
PS: I don't think my code is leaking something, at least not that I am aware of. I am rather dogmatic about using smart pointers and RAII objects, and I also run simpler apps based on the same library and approach under the sanitizer, which doesn't report leaks.
I attach the full log of such a stuck process (it is formatted using my custom format, which may be of use as it shows timestamps and thread numbers, but the text is straight from rdkafka). This process produced some messages (to permanent topics, those APCTST*), consumed some (from a temporary topic apctst-replyq-… which it creates and removes; in general this is some RPC emulation) and tried to shut down.
Some observations: …
… the "source of evil" is perhaps the thread change from DOWN back to INIT. It happens twice; two threads log Broker changed state DOWN -> INIT:
One of them (140102097798720) later logged Received TERMINATE op in state TRY_CONNECT, then Broker changed state TRY_CONNECT -> DOWN, and it looks like it happily ended.
The other one is the problematic one, 140102064227904; it never got such a notification and simply stays stuck in this INIT state.
If I grep correctly, this change to INIT must have happened here: https://github.com/edenhill/librdkafka/blob/v1.9.2/src/rdkafka_broker.c#L5277
If I understand correctly, we go back from DOWN to INIT because there are some remaining refcnts, but I have no clue what those could be.
My suggestion: add some time limit for such a state (rd_kafka_terminating true but rd_kafka_broker_terminating false). If it lasts, say, a minute (preferably configurable), give up and end the broker in spite of the extra refcnts.
Then a) the app won't be stuck forever anymore, and b) it will be possible to detect what exactly was leaked via sanitizer/valgrind/… (or, if a leaked ref causes a crash, it will also be possible to detect what was accessed too late).
(This suggestion is valid even if I finally find some leak of my own somewhere; such errors generally happen, and … leak detectors usually require apps to finish before they can report leaks.)
patch.txt (or the same in an easier-to-read form: https://gist.github.com/Mekk/ed05c9fc95196bfb4aac5629375295e0)
A rough patch implementing the idea above (it breaks the broker thread loop in case more than a minute passes since rd_kafka_terminating first returned true). I don't think it is applicable as-is, but with some tuning (like making this timeout configurable) it may be reasonable. The diff was made against v1.9.2.
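The core idea, stripped of librdkafka internals, is just a deadline-based safety valve around the serve loop. A self-contained sketch of the pattern (illustrative names and stubs, not the actual diff):

```cpp
#include <chrono>
#include <thread>

// Illustrative stand-ins for rd_kafka_terminating() /
// rd_kafka_broker_terminating() and one serve-loop iteration.
static bool instance_terminating() { return true; }  // stub: instance is closing
static bool broker_terminating() { return false; }   // stub: refcnts never drop
static void serve_once() {
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

int main() {
  using clock = std::chrono::steady_clock;
  clock::time_point terminating_since{};                   // unset until first seen
  const auto emergency_timeout = std::chrono::minutes(1);  // should be configurable

  while (!broker_terminating()) {
    if (instance_terminating()) {
      if (terminating_since == clock::time_point{})
        terminating_since = clock::now();
      else if (clock::now() - terminating_since > emergency_timeout)
        break;  // safety valve: stop waiting for refcnts that never drop
    }
    serve_once();
  }
}
```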
With rdkafka patched this way I hit the problem a few times, and in all of them the app finished cleanly after this minute. What is important: the sanitizer didn't detect any memory leaks, which may confirm my claim that I don't leak anything (unless rdkafka makes some bulk recursive resource release).
For comparison, here is an example log from the patched rdkafka (the final part shows a rather clean completion). As I said, the app was run under ASAN+LSAN and no memory leaks were reported. I can't exclude the possibility that my app still had some variable released later (I will look for that), but it isn't anything trivial (unless my logging callback counts, but this one really should work to the very end).
To summarize, I'd really suggest applying a change of this kind. Even if it is the app's fault (I am not sure, but maybe), the current behaviour is very unfriendly: one gets an indefinitely stuck app with no realistic way to find out what the problem is.
Eureka!
Looks like in my case the problem was caused by an RdKafka::Message object which sometimes was still alive (the app works as a pipeline: one thread consumes incoming messages, other threads process them). So, well, it was "my fault".
But I was able to find this out only after applying the patch above, as only then could I proceed, see (my) logs emitted "afterwards", and set breakpoints. Earlier it was practically impossible to debug the situation. So, in the interest of people facing a similar problem, I'd really recommend applying such a safety valve, or a similar one.
Perhaps it would also make sense to make those refcnts a bit more fine-grained (count them separately per type): if a stuck app logged "3 refcnts: 2 messages, 1 connection" instead of "3 refcnts", it would be much easier to look for the possible cause of the problem.
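Just to illustrate what I mean, a purely hypothetical sketch (nothing from rdkafka itself):

```cpp
#include <atomic>
#include <cstdio>

// Hypothetical per-type reference accounting, so a stuck termination could
// log "3 refcnts: 2 messages, 1 connection" instead of an opaque total.
enum RefType { REF_MESSAGE, REF_CONNECTION, REF_OTHER, REF_TYPE_COUNT };

struct TypedRefcount {
  std::atomic<int> by_type[REF_TYPE_COUNT];

  TypedRefcount() {
    for (auto& c : by_type) c.store(0);
  }

  void acquire(RefType t) { by_type[t].fetch_add(1); }
  void release(RefType t) { by_type[t].fetch_sub(1); }

  void log_remaining() const {
    int msgs = by_type[REF_MESSAGE].load();
    int conns = by_type[REF_CONNECTION].load();
    int other = by_type[REF_OTHER].load();
    std::fprintf(stderr, "%d refcnts: %d messages, %d connections, %d other\n",
                 msgs + conns + other, msgs, conns, other);
  }
};
```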
PS: The fact that the mere existence of a received message object keeps the connection from being closed is also far from obvious. Is it really necessary?
Glad you found it!
I agree that troubleshooting this should be easier and that the debug logs are not very helpful; remaining object counts would indeed help.
As for allowing asymmetrical destruction of objects: no, this would open an absolute can of worms.
There is a strict contract that librdkafka is completely done with the client instance once rd_kafka_destroy() returns.
If we allowed objects referencing the client instance to stay alive after rd_kafka_destroy() returns, that contract would be broken, and it would be very hard to reason about correctness.
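Concretely, a consumer teardown that honors this contract looks roughly like the sketch below (error handling omitted; the outstanding objects shown are just examples):

```cpp
#include <librdkafka/rdkafka.h>

// Sketch of a teardown that honors the contract: every outstanding object
// referencing the instance is destroyed before rd_kafka_destroy() is called.
void shutdown_consumer(rd_kafka_t* rk,
                       rd_kafka_message_t* outstanding_msg,
                       rd_kafka_topic_partition_list_t* assignment) {
  if (outstanding_msg)
    rd_kafka_message_destroy(outstanding_msg);  // no message may outlive rk
  if (assignment)
    rd_kafka_topic_partition_list_destroy(assignment);
  rd_kafka_consumer_close(rk);  // leave the consumer group cleanly
  rd_kafka_destroy(rk);         // only now; with live references it blocks
}
```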
@edenhill, would the same apply to the JSON string that is published in the statistics cb? Could its presence also block the destroy?
That particular callback has a return value for managing memory ownership: return 1 if you want to own the memory and free it (with rd_kafka_mem_free(NULL, ptr)) at your own discretion.
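For example, a sketch of a callback that takes ownership (registered via rd_kafka_conf_set_stats_cb(); conf setup elided):

```cpp
#include <librdkafka/rdkafka.h>
#include <cstdio>

// Statistics callback that takes ownership of the JSON buffer.
// Returning 1 tells librdkafka not to free `json`; the application is then
// responsible for releasing it with rd_kafka_mem_free().
static int stats_cb(rd_kafka_t* rk, char* json, size_t json_len, void* opaque) {
  (void)rk;
  (void)opaque;
  std::printf("stats (%zu bytes)\n", json_len);
  rd_kafka_mem_free(NULL, json);  // free at our own discretion; here, right away
  return 1;                       // we took ownership (and already freed it)
}

// Registration (conf creation elided):
//   rd_kafka_conf_set_stats_cb(conf, stats_cb);
```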
Once we know we are stuck (and if a couple of minutes have passed since we started to close the connection, we can't proceed in any way, and nothing is happening, then we know that we are stuck and no change of this state can be expected), generally any behaviour would be more helpful, including terminate(), kill(self, -9), _exit(1) or anything. Although allowing the app to proceed with the destruction (after a sufficiently alarming warning) is friendlier.
A crashed program will be restarted. A stuck program will stay stuck: it usually won't work, but it will remain there, possibly for a long time, especially in a complicated cloud setup with no direct admin supervision.
So I really suggest that this scenario (a close/destroy of the handle with some remaining refcnts which don't disappear in spite of a prolonged wait) deserves reconsideration. At the moment rdkafka detects this scenario (OK, almost detects it; full detection is very close, as my patch illustrates) and opts to resolve it with the equivalent of while(true) { sleep(shorttime); }, which is in fact very unfriendly towards both the admin (who may face stuck processes which don't log any problems but simply don't stop) and the developer (who, after managing to reproduce the problem and locate it in a debugger, is left with two threads infinitely waiting on thread joins and an unclear rdkafka thread looping).
Of course my patch is oversimplistic and should be improved (for example by checking that really nothing changes, i.e. any change of refcnt or any event resets the deadline; by the aforementioned configurability of this emergency timeout; by a better warning/error message; maybe by some additional criteria; maybe by a more brutal way to act), but it simply breaks this fatal loop.
My scenario (an attempt to close the consumer object while some not-yet-destroyed message still exists somewhere) isn't that exotic or unlikely, especially considering the strong asynchronicity of things. The scenario of an accidental leak of a message object may also happen (such problems are relatively easy to find with valgrind or LSAN … but only if the program finishes).
OK, I am probably starting to repeat myself, so that's it. But please consider those cases and the reaction to them.
Description
When using the C API to implement a Kafka consumer, I think I am following the required termination sequence properly, but when I call rd_kafka_destroy on my consumer handle, it tends to hang and never return. When enabling the debug: all setting, these are the last few logs before the issue happens:
How to reproduce
I prepared a GitHub repo here with a small example project: https://github.com/hunyadi-dev/librdkafka_demo
Steps to build and run:
Checklist
Librdkafka version: v1.5.0 (tried v1.5.2 as well; the issue is present there too)
Apache Kafka version: 2.6.0 (Commit: 62abe01bee039651)
Operating system: macOS Catalina, version 10.15.7 (19H2)
Broker log excerpt:
Critical issue: no