Open abarisain opened 1 year ago
The issue could be linked to Node.js; see this pull request: https://github.com/nodejs/node/pull/48943
Nice, thanks. I'll check out node 21
Actually, Node 21 doesn't have nan support yet. On Linux, in my case, the code here is called by uv_async_io: https://github.com/Blizzard/node-rdkafka/blob/master/src/workers.h#L156. From there, WorkMessage calls HandleMessageCallback, which enters callback->Call, but the callback is a v8::PersistentBase whose value is nullptr. nan.h (line 1810) then carries the null pointer into v8-local-handle.h, and it crashes.
The race in libuv should be fixed in Node 20+, which carries libuv 1.45.0+ (atomic load), while Node 18 seems to be on an earlier version (busy-wait spin). However, this is not a Node issue. It is caused by the worker.WorkComplete() call added to NodeDisconnect in kafka-consumer.cc in 2.16.1, and it can be reproduced with a double disconnect, a pause followed by a disconnect, or a connect and disconnect at the same time.
Another note: if getMetadata fails, the client calls disconnect on its own here: https://github.com/Blizzard/node-rdkafka/blob/master/lib/client.js#L165, so if you call disconnect as well, you end up with a double disconnect. The cause could also be GC on the JavaScript side, since a v8::Persistent can be collected if the callback passed in goes out of scope.
The fix would be to check (callback && !callback->IsEmpty()) here: https://github.com/Blizzard/node-rdkafka/blob/master/src/workers.cc#L770, as that code can still run after worker->WorkComplete().
CC: @GaryWilber , @iradul
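Until the native check lands, the double disconnect described above can be avoided from the caller's side. Below is a minimal sketch, not the library's own fix: `makeSafeDisconnect` and `makeFakeClient` are hypothetical names, and the wrapper assumes the client exposes a Node-style `disconnect(callback)`. It only guards calls made by your own code; it cannot intercept the disconnect that client.js issues internally when getMetadata fails.

```javascript
// Hypothetical helper, not part of node-rdkafka: makes disconnect idempotent
// from the caller's side so a second call never reaches the native binding.
function makeSafeDisconnect(client) {
  let pending = null; // promise for the in-flight (or finished) disconnect

  return function safeDisconnect() {
    if (pending === null) {
      pending = new Promise((resolve, reject) => {
        // Assumption: disconnect takes a Node-style (err) callback.
        client.disconnect((err) => (err ? reject(err) : resolve()));
      });
    }
    // Every later call reuses the same promise instead of disconnecting again.
    return pending;
  };
}

// Stand-in client used only to exercise the helper; it counts disconnect calls.
function makeFakeClient() {
  const client = {
    disconnectCalls: 0,
    disconnect(cb) {
      client.disconnectCalls += 1;
      setImmediate(cb, null); // complete asynchronously, like a real client
    },
  };
  return client;
}
```

With a real consumer you would call `const safeDisconnect = makeSafeDisconnect(consumer)` once and use `safeDisconnect()` everywhere instead of `consumer.disconnect()`.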
Environment Information
Steps to Reproduce
Hello,
I'm writing automated tests with Playwright and need to access Kafka from them. This implies connecting to and disconnecting from multiple brokers many times during the life of a process.
My consumer wrapper returns a promise that resolves once a message has been consumed on any partition. Before resolving, I disconnect from the brokers as the connection will not be reused.
When I try consuming multiple topics from multiple brokers at once and disconnect, I experience crashes:
(I can't seem to be able to get unmangled symbols but I'd be happy to help)
It's a tough one to reproduce, as it seems to only happen if you have `resolve` AND `consumer.disconnect()` in the `consumer.on("message")` callback. Moving stuff around seems to fix the issue, but unfortunately I can't really organize my code in another way, nor ask people who write tests to be careful about their sequence. It really looks like a race condition. Pausing consumption or unassigning topics doesn't seem to work either.
I tried waiting for the "disconnected" event before resolving the promise to make sure, but that did not work either. Plus, it sometimes seems to fail to disconnect and hangs.
The only workaround I found is to tolerate a memory leak by going into `node-rdkafka/src/kafka-consumer.cc` and commenting this out:
Not ideal, but test processes don't stay up for long so it's alright.
I have not yet tried to reproduce this on a Linux machine (I will only be able to test on ARM Linux) or on brokers that are not part of our development environment.
I did manage to isolate a small repro case outside of Playwright. Pay no attention to most of `startConsuming`, which is there to assign myself all partitions of a topic. You might need to give it a couple of attempts as it will sometimes just work.
package.json:
index.js: