In my case a similar problem happens when my instance is unable to connect to a broker (while testing, I am simply stopping the Kafka Docker container). This causes the Producer to retry sending the message over and over, locking up php-fpm in the process.
For normal cases when Kafka is online, it seems that the process is freed properly after the message is acknowledged.
I think that the problem lies in an "infinite" loop inside the `kafka_free()` function that is called when destructing the PHP object: https://github.com/arnaud-lb/php-rdkafka/blob/master/rdkafka.c#L85

The loop looks similar to the one in librdkafka's `rd_kafka_flush()` function: https://github.com/edenhill/librdkafka/blob/9b3fce7b882b43302fb983d0e0e555225e672f92/src/rdkafka.c#L3790 but `rd_kafka_flush()` checks not only whether the output queue is empty, it also checks for a timeout condition.

I think that polling in the destructor is not the best solution; it would be better to expose the `rd_kafka_flush()` function so that users can call it before the PHP application/script ends. The reason is that in order to do any error handling for undelivered messages, one needs to set the `dr_msg_cb` (message delivery callback), which gets called while polling. When the polling happens in the PHP object destructor, it is usually too late to do any error handling. For example, in my Symfony app I am unable to log the error using Monolog because its logger object has already been destroyed at that time.
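For context, the delivery callback is registered on the configuration object and only fires while `poll()` runs, which is exactly why destructor-time polling is too late for application-level logging. A minimal sketch (broker address and the `error_log()` call are placeholders, not part of my actual setup):

```php
<?php
$conf = new RdKafka\Conf();
$conf->set('metadata.broker.list', 'kafka:9092'); // example broker address

// Delivery report callback: invoked from poll() for every produced message.
$conf->setDrMsgCb(function (RdKafka\Producer $producer, RdKafka\Message $message) {
    if ($message->err !== RD_KAFKA_RESP_ERR_NO_ERROR) {
        // In a real app this would go through the application logger.
        error_log('Delivery failed: ' . rd_kafka_err2str($message->err));
    }
});

$producer = new RdKafka\Producer($conf);
```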
Currently, as a workaround, I added a custom `flush()` method to the `\Enqueue\RdKafka\RdKafkaProducer` class (because I am using the `\Enqueue\RdKafka` wrapper):
```php
public function flush(int $timeoutInMilliseconds): void
{
    // Make at least one non-blocking poll
    $this->vendorProducer->poll(0);

    $endTime = microtime(true) + $timeoutInMilliseconds / 1000;
    $remainingTimeInMilliseconds = $timeoutInMilliseconds;

    while ($this->vendorProducer->getOutQLen() > 0 && $remainingTimeInMilliseconds > 0) {
        $this->vendorProducer->poll($remainingTimeInMilliseconds);
        $remainingTimeInMilliseconds = (int) (($endTime - microtime(true)) * 1000);
    }

    if ($this->vendorProducer->getOutQLen() > 0) {
        throw new \RuntimeException('Flush timed out');
    }
}
```
Then I can call the `flush()` method on the `kernel.terminate` event, where I can do some error handling.
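For illustration, such a `kernel.terminate` hook could look roughly like the sketch below. The class name, the injected wrapper, and the 5-second timeout are assumptions for the example, not part of my actual code:

```php
<?php
use Psr\Log\LoggerInterface;
use Symfony\Component\EventDispatcher\EventSubscriberInterface;
use Symfony\Component\HttpKernel\KernelEvents;

class KafkaFlushSubscriber implements EventSubscriberInterface
{
    private $producer; // wrapper exposing the custom flush() shown above
    private $logger;

    public function __construct($producer, LoggerInterface $logger)
    {
        $this->producer = $producer;
        $this->logger = $logger;
    }

    public static function getSubscribedEvents()
    {
        return [KernelEvents::TERMINATE => 'onTerminate'];
    }

    public function onTerminate()
    {
        try {
            $this->producer->flush(5000); // give pending messages up to 5 s
        } catch (\RuntimeException $e) {
            // The logger is still alive here, unlike in the destructor.
            $this->logger->error('Kafka flush failed: ' . $e->getMessage());
        }
    }
}
```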
Thanks for the analysis @sustmi
The current behavior (poll until the queue is empty) is safe by default, as it will attempt to send the messages until they are actually sent. I think that this is a good behavior to retain.
Though I agree that there should be a way to quit earlier. If we expose `rd_kafka_flush()` and `rd_kafka_purge()`, it would be possible to do something like this:
```php
$rk->flush($timeout);
$rk->purge(); // After this, the destruction would be non-blocking
```
WDYT?
@arnaud-lb building on what was previously mentioned, I think it would be important that at least all the callbacks that can be triggered from poll can still be executed (so the application could handle errors). If this is not possible, I would actually get rid of the poll in the destructor, adjust all the examples, and really stress that we need to do flush and purge for a proper shutdown. If it is guaranteed that all the callbacks will still be triggered during the destruct and the app can still handle them, I think it would be ok to leave it in. What do you think?
Unfortunately, there is no guarantee that callbacks are going to be called during destruction, because the callbacks themselves might have been destroyed by the GC already at this point.
Here are options I can think of:
Option 1: Don't poll in the destructor, and let users poll/flush manually. If the user forgets to do so, messages are lost.
Option 2: Poll in the destructor, and let users poll/flush/purge manually before destruction if they need to ignore unflushed messages. This is the current behavior.
Option 3: Add a setting on `RdKafka\Producer` to choose between the two behaviors.
@arnaud-lb in my opinion we should go with Option 1 (we could always adapt if we see that too many people have problems, and add Option 2 again). My reason for Option 1 is that if the user does not poll/flush themselves, we can never be sure anyway whether poll will trigger a callback that goes unhandled, which could then result in lost messages as well. So I think it is not the extension's responsibility to take care of this, since it is very hard to accomplish in a good way. If you are ok with Option 1, I would of course adjust the README and docs to stress and show how a proper shutdown should be done.
> Option 1: Don't poll in the destructor, and let users poll/flush manually. If the user forgets to do so, messages are lost.
Simple solutions are probably best.
This option gives the programmer-user the most control, in my opinion, and should be preferred. It is explicit and shows exactly what is happening. The current behavior is difficult to explain to newcomers, since it involves "hidden" operations that may or may not occur (when a message cannot be delivered).
Actually, I believe it will prevent more issues than it might introduce, since what usually happens is that the programmer-user does no error checking and is eventually surprised when inevitable connectivity issues occur and PHP processes start to clog, retrying until destroyed and/or losing messages.
Enforcing proper error handling in user-land is the way to go for me.
Agreed, the current behavior is confusing. Let's go for option 1. We have to document this change, and release this in a new major version, though.
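For reference, the documented shutdown sequence could look roughly like the sketch below once `flush()`/`purge()` are exposed. This assumes they mirror librdkafka's semantics (integer error return, purge flags); it is not the final API:

```php
<?php
// Produce as usual, then drain before the script ends.
$topic = $producer->newTopic('my-topic'); // topic name is a placeholder
$topic->produce(RD_KAFKA_PARTITION_UA, 0, 'payload');
$producer->poll(0);

// Give outstanding messages up to 10 seconds to be delivered.
$err = $producer->flush(10000);

if ($err !== RD_KAFKA_RESP_ERR_NO_ERROR) {
    // Drop whatever is still queued so the destructor does not block,
    // and let the application decide how to report the loss.
    $producer->purge(RD_KAFKA_PURGE_F_QUEUE);
}
```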
Hi,
is there an update on the indefinite hangs in the destructor when no brokers are available?
I tried several options but can't seem to get past this issue... it's affecting my production environment at the moment when Kafka is restarting or all the brokers are down.
You can reproduce it very easily with the following code:
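(The original snippet is not shown here; a minimal reproduction along these lines, with a deliberately unreachable broker address, exhibits the hang:)

```php
<?php
// Reproduction sketch: nothing listens on this address, so the message never
// leaves the internal queue and the destructor keeps polling.
$conf = new RdKafka\Conf();
$conf->set('metadata.broker.list', '127.0.0.1:9099'); // unreachable on purpose

$producer = new RdKafka\Producer($conf);
$topic = $producer->newTopic('test');
$topic->produce(RD_KAFKA_PARTITION_UA, 0, 'hello');
$producer->poll(0);

// Script ends here; the Producer destructor polls until the out queue drains,
// which never happens while no broker is reachable.
```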
Afterwards, rdkafka will be stuck in the destructor loop:
Thanks! Regards J