olitomlinson opened 9 months ago
Going back to the original 2 log messages:
encountered a retriable error while publishing a subscribed message to topic tenant-created, err: Post \"http://127.0.0.1:8087/subscription/tenant-created\": context canceled
(tracked down to here)
Too many failed attempts at processing Kafka message: ds-applicationtenant-created/0/9 [key=]. Error: context canceled.
(tracked down to here)
What does `context canceled` mean as the detail here? It's not very helpful!
My guess, given the first log message, is that it's trying to relay the message to the app channel, but the app channel is not available for some reason. Maybe a connection to the app channel timed out?
So, assuming that the app channel goes away for some reason (maybe it's under load / busy and unable to serve any more requests for a period of time), what is the recourse here?
My assumption is that if the app channel can't serve any more requests, and therefore the retries have been exhausted, then there is little point in the sidecar remaining in the Kafka consumer group. My guess is it should gracefully eject itself from the consumer group and then begin a process of attempting to rejoin; otherwise the sidecar just gets stuck with no further redelivery attempts.
Another assumption is that while all this is happening, the Kafka client inside Dapr is still sending heartbeats to Kafka even though it's not attempting any further redeliveries, which is why the sidecar hasn't been forcefully ejected from the consumer group by the Kafka broker.
Looks like there are a bunch of undocumented options for Kafka that allow configuring the retry behavior: https://github.com/dapr/components-contrib/blob/79adc565c17ad8936048896591cd205a6609ad67/common/component/kafka/kafka.go#L123-L128

I don't see these documented anywhere, and it's unclear what the defaults are.
@ItalyPaleAle it defaults to true :)
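For reference, the retry options in the linked source appear to map onto component metadata fields. A sketch of how they might be set on the pubsub component; the field names are read from that code and the values here are illustrative assumptions, not documented defaults:

```yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: kafka-pubsub
spec:
  type: pubsub.kafka
  version: v1
  metadata:
  - name: brokers
    value: "kafka:9092"         # placeholder broker address
  - name: consumerGroup
    value: "ds-application"
  # Retry fields below correspond to the options in the linked kafka.go;
  # names taken from that code, values chosen for illustration only.
  - name: consumeRetryEnabled
    value: "true"
  - name: consumeRetryInterval
    value: "100ms"
```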
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.
do not close
This issue is pervasive in all of our pubsub components. Once a consumer errors out, nothing reinitializes that consumer.
But the same is also true for Azure EventHubs https://github.com/dapr/components-contrib/issues/3325
@berndverst Ouch!
At the very least if a consumer errors out I would expect the dapr process/sidecar to exit so that it can be recycled by whatever orchestration is in place (k8s restarting the pod, being the common example)
Having a stalled subscriber particularly for things like Kafka is very painful indeed. (more painful than a traditional transactional broker like Service Bus)
Not sure what is going on there. Not a low-hanging fruit after all. I will not be looking into this, but contributions are welcome.
@berndverst Is there any plan to address this issue? A stalled subscriber makes the Kafka connector practically unusable in a production environment. If the consumer errors out and doesn't recover automatically, it severely impacts reliability. At the very least, the sidecar process should make an `os.Exit` call when encountering a permanent error, allowing the pod to be restarted by Kubernetes. This would help maintain service availability until a permanent fix is implemented. Without this, the Kafka connector cannot be considered production-ready.
I've also seen that Dapr's Kafka consumer uses Sarama, which has related issues: https://github.com/IBM/sarama/issues/2621 https://github.com/IBM/sarama/issues/2682
Dapr runtime: 1.12.2, on EKS

We have a Kafka PubSub subscriber, working fine, then we got the following 2 error messages in the sidecar:
Processing Kafka message: ds-applicationtenant-created/0/9 [key=]
encountered a retriable error while publishing a subscribed message to topic tenant-created, err: Post \"http://127.0.0.1:8087/subscription/tenant-created\": context canceled
Too many failed attempts at processing Kafka message: ds-applicationtenant-created/0/9 [key=]. Error: context canceled.
Immediately after, the following logs were output (which look like the Kafka client attempting to reinitialise):
So, given that nothing had happened by this point, we decided to restart the subscribing pod.
Success! The message is consumed immediately; interestingly, it had expired (not sure if this is a red herring though).
So my question is: did the message expiring (via the TTL built into Dapr PubSub, `ttlInSeconds`) cause the entire sidecar to stop consuming from that partition?

My initial thinking was yes, it must be related to TTL... however, the processing error occurred at 12:04:57Z and the message didn't expire until 12:21:09Z.

So there is a 15 minute window there where the message had not yet expired, but the Kafka client was not doing anything. So if it's not the TTL, what is the problem?