Open alex-namely opened 5 years ago
i'm not sure based on the info provided. my next step would be to look at the broker logs for a clue as to why the commit is timing out.
@edenhill
Could it be caused by an issue described here: https://github.com/edenhill/librdkafka/issues/826#issuecomment-252901186 ?
We have been using _consumer.Commit(consumeResult)
which is synchronous, but I feel like committing every message might be the reason for the exceptions we have been experiencing.
Yes, that does sound likely.
You can verify this by setting Debug = "protocol"
in your config and monitoring the round trip time (rtt) of OffsetCommitRequests in the emitted debug logs.
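For example, with the .NET client something along these lines should surface the protocol debug output (broker address, group id and topic below are placeholders):

```csharp
using System;
using Confluent.Kafka;

class ProtocolDebugExample
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",  // placeholder
            GroupId = "my-consumer-group",        // placeholder
            Debug = "protocol"                    // emit request/response level debug logs
        };

        using var consumer = new ConsumerBuilder<Ignore, string>(config)
            // librdkafka debug output arrives via the log handler; look for the
            // OffsetCommit request/response lines and their reported rtt.
            .SetLogHandler((_, log) => Console.WriteLine($"{log.Level} {log.Facility} {log.Message}"))
            .Build();

        consumer.Subscribe("my-topic");           // placeholder
        while (true)
        {
            consumer.Consume(TimeSpan.FromSeconds(1));
            // ... normal processing/commit logic here ...
        }
    }
}
```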
@alex-namely Do you have a solution for this issue? We are getting the same thing at the moment...
@alex-namely @edenhill I'm running into the same issue with a very high throughput consumer that worked very well with the old librdkafka version 0.11.6; now with the latest version, 1.3.0, we are getting this regression.
Here is the consumer config:
"enable.auto.commit": true,
"auto.commit.interval.ms": 100,
"enable.auto.offset.store": false,
We manually store the offsets after successfully processing each message.
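(For reference, in the .NET client discussed in this issue the equivalent config plus manual offset store pattern would look roughly like the sketch below; broker address, topic and the processing step are placeholders.)

```csharp
using System;
using Confluent.Kafka;

class ManualOffsetStoreSketch
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",  // placeholder
            GroupId = "my-consumer-group",        // placeholder
            EnableAutoCommit = true,              // background commit of *stored* offsets
            AutoCommitIntervalMs = 100,
            EnableAutoOffsetStore = false         // we decide when an offset is safe to commit
        };

        using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
        consumer.Subscribe("my-topic");           // placeholder

        while (true)
        {
            var cr = consumer.Consume(TimeSpan.FromSeconds(1));
            if (cr == null) continue;             // poll timeout, nothing to do

            Process(cr.Message.Value);            // hypothetical processing step
            consumer.StoreOffset(cr);             // only stored offsets are auto-committed
        }
    }

    static void Process(string value) { /* placeholder */ }
}
```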
Librdkafka errors:
[thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60709ms, timeout #0)
%5|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60608ms, timeout #1)
%5|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60508ms, timeout #2)
%5|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60408ms, timeout #3)
%5|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60308ms, timeout #4)
%4|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out 8 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
@aouakki, @oleg-orlenko In my company we have been migrating everything to the much more stable Go-based client https://github.com/Shopify/sarama
@alex-namely - we see a lot of people migrating to the confluent go client from sarama for the same reason. the confluent go client is used heavily by some of the largest users of kafka. can't name names, but you're most likely using more than one product powered by it.
@aouakki - we're looking into this.
@aouakki If the commit rate is high and the commit rtt is higher than auto.commit.interval.ms, there will be a build-up of offset commits in flight to the broker, which will cause head-of-line blocking (HOLB), delaying Heartbeats to the point where the consumer is kicked out of the group.
It'd be interesting to see what the OffsetCommitResponse rtts are; you can see this by enabling "debug":"protocol"
on the consumer.
The auto.commit.interval.ms needs to be larger than the OffsetCommit rtt.
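For example, if the observed OffsetCommit rtt is a few hundred milliseconds, the interval should sit comfortably above it; in the .NET client's config this would look roughly like the following (values are illustrative only):

```csharp
using Confluent.Kafka;

// Illustrative values only: auto.commit.interval.ms just needs to comfortably
// exceed the observed OffsetCommit round trip time so commits don't queue up.
var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",  // placeholder
    GroupId = "my-consumer-group",        // placeholder
    EnableAutoCommit = true,
    AutoCommitIntervalMs = 5000,          // e.g. 5s when the commit rtt is ~500ms
    EnableAutoOffsetStore = false
};
```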
@mhowlett I confirm your point of view; we switched from Sarama to librdkafka a long time ago for performance and because Sarama doesn't keep up with new Kafka features.
@edenhill thanks for your reply.
Let me put some context into this:
One of the services I maintain (written in Go) is a heavy consumer. When I upgraded librdkafka to v1.3.0 it started lagging; digging in, I found that the app hangs periodically and then resumes consuming:
Looking at the above trace (almost 3 seconds of data from the prod app), there's exactly a 1s interval between each time the app starts processing.
The consumer is the first thing that gets unblocked by a sysexit.
Looking at that 1-second hang, I started searching through our configuration and found that auto.commit.interval.ms
was set to 1000ms, so I changed it to 100ms, which resulted in the problem described above.
Historically we ran into a similar problem with the old version 0.11.6 and the solution was to adjust socket.blocking.max.ms,
but I guess we should not worry about this since it was deprecated in v1.0.0.
I will get back to you with the protocol logs next week; I'll be glad to open a separate issue if you think this needs a new one.
i wonder if this is related: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1203
@mhowlett I don't think so, this is only on the consumer side. I have collected some librdkafka logs and it seems like the commit operation is taking too much time; the RTT is always around 400~500 ms. Any idea what properties can be tuned to mitigate this behaviour?
Thanks for the logs, it does not look like head-of-line blocking (HOLB) since there is only one outstanding commit request at a time on the GroupCoordinator connection.
Are you looking at the broker side request metrics? What is the CPU load on the client?
Nothing to mention on the broker side, all the metrics are normal. On the consumer side, I do see a drop in all the host metrics: CPU, memory and network in/out. Note that the broker version is 0.10.2.1.
Can you check the broker request metrics for OffsetCommit's TotalTime, et al.? Is there any difference in these metrics when using v0.11.6 vs v1.3.0?
I am using broker inter.broker.protocol.version=2.4-IV1 and Consume hangs forever with v1.3.0. I am able to connect with the Conduktor UI tool and I see I have data, but NOTHING works. I really could use some help.
@devgig Please reproduce this with Debug: "all"
enabled and provide us the logs.
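Something along these lines should capture the full debug output so it can be attached here (connection details are placeholders):

```csharp
using System;
using Confluent.Kafka;

class DebugAllCapture
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",  // placeholder
            GroupId = "my-consumer-group",        // placeholder
            Debug = "all"                         // full librdkafka debug output
        };

        using var consumer = new ConsumerBuilder<Ignore, string>(config)
            // Console.WriteLine is thread safe; redirect stdout to a file to collect
            // the logs, e.g. `dotnet run > rdkafka-debug.log`.
            .SetLogHandler((_, log) => Console.WriteLine($"{log.Level}|{log.Facility}| {log.Message}"))
            .Build();

        consumer.Subscribe("my-topic");           // placeholder
        try
        {
            // reproduce the hanging Consume here
            consumer.Consume(TimeSpan.FromSeconds(60));
        }
        finally
        {
            consumer.Close();
        }
    }
}
```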
Hi @edenhill! We're also hitting this issue. In this case it looks like HOLB, but I don't have metrics/logs on the consumer side to confirm it. We enabled "debug":"protocol"
but ~I couldn't find any log on the OffsetCommitResponse RTT~ (it is there, but not during the incident window). It is usually ~300ms. Also, where do I find the number of outstanding commit requests?
I said that it looks like HOLB because the consumer is set to auto.commit.interval.ms=1000
and on the broker side we find that the p99 of OffsetCommit's TotalTime goes up to 3 seconds when we see the issue. The timing aligns perfectly. The broker does not look loaded at all, though. What do you suggest we do to understand more about what the client is doing at that time? In the meantime, we'll increase auto.commit.interval.ms
to 5s.
Is this still an issue with the latest client?
Description
Hello, we have been using the latest Kafka client library (1.2.0) with default settings. Our typical Kafka topic consumption loop is to read an event and commit it, one by one. Recently we have noticed a lot of random
Broker: Unknown member
exceptions while committing event offsets. The logs say:
then this:
And finally this happens (because of rebalancing)
I'm wondering why we see this
preceding OffsetCommitRequest
if we just commit offsets one by one sequentially. Could you please help us figure out what is happening?
How to reproduce
NuGet packages installed:
<PackageReference Include="Confluent.Kafka" Version="1.2.0" />
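A simplified sketch of the consume-and-commit-per-message loop described above (topic, group, broker address and the processing step are placeholders, not the actual service code):

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

class CommitPerMessageLoop
{
    static void Main()
    {
        // Default settings otherwise, as described in the report.
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",  // placeholder
            GroupId = "my-consumer-group"         // placeholder
        };

        using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
        consumer.Subscribe("my-topic");           // placeholder

        while (true)
        {
            var cr = consumer.Consume(CancellationToken.None);
            Handle(cr.Message.Value);             // hypothetical processing step

            // One synchronous commit per consumed message: under load, each commit's
            // round trip adds up, which is the pattern discussed in this thread.
            consumer.Commit(cr);
        }
    }

    static void Handle(string value) { /* placeholder */ }
}
```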