confluentinc / confluent-kafka-dotnet

Confluent's Apache Kafka .NET client
https://github.com/confluentinc/confluent-kafka-dotnet/wiki
Apache License 2.0

OffsetCommitRequest timeout causes consumers rebalancing #1081

Open alex-namely opened 4 years ago

alex-namely commented 4 years ago

Description

Hello, we have been using the latest Kafka client library (1.2.0) with default settings. Our typical Kafka topic consumption loop reads an event and commits it, one by one. Recently we have noticed a lot of random Broker: Unknown member exceptions while committing event offsets.

The logs say:

{"Message":"[thrd:GroupCoordinator]: 
GroupCoordinator/3: Timed out HeartbeatRequest in flight (after 10963ms, timeout #0): possibly held back by preceeding OffsetCommitRequest with timeout in 48457ms",
"ClientInstance":"rdkafka#consumer-1","Facility":"REQTMOUT"} 

then this:

{"Message":"[thrd:GroupCoordinator]: 
GroupCoordinator/3: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests",
"ClientInstance":"rdkafka#consumer-1","Facility":"REQTMOUT"} 

And finally this happens (because of rebalancing)

Broker: Unknown member ---> 
Confluent.Kafka.KafkaException: Broker: Unknown member\n 
at Confluent.Kafka.Impl.SafeKafkaHandle.Commit(IEnumerable`1 offsets)\n at 
Confluent.Kafka.Consumer`2.Commit(ConsumeResult`2 result)

I'm wondering why we see this preceding OffsetCommitRequest if we just commit offsets one by one, sequentially.

Could you please help to figure out what is happening?

How to reproduce

NuGet packages installed: <PackageReference Include="Confluent.Kafka" Version="1.2.0" />

while (true)
{
    // Consume with a 500 ms timeout; Consume returns null when nothing arrives in time.
    var consumeResult = _consumer.Consume(TimeSpan.FromMilliseconds(500));
    if (consumeResult == null)
    {
        return;
    }

    // Synchronously commit the offset of this message before consuming the next one.
    _consumer.Commit(consumeResult);
}
mhowlett commented 4 years ago

i'm not sure based on the info provided. my next step would be to look at the broker logs for a clue as to why the commit is timing out.

alex-namely commented 4 years ago

@edenhill

Could it be caused by an issue described here: https://github.com/edenhill/librdkafka/issues/826#issuecomment-252901186 ?

We have been using _consumer.Commit(consumeResult), which is synchronous, but I feel like committing every message might be the reason for the exceptions we have been experiencing.
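As a hypothetical alternative (not something suggested in this thread; Process and the batch size of 100 are illustrative assumptions), the synchronous commit rate can be reduced by committing every N messages instead of after every message:

int uncommitted = 0;

while (true)
{
    var result = _consumer.Consume(TimeSpan.FromMilliseconds(500));
    if (result == null) continue;

    Process(result); // application-specific processing (assumed helper)

    // Commit only every 100th message; Commit(result) commits the offset
    // of that result's partition, covering the messages consumed before it.
    if (++uncommitted >= 100)
    {
        _consumer.Commit(result);
        uncommitted = 0;
    }
}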

edenhill commented 4 years ago

Yes, that does sound likely. You can verify this by setting Debug = "protocol" in your config and monitoring the round-trip time (rtt) of OffsetCommitRequests in the emitted debug logs.
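A minimal sketch of that suggestion with the .NET client (bootstrap servers and group id are placeholders):

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092", // placeholder
    GroupId = "my-group",                // placeholder
    Debug = "protocol"                   // log per-request protocol details, including OffsetCommit rtt
};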

oleg-orlenko commented 4 years ago

@alex-namely Do you have a solution for this issue? We are getting the same at the moment...

aouakki commented 4 years ago

@alex-namely @edenhill I'm running into the same issue with a very high-throughput consumer. It worked very well with the old librdkafka version 0.11.6; now, with the latest version 1.3.0, we are seeing this regression.

Here is the consumer config:

"enable.auto.commit":      true,
"auto.commit.interval.ms": 100,
"enable.auto.offset.store": false,

We manually store the offsets after successfully processing each message.
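The service in question is written in Go; the following is a rough .NET-client sketch of the same pattern (bootstrap servers, group id, topic name, and the processing helper are assumptions):

using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092", // placeholder
    GroupId = "my-group",                // placeholder
    EnableAutoCommit = true,             // enable.auto.commit
    AutoCommitIntervalMs = 100,          // auto.commit.interval.ms
    EnableAutoOffsetStore = false        // enable.auto.offset.store
};

using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
consumer.Subscribe("my-topic");          // placeholder topic

while (true)
{
    var result = consumer.Consume(TimeSpan.FromMilliseconds(500));
    if (result == null) continue;

    Process(result);                     // application-specific processing (assumed helper)

    // With enable.auto.offset.store=false, only offsets explicitly stored
    // here are picked up by the next automatic commit.
    consumer.StoreOffset(result);
}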

librdkafka errors:

[thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60709ms, timeout #0)
%5|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60608ms, timeout #1)
%5|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60508ms, timeout #2)
%5|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60408ms, timeout #3)
%5|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out OffsetCommitRequest in flight (after 60308ms, timeout #4)
%4|1583484135.978|REQTMOUT|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/1284: Timed out 8 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
alex-namely commented 4 years ago

@aouakki, @oleg-orlenko In my company we have been migrating everything to the much more stable Go-based client https://github.com/Shopify/sarama

mhowlett commented 4 years ago

@alex-namely - we see a lot of people migrating to the confluent go client from sarama for the same reason. the confluent go client is used heavily by some of the largest users of kafka. can't name names, but you're most likely using more than one product powered by it.

@aouakki - we're looking into this.

edenhill commented 4 years ago

@aouakki If the commit rate is high and the commit rtt is higher than auto.commit.interval.ms, there will be a build-up of in-flight offset commits to the broker, which causes head-of-line blocking (HOLB), delaying Heartbeats to the point where the consumer is kicked out of the group.

It'd be interesting to see what the OffsetCommitResponse rtts are; you can see this by enabling "debug":"protocol" on the consumer.

The auto.commit.interval.ms needs to be larger than the OffsetCommit rtt.
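For illustration (numbers assumed here, roughly matching the rtt reported later in this thread): with an OffsetCommit rtt of ~500 ms and auto.commit.interval.ms=100, about five commit requests are issued per round trip, so requests accumulate in flight on the GroupCoordinator connection and a Heartbeat queued behind them can easily exceed its timeout.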

aouakki commented 4 years ago

@mhowlett I confirm your point of view; we switched a long time ago from Sarama to librdkafka for performance and because Sarama doesn't keep up with new Kafka features.

@edenhill thanks for your reply.

Let me put some context into this:

One of the services I maintain (written in Go) is a heavy consumer. When I upgraded librdkafka to v1.3.0 it started lagging; digging in, I found that the app hangs periodically and then resumes consuming:

[screenshot: production trace showing the consumer repeatedly pausing and resuming at roughly 1 s intervals]

Looking at the above trace (almost 3 seconds of data from the prod app), there is exactly a 1 s interval between each time the app starts processing.

The consumer is the first thing that gets unblocked by a sysexit.

Looking at that 1-second hanging period, I searched our configuration and found that auto.commit.interval.ms was set to 1000 ms, so I lowered it to 100 ms, which resulted in the problem described above. Historically we ran into a similar problem with the old version 0.11.6 and the solution was to adjust socket.blocking.max.ms, but I guess we should not worry about that since it was deprecated in v1.0.0.

I will get back to you with the protocol logs next week; I'll be glad to open a separate issue if you think this needs a new one.

mhowlett commented 4 years ago

i wonder if this is related: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1203

aouakki commented 4 years ago

@mhowlett I don't think so; this is only on the consumer side. I have collected some librdkafka logs and it seems like the commit operation is taking too much time; the RTT is always around 400~500 ms. Any idea what properties can be tuned to mitigate this behaviour?

edenhill commented 4 years ago

Thanks for the logs. It does not look like head-of-line blocking (HOLB), since there is only one outstanding commit request at a time on the GroupCoordinator connection.

Are you looking at the broker side request metrics? What is the CPU load on the client?

aouakki commented 4 years ago

Nothing to mention on the broker side; all the metrics are normal. On the consumer side I do see a drop in all the host metrics: CPU, memory, and network in/out. Note that the broker version is 0.10.2.1.

edenhill commented 4 years ago

Can you check the broker request metrics for OffsetCommit's TotalTime, et al.? Is there any difference in these metrics between v0.11.6 and v1.3.0?
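For reference (standard Kafka broker JMX metric names, not quoted from this thread): the per-request latency is exposed as kafka.network:type=RequestMetrics,name=TotalTimeMs,request=OffsetCommit, with companion metrics such as RequestQueueTimeMs and LocalTimeMs that break down where the time is spent.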

devgig commented 4 years ago

I am using broker inter.broker.protocol.version=2.4-IV1 and Consume hangs forever with v1.3.0. I am able to connect with the Conduktor UI tool and I see I have data, but NOTHING works. I really could use some help.

edenhill commented 4 years ago

@devgig Please reproduce this with Debug: "all" enabled and provide us with the logs.

scanterog commented 3 years ago

Hi @edenhill! We're also hitting this issue. In this case it looks like HOLB, but I don't have metrics/logs on the consumer side to confirm this. We enabled "debug":"protocol" but ~I couldn't find any log on the OffsetCommitResponse RTT~ (it is there, but not during the incident window). It is usually ~300 ms. Also, where do I find the number of outstanding commit requests?

I said it looks like HOLB because the consumer is set to auto.commit.interval.ms=1000 and on the broker side we find that the p99 for OffsetCommit's TotalTime goes up to 3 s when we see the issue. The timing aligns perfectly. The broker does not look loaded at all, though. What do you suggest we do to understand more about what the client is doing at that time? In the meantime, we'll increase auto.commit.interval.ms to 5 s.

anchitj commented 3 weeks ago

Is this still an issue with the latest client?