Callback to refresh OAuth token not invoked at all if a newly created consumer performs QueryWatermarkOffsets as its first operation

Open · jgn-epp opened 1 month ago

- Case 1: `builder.SetOAuthBearerTokenRefreshHandler` → `builder.Build` → `consumer.QueryWatermarkOffsets` → `Confluent.Kafka.KafkaException: Local: All broker connections are down`
- Case 2: `builder.SetOAuthBearerTokenRefreshHandler` → `builder.Build` → `consumer.Consume` → `consumer.QueryWatermarkOffsets` → works as expected
- Case 3: `config.SaslOauthbearerMethod = SaslOauthbearerMethod.Oidc` → `consumer.QueryWatermarkOffsets` → works as expected

> Can you please provide some debug logs from the time this issue happened?
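A minimal sketch of the failing Case 1 flow, for context. The broker address, topic, group id, and the `FetchToken` helper are assumptions for illustration, not part of the original report:

```csharp
using System;
using Confluent.Kafka;

class Repro
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "broker:9093",   // hypothetical broker
            GroupId = "repro-group",            // hypothetical group
            SecurityProtocol = SecurityProtocol.SaslSsl,
            SaslMechanism = SaslMechanism.OAuthBearer,
        };

        using var consumer = new ConsumerBuilder<Ignore, byte[]>(config)
            .SetOAuthBearerTokenRefreshHandler((client, cfg) =>
            {
                // Per the report, this handler is never invoked when
                // QueryWatermarkOffsets is the consumer's first operation.
                var (token, expiresAtMs) = FetchToken(); // hypothetical token source
                client.OAuthBearerSetToken(token, expiresAtMs, "principal");
            })
            .Build();

        // Throws KafkaException: "Local: All broker connections are down",
        // since no token was ever set and no broker connection succeeds.
        var offsets = consumer.QueryWatermarkOffsets(
            new TopicPartition("my-topic", 0), TimeSpan.FromSeconds(10));
    }

    static (string, long) FetchToken() =>
        ("<token>", DateTimeOffset.UtcNow.AddMinutes(5).ToUnixTimeMilliseconds());
}
```

Calling `Consume` first (Case 2) apparently forces the connection setup that triggers the refresh handler, which is why the same `QueryWatermarkOffsets` call then succeeds.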
Description
We are using SASL OAuth for authentication and most of the time it works fine. However, occasionally the token expires and the refresh function (passed into the `SetOAuthBearerTokenRefreshHandler` method) does not get fired, leaving the consumer stuck in an error state where the same JWT_EXPIRED error is emitted over and over. Once the consumer gets into this state, it never tries to recover by refreshing its token again; the consumer is effectively dead and the application has to be restarted.

We have multiple environments, and this seemingly only happens for some consumers in environments with low traffic on the topic; at least, I have never seen the issue in environments with plenty of traffic. So maybe the partition count is too high relative to the message throughput, causing the consumer to block in the `Consume` call long enough that the OAuth refresh function isn't invoked before the current token expires? That is at least my guess, because once the issue happens, the partitions assigned to the now-dead consumer are reassigned to other consumers, and those consumers then seemingly have enough traffic to keep running without hitting the issue. Originally we were using the `Consume(CancellationToken)` overload, but we also tried the `Consume(TimeSpan)` overload with a 30-second timeout, and eventually the issue still occurred.

So there definitely seems to be a bug here causing the OAuth token refresh handler not to fire when the `Consume` call blocks for too long. I would also expect that when the consumer starts getting the JWT_EXPIRED error, it should default to trying to refresh its OAuth token instead of staying in an error state and never recovering.

We are using version 2.4.0 of the Confluent.Kafka nuget.
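For reference, the polling pattern described above can be sketched as follows. This is an illustrative reconstruction, not the reporter's actual code; the broker, topic, group, and `FetchToken` helper are assumptions:

```csharp
using System;
using Confluent.Kafka;

class ConsumeLoop
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "broker:9093",   // hypothetical broker
            GroupId = "my-group",               // hypothetical group
            SecurityProtocol = SecurityProtocol.SaslSsl,
            SaslMechanism = SaslMechanism.OAuthBearer,
        };

        using var consumer = new ConsumerBuilder<string, string>(config)
            .SetOAuthBearerTokenRefreshHandler((client, cfg) =>
            {
                // Expected to run shortly before the current token expires.
                // The reported bug is that this stops being invoked while
                // the consumer sits in a long blocking Consume() call.
                var (token, expiresAtMs) = FetchToken(); // hypothetical token source
                client.OAuthBearerSetToken(token, expiresAtMs, "principal");
            })
            .Build();

        consumer.Subscribe("my-topic");

        while (true)
        {
            // The 30-second timeout returns control to the application
            // periodically even on a quiet topic, but per the report
            // it did not prevent the stuck JWT_EXPIRED state.
            var result = consumer.Consume(TimeSpan.FromSeconds(30));
            if (result == null) continue; // timed out, no message
            Console.WriteLine(result.Message.Value);
        }
    }

    static (string, long) FetchToken() =>
        ("<token>", DateTimeOffset.UtcNow.AddMinutes(5).ToUnixTimeMilliseconds());
}
```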
How to reproduce
I cannot tell you exactly how to reproduce it, as I haven't been able to do so myself, but I can tell that the issue happens on topics with a low amount of traffic, where the consumer spends a lot of time blocked in the `Consume` method.

Checklist
Please provide the following information: