confluentinc / confluent-kafka-dotnet

Confluent's Apache Kafka .NET client
https://github.com/confluentinc/confluent-kafka-dotnet/wiki
Apache License 2.0

SASL OAuth is sometimes not refreshed and application gets stuck #2329

Open jgn-epp opened 1 month ago

jgn-epp commented 1 month ago

Description

We are using SASL OAuth for authentication and most of the time it works fine. However, occasionally the token expires and the refresh function (passed to the SetOAuthBearerTokenRefreshHandler method) is not fired, leaving the consumer stuck in an error state where the following message is emitted over and over:

SASL authentication error: {"status":"JWT_EXPIRED"} (after 5054ms in state AUTH_REQ, 6 identical error(s) suppressed)
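For reference, our handler registration looks roughly like the following (a minimal sketch, not our actual code; FetchToken, the broker address, and the group id are placeholders):

```csharp
using System;
using Confluent.Kafka;

// Placeholder for whatever acquires the JWT from the identity provider; returns
// the raw token, its expiry as Unix epoch milliseconds, and the principal name.
static (string Token, long ExpiresAtMs, string Principal) FetchToken() =>
    throw new NotImplementedException("token acquisition is environment specific");

var config = new ConsumerConfig
{
    BootstrapServers = "broker:9093",   // placeholder
    GroupId = "example-group",          // placeholder
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslMechanism = SaslMechanism.OAuthBearer,
};

using var consumer = new ConsumerBuilder<string, string>(config)
    .SetOAuthBearerTokenRefreshHandler((client, oauthConfig) =>
    {
        try
        {
            var (token, expiresAtMs, principal) = FetchToken();
            client.OAuthBearerSetToken(token, expiresAtMs, principal);
        }
        catch (Exception e)
        {
            // Report the failure so the client can retry the refresh later.
            client.OAuthBearerSetTokenFailure(e.Message);
        }
    })
    .Build();
```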

When the consumer gets into this state, it never tries to refresh its token again and is effectively dead; the application has to be restarted to recover.

We have multiple environments, and the problem seemingly only affects some of the consumers in environments with low traffic on the topic; I have never seen it in environments with plenty of traffic. My guess is that the partition count is too high for the message throughput, so the consumer spends long enough blocked inside the Consume call that the OAuth refresh function is not invoked before the current token expires. That would also explain why, once a consumer dies this way, its partitions are reassigned to other consumers, which then seemingly have enough traffic to keep running without hitting the issue. Originally we used the Consume(CancellationToken) overload; we have since switched to the Consume(TimeSpan) overload with a 30-second timeout (roughly the loop sketched below), but the issue still eventually happens.
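The poll loop with the TimeSpan overload looks roughly like this (a sketch continuing from the consumer above; the topic name and message handling are placeholders):

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

// Consume(TimeSpan) returns null when the timeout elapses without a message,
// so control comes back every 30 seconds even on an idle topic.
static void Run(IConsumer<string, string> consumer, CancellationToken ct)
{
    consumer.Subscribe("example-topic"); // placeholder topic name
    while (!ct.IsCancellationRequested)
    {
        var result = consumer.Consume(TimeSpan.FromSeconds(30));
        if (result == null)
            continue; // poll timed out; nothing to process

        Console.WriteLine($"{result.TopicPartitionOffset}: {result.Message.Value}");
    }
}
```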

So there definitely seems to be a bug here that prevents the OAuth token refresh handler from firing when the Consume call blocks for too long. I would also expect that, once the consumer starts getting the JWT_EXPIRED error, it would try to refresh its OAuth token rather than staying in an error state and never recovering.

We are using version 2.4.0 of the Confluent.Kafka NuGet package.

How to reproduce

I cannot say exactly how to reproduce it, as I have not been able to do so on demand, but it happens on topics with a low amount of traffic where the consumer spends a lot of time in the blocking Consume method.


anchitj commented 1 month ago

Can you please provide some debug logs from around the time this issue happened?
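For example, librdkafka debug output for the relevant contexts can be enabled through the Debug config property and captured with a log handler (a sketch; the contexts and placeholder values should be adjusted as needed):

```csharp
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "broker:9093",   // placeholder; keep existing SASL/OAuth settings
    GroupId = "example-group",          // placeholder
    Debug = "security,broker,protocol", // librdkafka debug contexts
};

using var consumer = new ConsumerBuilder<string, string>(config)
    .SetLogHandler((_, log) =>
        Console.WriteLine($"{log.Level} {log.Facility} {log.Message}"))
    .Build();
```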

IharYakimush commented 6 days ago

The callback to refresh the OAuth token is not invoked at all if a newly created consumer performs QueryWatermarkOffsets as its first operation.
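A minimal sketch of that scenario (config and RefreshHandler stand in for the real SASL/OAuth setup and token refresh callback; topic and partition are placeholders):

```csharp
using System;
using Confluent.Kafka;

// A freshly built consumer whose first call is QueryWatermarkOffsets; in this
// scenario the token refresh callback reportedly never fires.
using var consumer = new ConsumerBuilder<Ignore, byte[]>(config)
    .SetOAuthBearerTokenRefreshHandler(RefreshHandler)
    .Build();

var offsets = consumer.QueryWatermarkOffsets(
    new TopicPartition("example-topic", 0),
    TimeSpan.FromSeconds(10));

Console.WriteLine($"low={offsets.Low} high={offsets.High}");
```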