Closed noamzafrir closed 2 years ago
I could not tell whether this is an idempotent producer. If it is, are you checking for fatal errors? Not checking was a mistake I once made myself.
Hi @larry-cdn77, Thank you for the quick response. I am using a regular producer, not an idempotent one. Can you elaborate on the error you experienced? Perhaps it's relevant to me as well.
The only disconnect I see is the bootstrap connection timing out, which makes sense since it is typically not used after initial cluster connection. The other connection that's used for fetching, broker 1, is indeed still up, so I don't see anything wrong in the logs?
Hi @edenhill, I agree there's nothing wrong in the logs. As I said, I am deliberately failing the broker, so this behavior is expected. The question is: how can I identify this "Connection Timed Out" on the client side and do something to recover? Also, which parameter defines this timeout? I see in the log that it was triggered after 130000ms, but I haven't found any parameter configured to 130000. Thanks.
Connection timed out (after 130037ms in state CONNECT)
means that it has been connected for 130037ms, and then disconnected
The client will reconnect automatically when necessary. You don't need to do anything on disconnect. Also see https://github.com/edenhill/librdkafka/wiki/FAQ#why-am-i-seeing-receive-failed-disconnected
Connection timed out (after 130037ms in state CONNECT)
means that it has been connected for 130037ms, and then disconnected
Yes, I understand, but which configuration parameter is responsible to set 130000ms?
The client will reconnect automatically when necessary. You don't need to do anything on disconnect. Also see https://github.com/edenhill/librdkafka/wiki/FAQ#why-am-i-seeing-receive-failed-disconnected
Sorry, but that's not what I see in the log. Up until the "Timed Out" message I do see reconnection attempts, but after this message I don't. From the link you shared I understand that the "Local: Timed Out" I'm experiencing indicates that the round-trip time allowed for communication between client and broker was exceeded. This is OK given that the broker is down. So the question remains: how can I configure my client to keep trying to reconnect to the broker? Is there a specific parameter that limits the number of reconnection attempts?
Yes, I understand, but which configuration parameter is responsible to set 130000ms?
Typically the broker's idle connection reaper, or a load-balancer's.
So the question remains: how can I configure my client to keep trying to reconnect to the broker?
It will reconnect automatically as necessary. If it is disconnected from a broker it will only reconnect if it needs to.
Is there a specific parameter that limits the number of re connection attempts?
No.
Are you seeing an actual problem, or just asking about the connection behaviour?
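For reference, while there is no parameter limiting the number of reconnect attempts, the pace of those attempts is tunable. The values below are the documented librdkafka defaults at the time of writing; check the CONFIGURATION.md for your installed version before relying on them:

```ini
# librdkafka client properties (defaults shown)
reconnect.backoff.ms=100                  # initial wait before reconnecting to a broker
reconnect.backoff.max.ms=10000            # cap on the exponentially increasing backoff
socket.connection.setup.timeout.ms=30000  # max time allowed to establish a connection
```

Note that `socket.connection.setup.timeout.ms` was introduced in librdkafka 1.6.0, so it is not available in the 1.2.1 build shipped with Ubuntu 20.04.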
Thank you for the detailed answer @edenhill. As I mentioned at the beginning of this discussion, I want to make sure that my Kafka client can recover after a broker fails and comes back up (either immediately or after some time). I installed the latest stable version of librdkafka - 1.6.2 (currently using 1.2.1, which is the native version of Ubuntu 20.04). With the new version it looks like the client is recovering after 5-6 minutes, but I want to double check that it's consistent. Is it possible that this problem was not handled well in version 1.2.1? Is 5-6 minutes a reasonable time for recovery?
1.8.2 is the latest version of librdkafka.
librdkafka will automatically try to recover from all errors.
The recovery time depends on what problem the client is recovering from, so please provide some more details.
I will try version 1.8.2.
With regards to problem - I am emulating a problem of broker failure: I have 2 brokers running on separate Kubernetes (K8S) nodes and I am failing each of them one after the other. The failure I am applying has 2 parts:
As for version 1.8.2, it is stated here that version 1.6.2 is the latest (not 1.8.2). It is also stated that version 1.8.2 is from Oct '21 while 1.6.2 is from Nov '21. Can you please clarify that?
It can be a little confusing to see the maintenance version at the top of the list, but if you scroll down on that releases page, 1.8.2 will also show up. Hope this helps.
Yes, @larry-cdn77, I scrolled down and found version 1.8.2. However, it is still stated that 1.6.2 is the latest. Hence my confusion.
Hello @noamzafrir, I am facing the same issue. Can you please tell me what fixed your error?
@noamzafrir Hi, I am facing this issue now!
%4|1702915469.895|REQTMOUT|rdkafka#consumer-1| [thrd:sasl_plaintext://kafka:9092/bootstrap]: sasl_plaintext://kafka:9092/9: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%3|1702915469.895|FAIL|rdkafka#consumer-1| [thrd:sasl_plaintext://kafka:9092/bootstrap]: sasl_plaintext://kafka:9092/9: 1 request(s) timed out: disconnect (after 1563776420ms in state UP)
%4|1702915499.928|FAIL|rdkafka#consumer-1| [thrd:sasl_plaintext://kafka:9092/bootstrap]: sasl_plaintext://kafka:9092/9: Connection setup timed out in state CONNECT (after 30032ms in state CONNECT)
After this error happens, no reconnect attempts are made. librdkafka version: v2.1.0. Any idea?
I'm getting the same error, and after this the Kafka client (consumer code) is not able to connect again; the only option is to restart the pods (service). Kindly let me know if there is any way to configure the consumer code to re-initiate the connection.
%4|1710193417.943|FAIL|rdkafka#consumer-9886| [thrd:sasl_ssl://brokeraddress.amazonaws.com]: sasl_ssl://b-3.msk-wbrokeraddress.amazonaws.com:9096/3: Connection setup timed out in state APIVERSION_QUERY (after 29924ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%4|1710193447.946|FAIL|rdkafka#consumer-9886| [thrd:sasl_ssl://b-2.brokeraddress.amazonaws.com]: sasl_ssl://b-2.msk-wesbroker.amazonaws.com:9096/2: Connection setup timed out in state APIVERSION_QUERY (after 29912ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
I got the same error. After a Kafka server maintenance, the consumer stopped consuming from one partition only. The only solution is to restart the service.
Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ
Do NOT create issues for questions, use the discussion forum: https://github.com/edenhill/librdkafka/discussions
Description
I am deliberately failing my Kafka brokers in order to test my system. I am running with Kubernetes and I apply 2 types of failures:
|FAIL|rdkafka#producer-2| [thrd:metro-kafka-bootstrap:9092/bootstrap]: metro-kafka-bootstrap:9092/bootstrap: Connect to ipv4#10.43.107.45:9092 failed: Connection timed out (after 130037ms in state CONNECT)
Until this error I see that librdkafka tries to reconnect to the broker. After this error happens, no reconnect attempts are made. I have no problem with this behavior, but I would like to understand how I can make my system recover from this type of error. My questions are:
Thank You.
How to reproduce
Fail a Kafka broker while the system is running.
IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
Provide logs (with debug=.. as necessary) from librdkafka: see attached file rdkafka_log.txt (debug = "broker,topic,queue,msg,protocol")