confluentinc / librdkafka

The Apache Kafka C/C++ library

Connection to Broker Timed Out #3703

Closed noamzafrir closed 2 years ago

noamzafrir commented 2 years ago

Description

I am deliberately failing my Kafka brokers in order to test my system. I am running with Kubernetes and I apply 2 types of failures:

  1. Killing the broker's pod - the broker's pod immediately comes back up and re-creates the broker.
  2. Making the broker's pod unavailable for 30 seconds.

Regardless of which permutation of failures I apply, I've noticed that after some time I get a "connection timed out" error:

|FAIL|rdkafka#producer-2| [thrd:metro-kafka-bootstrap:9092/bootstrap]: metro-kafka-bootstrap:9092/bootstrap: Connect to ipv4#10.43.107.45:9092 failed: Connection timed out (after 130037ms in state CONNECT)

Until this error I see that librdkafka tries to re-connect to the broker. After this error happens no re-connect attempts are made. I have no problem with this behavior, but I would like to understand how I can make my system recover from this type of error.

My questions are:

  1. Which configuration parameter determines the timeout period? I've played around with a few timeout configuration parameters but none seem to make a difference. I also noticed that the timeout in the error message is ~130000ms. I haven't seen any configuration parameter that is set to this value.
  2. When the timeout error occurs, do I get some indication from librdkafka?
  3. How can I recover from this timeout error? Maybe reset the Kafka client?

Thank You.

How to reproduce

Fail a Kafka broker while the system is running.

larry-cdn77 commented 2 years ago

I did not gather whether this is an idempotent producer. If yes, are you checking for fatal errors? Not checking was a mistake I had made.

noamzafrir commented 2 years ago

Hi @larry-cdn77, Thank you for the quick response. I am using a regular producer, not idempotent. Can you elaborate more about the error you experienced? Perhaps it's relevant to me as well.

edenhill commented 2 years ago

The only disconnect I see is the bootstrap connection timing out, which makes sense since it is typically not used after initial cluster connection. The other connection that's used for fetching, broker 1, is indeed still up, so I don't see anything wrong in the logs?

noamzafrir commented 2 years ago

Hi @edenhill, I agree there's nothing wrong in the logs. As I said, I am deliberately failing the broker so this behavior is expected. Question is, how can I identify this "Connection Timed Out" on the client side and do something to recover. Also, which parameter defines this timeout? I see in the log that this timeout was triggered after 130000ms but I haven't seen any parameter that is configured to 130000. Thanks.
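One possible source of the ~130000ms figure, offered here as an assumption and not something confirmed in this thread: if the client applies no connection-setup limit of its own, a connect() to an unreachable address is bounded by the operating system's TCP SYN retransmission schedule. On Linux, assuming the default `net.ipv4.tcp_syn_retries=6` and a 1 s initial retransmission timeout that doubles on each retry, the kernel gives up after roughly 127 s, which is close to the 130037ms in the log. librdkafka 1.6.0 and later additionally cap connection setup with `socket.connection.setup.timeout.ms` (default 30000). The arithmetic can be sketched as follows; the function name is hypothetical:

```python
def syn_connect_timeout_s(syn_retries: int = 6, initial_rto_s: float = 1.0) -> float:
    """Seconds until connect() gives up when no SYN is ever answered.

    The initial SYN and each of the `syn_retries` retransmissions wait
    one timeout, and the timeout doubles after every attempt.
    """
    return sum(initial_rto_s * 2 ** i for i in range(syn_retries + 1))


# Assumed Linux defaults: 1 + 2 + 4 + 8 + 16 + 32 + 64 seconds.
print(syn_connect_timeout_s())  # 127.0
```

If this is indeed the cause, the value would not appear in any librdkafka configuration property, which would match the observation above.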

ladislavmacoun commented 2 years ago

Connection timed out (after 130037ms in state CONNECT)

means that it has been connected for 130037ms, and then disconnected

edenhill commented 2 years ago

The client will reconnect automatically when necessary. You don't need to do anything on disconnect. also see https://github.com/edenhill/librdkafka/wiki/FAQ#why-am-i-seeing-receive-failed-disconnected

noamzafrir commented 2 years ago

Connection timed out (after 130037ms in state CONNECT)

means that it has been connected for 130037ms, and then disconnected

Yes, I understand, but which configuration parameter is responsible to set 130000ms?

noamzafrir commented 2 years ago

The client will reconnect automatically when necessary. You don't need to do anything on disconnect. also see https://github.com/edenhill/librdkafka/wiki/FAQ#why-am-i-seeing-receive-failed-disconnected

Sorry, but that's not what I see in the log. Up until the "Timed Out" message I do see reconnection attempts, but after this message I don't see any. From the link you've shared I understand that the "Local: Timed Out" I'm experiencing is actually an indication that the round-trip time defined for communication between client and broker was exceeded. This is OK given that the broker is down. So the question remains: how can I configure my client to keep trying to reconnect to the broker? Is there a specific parameter that limits the number of reconnection attempts?

edenhill commented 2 years ago

Yes, I understand, but which configuration parameter is responsible to set 130000ms?

Typically the broker's idle connection reaper, or a load-balancer's.

So question remains, how can I configure my client to keep trying to reconnect to the broker?

It will reconnect automatically as necessary. If it is disconnected from a broker it will only reconnect if it needs to.

Is there a specific parameter that limits the number of reconnection attempts?

No.

Are you seeing an actual problem, or just asking about the connection behaviour?
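To illustrate the "no limit" answer: librdkafka spaces its reconnect attempts using `reconnect.backoff.ms` (default 100), doubling the delay after each failed attempt until it reaches `reconnect.backoff.max.ms` (default 10000), and then keeps retrying at that cap. A minimal sketch of the nominal schedule follows. The function name is hypothetical, and the jitter that librdkafka applies to each delay (documented as -25% to +50%) is left out:

```python
from itertools import islice
from typing import Iterator


def reconnect_backoffs_ms(initial_ms: int = 100, max_ms: int = 10_000) -> Iterator[int]:
    """Yield nominal reconnect delays forever; there is no attempt limit."""
    delay = initial_ms
    while True:
        yield min(delay, max_ms)
        # Double the delay for the next attempt, capped at max_ms.
        delay = min(delay * 2, max_ms)


# First ten nominal delays with the default property values.
print(list(islice(reconnect_backoffs_ms(), 10)))
# [100, 200, 400, 800, 1600, 3200, 6400, 10000, 10000, 10000]
```

Both properties can be set like any other librdkafka configuration property, for example with `rd_kafka_conf_set()` in C.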

noamzafrir commented 2 years ago

Thank you for the detailed answer @edenhill. As I mentioned at the beginning of this discussion, I want to make sure that my Kafka client can recover after a broker has failed and come back up (either immediately or after some time). I installed the latest stable version of librdkafka - 1.6.2 (I was using 1.2.1, which is the native version on Ubuntu 20.04). With the new version it looks like the client recovers after 5-6 minutes, but I want to double check that it's consistent. Is it possible that this problem was not handled well in version 1.2.1? Is 5-6 minutes a reasonable time for recovery?

edenhill commented 2 years ago

1.8.2 is the latest version of librdkafka.

librdkafka will automatically try to recover from all errors.

The recovery time depends on what problem the client is recovering from, so please provide some more details.

noamzafrir commented 2 years ago

I will try version 1.8.2.

With regards to problem - I am emulating a problem of broker failure: I have 2 brokers running on separate Kubernetes (K8S) nodes and I am failing each of them one after the other. The failure I am applying has 2 parts:

  1. Killing the K8S pod. In this case K8S immediately brings the pod (and the broker that runs on it) back up.
  2. Failing the K8S pod for 30sec. In this case the broker becomes unavailable for 30sec.

Let me know if you need more info. Thanks.

noamzafrir commented 2 years ago

As for version 1.8.2, it is stated here that version 1.6.2 is the latest (not 1.8.2). It is also stated that version 1.8.2 is from Oct 21 while 1.6.2 is from Nov 21. Can you please clarify that?

larry-cdn77 commented 2 years ago

As for version 1.8.2, it is stated here that version 1.6.2 is the latest (not 1.8.2). It is also stated that version 1.8.2 is from Oct 21 while 1.6.2 is from Nov 21. Can you please clarify that?

It can be a little confusing to see the maintenance version at the top of the list but if you scroll down on that releases page, 1.8.2 will also show up. Hope this helps.

noamzafrir commented 2 years ago

Yes, @larry-cdn77, I scrolled down and found version 1.8.2. However, it is still stated that 1.6.2 is the latest. Hence my confusion.

kirankota608 commented 1 year ago

Hello noamza, I am facing the same issue. Can you please tell me what fixed your error?

ING-XIAOJIAN commented 10 months ago

@noamzafrir Hi, I am facing this issue now!

02915469.895|REQTMOUT|rdkafka#consumer-1| [thrd:sasl_plaintext://kafka:9092/bootstrap]: sasl_plaintext://kafka:9092/9: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%3|1702915469.895|FAIL|rdkafka#consumer-1| [thrd:sasl_plaintext://kafka:9092/bootstrap]: sasl_plaintext://kafka:9092/9: 1 request(s) timed out: disconnect (after 1563776420ms in state UP)
%4|1702915499.928|FAIL|rdkafka#consumer-1| [thrd:sasl_plaintext://kafka:9092/bootstrap]: sasl_plaintext://kafka:9092/9: Connection setup timed out in state CONNECT (after 30032ms in state CONNECT)

After this error happens no re-connect attempts are made. librdkafka version: v2.1.0. Any idea?

neerajk22 commented 8 months ago

Getting the same error, and after this the Kafka client (consumer code) is not able to connect again; the only option is to restart the pods (service). Kindly let me know if there is any way to configure the consumer code to re-initiate the connection.

%4|1710193417.943|FAIL|rdkafka#consumer-9886| [thrd:sasl_ssl://brokeraddress.amazonaws.com]: sasl_ssl://b-3.msk-wbrokeraddress.amazonaws.com:9096/3: Connection setup timed out in state APIVERSION_QUERY (after 29924ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%4|1710193447.946|FAIL|rdkafka#consumer-9886| [thrd:sasl_ssl://b-2.brokeraddress.amazonaws.com]: sasl_ssl://b-2.msk-wesbroker.amazonaws.com:9096/2: Connection setup timed out in state APIVERSION_QUERY (after 29912ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
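For anyone who wants to detect this condition from outside the client, the broker state and elapsed time at the end of each FAIL line can be pulled out of the log text. A minimal sketch, assuming the line layout seen in the logs posted in this thread; the regular expression, `BrokerFailure`, and `parse_fail_line` are illustrative names and not part of librdkafka:

```python
import re
from typing import NamedTuple, Optional


class BrokerFailure(NamedTuple):
    client: str      # e.g. "rdkafka#consumer-1"
    state: str       # broker state named in the message, e.g. "CONNECT"
    elapsed_ms: int  # time spent in that state before the failure


# Assumed layout: ...|FAIL|<client>| ... (after <N>ms in state <STATE>...)
_FAIL_RE = re.compile(
    r"\|FAIL\|(?P<client>[^|]+)\|.*"
    r"\(after (?P<ms>\d+)ms in state (?P<state>[A-Z_]+)"
)


def parse_fail_line(line: str) -> Optional[BrokerFailure]:
    """Return the parsed failure, or None if the line is not a FAIL line."""
    m = _FAIL_RE.search(line)
    if m is None:
        return None
    return BrokerFailure(m["client"], m["state"], int(m["ms"]))


line = (
    "%4|1702915499.928|FAIL|rdkafka#consumer-1| "
    "[thrd:sasl_plaintext://kafka:9092/bootstrap]: "
    "sasl_plaintext://kafka:9092/9: Connection setup timed out in state CONNECT "
    "(after 30032ms in state CONNECT)"
)
print(parse_fail_line(line))
# BrokerFailure(client='rdkafka#consumer-1', state='CONNECT', elapsed_ms=30032)
```

Inside the application itself, registering an error callback (`rd_kafka_conf_set_error_cb()` in the C API) is likely a more direct way to observe such failures than scraping logs.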

matheusavi commented 6 months ago

I got the same error. After a Kafka server maintenance, the consumer stopped consuming from one partition only. The only solution is to restart the service.