confluentinc / librdkafka

The Apache Kafka C/C++ library

[Question] Understanding "1 request(s) timed out: disconnect" #2845

Closed mateuszmrozewski closed 4 years ago

mateuszmrozewski commented 4 years ago

Description

I am observing the following warnings in logs:

08:28:20 [Warn  ] [KAFKA_PRODUCER] [anonymous] [5] [REQTMOUT] [thrd:sasl_plaintext://kafka1:9093/1]: sasl_plaintext://kafka1:9093/1: Timed out MetadataRequest in flight (after 60659ms, timeout #0)
08:28:20 [Warn  ] [KAFKA_PRODUCER] [anonymous] [4] [REQTMOUT] [thrd:sasl_plaintext://kafka1:9093/1]: sasl_plaintext://kafka1:9093/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests

which effectively results in the error callback being called with code -185 and reason sasl_plaintext://kafka1:9093/1: 1 request(s) timed out: disconnect (after 907686ms in state UP).

I see similar errors for ProduceRequest as well.

I have set debug=protocol in the configuration to get more detailed logs and to be able to trace the particular requests that time out.

I have a few questions around that:

  1. Is debug=protocol the right setting to get more information about this issue?
  2. Do I understand correctly that these requests are sitting in the queue for longer than the timeout value (60s)?
  3. Are those requests effectively terminated? In particular, if this is a ProduceRequest, will I lose data?
  4. Can it be related to the client and/or broker being too busy at that time, so that the request effectively sits in the queue too long?
  5. Is extending socket.timeout.ms a safe remediation here?

Let me know if providing more details would help to answer the questions. Thank you.
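
For context, the reason string above is delivered through librdkafka's error_cb, and code -185 corresponds to RD_KAFKA_RESP_ERR__TIMED_OUT. Below is a minimal sketch of registering such a callback and enabling debug=protocol with the librdkafka C API; the broker address and all values are placeholders, not the reporter's actual configuration.

#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Sketch only: error_cb receives events such as
 * "1 request(s) timed out: disconnect"; err -185 maps to
 * RD_KAFKA_RESP_ERR__TIMED_OUT. */
static void error_cb(rd_kafka_t *rk, int err, const char *reason, void *opaque) {
        fprintf(stderr, "error_cb: %d (%s): %s\n",
                err, rd_kafka_err2name((rd_kafka_resp_err_t)err), reason);
}

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "kafka1:9093", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "debug", "protocol", errstr, sizeof(errstr));
        rd_kafka_conf_set_error_cb(conf, error_cb);

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "producer creation failed: %s\n", errstr);
                return 1;
        }

        rd_kafka_poll(rk, 1000); /* serves error_cb and log callbacks */
        rd_kafka_destroy(rk);
        return 0;
}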

How to reproduce

No way of consistent reproduction found.

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

edenhill commented 4 years ago

The first thing you should do is upgrade to v1.4.0; there have been a large number of fixes since v1.0.0 and we only provide free community support for the latest release.

Answers to your questions:

  1. Yes
  2. Yes
  3. No, but your messages may be duplicated unless you are using idempotence. With idempotence enabled there is no risk of duplicates, reordering or lost messages. See https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md#message-reliability
  4. Yes, or network congestion.
  5. I don't think it will help much; if a single metadata request takes more than 60s, something is wrong.
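
To make answer 3 concrete: with the idempotent producer, per-message outcomes still arrive through the delivery report callback, so that (not the error callback) is where actual message loss would surface. A minimal sketch using the librdkafka C API; broker and topic names are placeholders, not from this thread.

#include <stdio.h>
#include <string.h>
#include <librdkafka/rdkafka.h>

/* Sketch only: a timed-out ProduceRequest is retried internally; the
 * delivery report is the authoritative signal of whether a message was
 * persisted or not. */
static void dr_msg_cb(rd_kafka_t *rk, const rd_kafka_message_t *m, void *opaque) {
        if (m->err)
                fprintf(stderr, "delivery failed: %s\n", rd_kafka_err2str(m->err));
}

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "kafka1:9093", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "enable.idempotence", "true", errstr, sizeof(errstr));
        rd_kafka_conf_set_dr_msg_cb(conf, dr_msg_cb);

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk)
                return 1;

        const char *payload = "example";
        rd_kafka_producev(rk,
                          RD_KAFKA_V_TOPIC("example-topic"),
                          RD_KAFKA_V_VALUE((void *)payload, strlen(payload)),
                          RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
                          RD_KAFKA_V_END);

        rd_kafka_flush(rk, 10 * 1000); /* wait for outstanding delivery reports */
        rd_kafka_destroy(rk);
        return 0;
}
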
mateuszmrozewski commented 4 years ago

Thank you for a quick response.

I have enable.idempotence set to true and message.send.max.retries is unchanged, so it defaults to 2.

Would it be correct to assume that these requests are retried and succeed unless follow-up errors are reported? The log messages are warnings, so that sounds reasonable.

For the metadata requests taking long: we have a large number of topics (more than 5k), some of them with 3-6 partitions - I would say ~7k partitions in total. I suppose that might be a potential issue here. I will try to get more detailed logging for this.

And yes, agreed, we should upgrade to the latest version :)

mateuszmrozewski commented 4 years ago

With more detailed logs I managed to find this:

08:22:05.863779912 [Warn  ] [KAFKA_PRODUCER] [anonymous] [7] [SEND] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Sent ProduceRequest (v7, 1714 bytes @ 0, CorrId 656)
08:22:05.879245082 [Warn  ] [KAFKA_PRODUCER] [anonymous] [7] [RECV] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Received ProduceResponse (v7, 57 bytes, CorrId 656, rtt 15.46ms)

...

08:41:26.223728743 [Warn  ] [KAFKA_PRODUCER] [anonymous] [7] [SEND] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Sent ProduceRequest (v7, 3145 bytes @ 0, CorrId 657)
08:42:27.170875331 [Warn  ] [KAFKA_PRODUCER] [anonymous] [5] [REQTMOUT] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Timed out ProduceRequest in flight (after 60947ms, timeout #0)
08:42:27.672100720 [Warn  ] [KAFKA_PRODUCER] [anonymous] [7] [SEND] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Sent ProduceRequest (v7, 3145 bytes @ 0, CorrId 660)
08:42:27.689210747 [Warn  ] [KAFKA_PRODUCER] [anonymous] [7] [RECV] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Received ProduceResponse (v7, 57 bytes, CorrId 660, rtt 17.11ms)

It seems that this producer didn't send any messages between 08:22:05 and 08:41:26. This might be OK from the business-logic point of view. Between those logs I can see many FetchRequests and matching FetchResponses sent to the same broker, which suggests the connection to that specific broker is fine from the network point of view. I can also see that the next ProduceRequest to this broker that gets a successful response has CorrId 660.

A few questions on top of that:

  1. I assume I am hitting connections.max.idle.ms. Is that a valid assumption?
  2. How should I interpret the gap in CorrIds here? It looks like CorrId 660 is the same size as 657. How can I verify that this is in fact the same request as CorrId 657?

Thank you

edenhill commented 4 years ago

I recommend that you upgrade to v1.4.0; there have been plenty of bug fixes since v1.0.0. Or wait a couple of days and upgrade to v1.4.2, which we'll hopefully release tomorrow.

edenhill commented 4 years ago

If it is reproducible on 1.4.x we'll troubleshoot this further.

JoakimBlach commented 4 years ago

Hi edenhill

I'm facing a similar issue with a metadata request timeout "Timed out MetadataRequest in flight".

In my setup I'm publishing events to two topics (topic1 and topic_deadletterqueue). I rarely publish any events to topic_deadletterqueue (maybe once a day), and this is where I experience the error. I'm using the confluent-kafka library in Python to interact with Azure Event Hubs, using this recommended config: https://github.com/Azure/azure-event-hubs-for-kafka/blob/master/CONFIGURATION.md#producer-configurations-only.

Do you have any suggestions?

LOGS


%7|1588245108.945|SASLMECHS|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Broker supported SASL mechanisms: PLAIN,OAUTHBEARER
%7|1588245108.945|SEND|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Sent SaslAuthenticateRequest (v0, 190 bytes @ 0, CorrId 4)
%7|1588245108.947|RECV|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Received SaslAuthenticateResponse (v0, 8 bytes, CorrId 4, rtt 1.70ms)
%7|1588245108.947|SEND|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Sent MetadataRequest (v2, 30 bytes @ 0, CorrId 5)
%7|1588245109.091|RECV|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Received MetadataResponse (v2, 728 bytes, CorrId 5, rtt 144.39ms)
%7|1588245408.894|SEND|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Sent MetadataRequest (v2, 30 bytes @ 0, CorrId 6)
%5|1588245469.421|REQTMOUT|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Timed out MetadataRequest in flight (after 60526ms, timeout #0)
%4|1588245469.421|REQTMOUT|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%7|1588245469.421|BROKERFAIL|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: failed: err: Local: Timed out: (errno: Connection timed out)
%7|1588245469.421|RETRY|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Retrying MetadataRequest (v2, 30 bytes, retry 1/5, prev CorrId 6) in 100ms
%7|1588245470.490|CONNECTED|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Connected (#2)
%7|1588245470.490|SEND|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Sent ApiVersionRequest (v3, 71 bytes @ 0, CorrId 7)
%7|1588245470.492|RECV|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Received ApiVersionResponse (v3, 10 bytes, CorrId 7, rtt 1.42ms)
%7|1588245470.492|PROTOERR|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Protocol parse failure for ApiVersion v3 at 3/10 (rd_kafka_handle_ApiVersion:1911) (incorrect broker.version.fallback?)
%7|1588245470.492|PROTOERR|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: ApiArrayCnt -1 out of range


CHECKLIST

librdkafka version (release number or git tag): v1.4.1
Apache Kafka version: None. Using Event Hubs in Microsoft Azure.
librdkafka client configuration:
Operating system: ubuntu:19.10
Provide logs (with debug=.. as necessary) from librdkafka - check!
Provide broker log excerpts - nothing reasonable on the broker side
Critical issue - yes

Best, Joakim

edenhill commented 4 years ago

Joakim, we've seen a couple of similar reports from users on Azure and we can't really provide an explanation: something stalls the request/response until it times out, and we believe this to be outside the client, so most likely something in Azure. I recommend opening an issue with the Azure folks.

JoakimBlach commented 4 years ago

Thanks for a swift response. I'm talking to Azure already. So far they have not been able to identify the problem. But I will let you know what they come up with.

mateuszmrozewski commented 4 years ago

Thank you for the comments @edenhill. I will be updating to v1.4.2 when it's available and I will try to debug the issue again. I will raise a new issue if needed.

JoakimBlach commented 4 years ago

@edenhill I got an answer from Azure:

The service closes idle connections (idle here means no request received for 10 minutes). This happens even when TCP keep-alive is enabled, because that setting only keeps the connection open but does not generate protocol requests. In your case you have only observed the error on the idle event hub, so it may be related to the idle connection being closed and the client (Kafka) SDK not handling that correctly.

So they asked me to reach out to understand how librdkafka tries to keep the connection alive if no events are sent to the event hub for 10 minutes or more.

Thanks

edenhill commented 4 years ago

Apart from the periodic metadata refresh (topic.metadata.refresh.interval.ms), which will use a random connection, there is no Kafka protocol-level "keepalive".

When sending the periodic metadata refreshes (which is what is timing out here) we could try to weigh the available connections by idle time and use the connection which was most recently used. This will hopefully avoid using a connection that has timed out on the load balancer, and also let truly idle connections time out as they won't get used for the metadata refreshes.

As for the idle connection closing, I suspect that their load balancers are silently dropping the connection without sending an RST or FIN to the client, so there is no way for the client's TCP stack to know the connection was closed until a request (or TCP keepalive) is sent. If the connection was properly TCP-closed the client would be notified of the close instantly.
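
For completeness, the TCP keepalive Azure mentioned maps to a single librdkafka property; a minimal, purely illustrative sketch:

#include <librdkafka/rdkafka.h>

/* Sketch only: socket.keepalive.enable turns on TCP-level keepalive probes.
 * As noted above, this keeps the socket alive at the transport layer but
 * generates no Kafka protocol requests, so a request-level idle reaper
 * (like the 10-minute one described by Azure) is not affected by it. */
static void enable_tcp_keepalive(rd_kafka_conf_t *conf) {
        char errstr[512];
        rd_kafka_conf_set(conf, "socket.keepalive.enable", "true",
                          errstr, sizeof(errstr));
}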

cculver commented 4 years ago

We have been following this thread as well. We are operating Kafka on Azure and we are occasionally having trouble producing to Confluent Cloud. I wanted to confirm that we have upgraded our client to v1.4.2 and the issue still exists. We are seeing in our logs that the client is timing out. I have logs if you're interested.

edenhill commented 4 years ago

Is there anything in your logs that contradicts the idle connection reaper hypothesis?

cculver commented 4 years ago

No, in fact I believe that makes a lot of sense here. How far off are we from a solution?

edenhill commented 4 years ago

The fix will go into v1.5.0 which is scheduled for early June.

These errors should not have much effect though; it is just a periodic metadata refresh that gets dropped.

cculver commented 4 years ago

Part of our problem is that it freezes the client for 60 seconds, limiting our ability to respond to the original request in a timely manner. The only way around this would be to produce inside another thread. Is this the suggested route?

edenhill commented 4 years ago

What exactly is freezing? A missed metadata refresh and a closed idle connection shouldn't cause any freezes.

cculver commented 4 years ago

We're using the .NET client and the call to Produce() does not return for 60 seconds.

edenhill commented 4 years ago

Can you provide your producer config and reproduce this with Debug: "topic,msg,metadata,broker" and provide the logs?

cculver commented 4 years ago

We were able to successfully cause a producer to freeze. Please note that this is an intermittent issue and it took about 10 produce calls to make it happen. This froze the call for about 40 seconds. I included all debugging information from the start until a bit after the problem occurred; you'll find the relevant portion from line 4964 to line 5099 of the attached file. Our producer config has all defaults except for the connection settings and the debug setting you instructed us to insert above.

"Debug": "topic,msg,metadata,broker", "BootstrapServers": "....eastus.azure.confluent.cloud:9092", "SaslMechanism": "Plain", "SecurityProtocol": "SaslSsl", "SaslUsername": "...", "SaslPassword": "..." logs.txt

Thanks for your help.

edenhill commented 4 years ago

Thank you!

So I need to revise my previous statement: the client is mostly idle and the Azure load balancer is silently (without RST/FIN) killing off the client's idle connections, but the client won't know about it until it attempts to use the connection, at which point the requests will time out due to not getting any TCP ACKs (or RSTs, for that matter) back. While it is a MetadataRequest that times out, which in itself doesn't have much adverse effect, it is head-of-line blocking the subsequent produce request on the same connection, and this is why your synchronous produce is blocking/freezing.

There isn't much the client can do about it short of lowering socket.timeout.ms/request.timeout.ms. For these things to function properly it would be useful if the Azure load balancer either actively closed idle connections with FIN/RST, or at least actively refused packets to unknown TCP sessions, so the client could identify the socket as closed much more quickly than the Kafka protocol timeout.
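
A sketch of that mitigation with an illustrative (not recommended) value: a lower socket.timeout.ms bounds how long an in-flight request can sit on a silently dropped connection before the client gives up, reconnects and retries, which also shrinks the worst-case stall seen by a synchronous produce.

#include <librdkafka/rdkafka.h>

/* Sketch only: the default socket.timeout.ms is 60000 ms, which matches the
 * ~60 s freezes in this thread; lowering it makes a dead connection get
 * detected (and the request retried on a fresh connection) sooner. */
static void lower_request_timeouts(rd_kafka_conf_t *conf) {
        char errstr[512];
        rd_kafka_conf_set(conf, "socket.timeout.ms", "20000",
                          errstr, sizeof(errstr));
}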

cculver commented 4 years ago

Thanks for the assistance. Unbeknownst to me, we were using an Azure load balancer that was not configured to send TCP reset on idle. We're working on this change and we expect that this will remedy the issue.

edenhill commented 4 years ago

Let us know how it goes :+1:

algattik commented 4 years ago

Having the same issue with librdkafka 1.4.2 connecting to Azure Event Hubs for Kafka.

I have an application that creates a Producer and a Consumer. The Producer is not producing any messages. The Consumer is consuming with the C# Confluent.Kafka IConsumer.Consume(cancellationToken), and no messages are arriving in the topic.

The producer goes down after 5 minutes: https://gist.github.com/algattik/f9926b1d58a2c9b12a7a99a717bcf8b0 (with Debug: "topic,msg,metadata,broker")

When setting topic.metadata.refresh.interval.ms to 60000 (1 minute instead of the default 5 minutes), that error disappears.

However the consumer goes down after 30 minutes: https://gist.github.com/algattik/c16375baeeb824e8b5ed948cdafbe22c

cculver commented 4 years ago

I'm writing back with an intermediate result. We didn't have much luck configuring the Azure load balancer. Even with the configurable version of their load balancer (the standard tier, vs. the basic one), the send-TCP-reset setting enabled and the idle timeout set to 30 minutes, we were still seeing metadata timeouts 4-6 minutes after starting the application. This indicates that the load balancer settings were not working for us. After finding recommended configuration settings for Azure Event Hubs (Microsoft's own Kafka cloud implementation) here, we added the following settings to consumers and producers and are seeing success.

Producers:
socket.keepalive.enable: true
metadata.max.age.ms: 180000
connections.max.idle.ms: 180000
max.request.size: 1000000
metadata.max.idle.ms: 180000

Consumers:
socket.keepalive.enable: true
metadata.max.age.ms: 180000
connections.max.idle.ms: 180000
heartbeat.interval.ms: 3000
session.timeout.ms: 30000

This is less than ideal, but really isn't the end of the world. I'm just glad it's working.

cculver commented 4 years ago

Unfortunately, that attempted fix did not end up working for producers. The referenced document did not help much because the settings we were hoping would help are actually targeted at the Java client and are not available in librdkafka, nor in the C# library we are using. I'm still trying to solve this issue by going through Azure support to see if we configured our load balancer incorrectly, but I'm not sure whether fixing the load balancer on our side will even correct the issue, because the other side (Confluent Cloud) is on Azure as well (and they may not have addressed this timeout problem on their load balancers). What would a recommended producer configuration look like that ensures the connection transmits data before Azure drops it after its default 4-minute idle timeout expires?

edenhill commented 4 years ago

The fix in https://github.com/edenhill/librdkafka/commit/052745727ce86c2c36c7cc3c4aefc181c5da88d8 will prefer the least idle connection when selecting "any" broker connection for metadata refresh, etc. This, in combination with a topic.metadata.refresh.interval.ms lower than the LB/Kafka idle connection timeout, should hopefully minimize the number of these timeout-due-to-killed-connection errors.
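
A sketch of that combination, assuming the Azure load balancer's default 4-minute (240 s) idle timeout mentioned above; the 180000 ms value is illustrative, chosen only to sit comfortably below that timeout.

#include <librdkafka/rdkafka.h>

/* Sketch only: refresh metadata more often than the load balancer's idle
 * timeout, so the connection chosen for refreshes (the least idle one, as
 * of the fix above) never sits idle long enough to be silently dropped. */
static void keep_metadata_connection_warm(rd_kafka_conf_t *conf) {
        char errstr[512];
        rd_kafka_conf_set(conf, "topic.metadata.refresh.interval.ms", "180000",
                          errstr, sizeof(errstr));
}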

mateuszmrozewski commented 4 years ago

@edenhill, I have upgraded to 1.4.2. After the upgrade I have received a similar error once:

05:56:11.042931042 [Warn  ] [KAFKA_PRODUCER] [anonymous] [7] [SEND] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: Sent MetadataRequest (v2, 516 bytes @ 0, CorrId 178040)
...
05:57:11.072364910 [Warn  ] [KAFKA_PRODUCER] [anonymous] [5] [REQTMOUT] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: Timed out MetadataRequest in flight (after 60029ms, timeout #0)
05:57:11.072370765 [Warn  ] [KAFKA_PRODUCER] [anonymous] [4] [REQTMOUT] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
05:57:11.072649158 [Warn  ] [KAFKA_PRODUCER] [anonymous] [7] [BROKERFAIL] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: failed: err: Local: Timed out: (errno: Connection timed out)
05:57:11.073120106 [Warn  ] [KAFKA_PRODUCER] [anonymous] [7] [RETRY] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: Retrying MetadataRequest (v2, 516 bytes, retry 1/2147483647, prev CorrId 178040) in 500ms

Between the time this metadata request was sent and the retry I haven't noticed any other errors. In fact I can see plenty of successful FetchRequests to the same broker. I have checked the server.log for this broker and haven't noticed any warnings or errors around that timeframe.

Now, that leads me to think it was some intermittent network issue. The request seems to be retried and that is fine. However, I am not sure what the best way to handle such a situation is. I am registering an error_cb and my alerting is based on that. Can I easily distinguish intermittent errors that librdkafka will retry?

In general, I would like to be able to react on errors that librdkafka was not able to handle internally.

edenhill commented 4 years ago

Since the client will automatically retry and recover there is nothing you need to do from the application's perspective.

note: this issue will be fixed in v1.5.0, not v1.4.2.

mateuszmrozewski commented 4 years ago

Thank you @edenhill. Once v1.5.0 is available I will retry.

Regarding the error handling: I understand and can confirm that the client automatically retries. My concern is how to distinguish errors that the client retries from those that can't be retried. It seems that the error callback is called in both cases.

edenhill commented 4 years ago

We're looking to improve the error propagation by indicating severity, retriability, etc. As of now you should consider all error_cb errors informational and not terminal, with the exception of ERR__AUTH, which indicates an authentication error that should probably be handled as fatal by the application.
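
A hedged sketch of that guidance as an error_cb: RD_KAFKA_RESP_ERR__AUTHENTICATION is the full name of the ERR__AUTH code mentioned above, and errors librdkafka itself flags as fatal (idempotent producer) arrive as RD_KAFKA_RESP_ERR__FATAL and can be inspected with rd_kafka_fatal_error(). The exact set of errors worth alerting on is application-specific.

#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Sketch only: treat error_cb events as informational (librdkafka retries
 * and recovers on its own); alert only on authentication failures and on
 * errors the client itself has marked as fatal. */
static void error_cb(rd_kafka_t *rk, int err, const char *reason, void *opaque) {
        rd_kafka_resp_err_t e = (rd_kafka_resp_err_t)err;

        if (e == RD_KAFKA_RESP_ERR__AUTHENTICATION) {
                fprintf(stderr, "ALERT auth failure: %s\n", reason);
        } else if (e == RD_KAFKA_RESP_ERR__FATAL) {
                char detail[512];
                rd_kafka_resp_err_t fatal =
                        rd_kafka_fatal_error(rk, detail, sizeof(detail));
                fprintf(stderr, "ALERT fatal: %s: %s\n",
                        rd_kafka_err2name(fatal), detail);
        } else {
                /* e.g. RD_KAFKA_RESP_ERR__TIMED_OUT (-185): retried internally */
                fprintf(stderr, "info: %s: %s\n", rd_kafka_err2name(e), reason);
        }
}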

mateuszmrozewski commented 4 years ago

Thank you, looking forward to the improvements :)

trungngonptit commented 1 year ago

Does anyone still face this problem with later releases? For me, on 1.9.2, "1 request(s) timed out: disconnect..." and "failed to acquire metadata: local: timed out" still occur when the producer is under high load (~3k msgs/s), each message is about 20 kB, and network throughput is about 10 Gbps.