mateuszmrozewski closed this issue 4 years ago.
The first thing you should do is upgrade to v1.4.0; there have been a large number of fixes since v1.0.0, and we only provide free community support for the latest release.
Answers to your questions:
Thank you for a quick response.
I have enable.idempotence set to true and message.send.max.retries is not changed, so it defaults to 2.
Would it be correct to assume that these requests are retried and succeed unless follow-up errors are reported? The log messages are warnings, so that would sound very reasonable.
As for the metadata requests taking long: we have a large number of topics (more than 5k), some of them with 3-6 partitions - I would say ~7k partitions in total. I suppose that might be a potential issue here. I will try to get more detailed logging for this.
And yes, agreed, we should upgrade to the latest version :)
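For reference, the settings described above could be expressed with the confluent-kafka Python client roughly as follows (a sketch; the broker address is a placeholder, not from this issue):

```python
# Sketch of the producer settings discussed above, as a confluent-kafka
# (Python) configuration dict.
producer_conf = {
    "bootstrap.servers": "kafka1:9093",  # placeholder
    "enable.idempotence": True,
    # "message.send.max.retries" is left at its default; with idempotence
    # enabled, retried ProduceRequests are deduplicated by the broker, so
    # a retry after a timeout does not create duplicate messages.
}
```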
With more detailed logs I managed to find this:
08:22:05.863779912 [Warn ] [KAFKA_PRODUCER] [anonymous] [7] [SEND] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Sent ProduceRequest (v7, 1714 bytes @ 0, CorrId 656)
08:22:05.879245082 [Warn ] [KAFKA_PRODUCER] [anonymous] [7] [RECV] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Received ProduceResponse (v7, 57 bytes, CorrId 656, rtt 15.46ms)
...
08:41:26.223728743 [Warn ] [KAFKA_PRODUCER] [anonymous] [7] [SEND] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Sent ProduceRequest (v7, 3145 bytes @ 0, CorrId 657)
08:42:27.170875331 [Warn ] [KAFKA_PRODUCER] [anonymous] [5] [REQTMOUT] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Timed out ProduceRequest in flight (after 60947ms, timeout #0)
08:42:27.672100720 [Warn ] [KAFKA_PRODUCER] [anonymous] [7] [SEND] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Sent ProduceRequest (v7, 3145 bytes @ 0, CorrId 660)
08:42:27.689210747 [Warn ] [KAFKA_PRODUCER] [anonymous] [7] [RECV] [thrd:sasl_plaintext://kafka3:9093/3]: sasl_plaintext://kafka3:9093/3: Received ProduceResponse (v7, 57 bytes, CorrId 660, rtt 17.11ms)
It seems that this producer didn't send any messages between 08:22:05 and 08:41:26. This might be ok from the business logic point of view. Between those logs I can see many FetchRequests and matching FetchResponses sent to the same broker. That would suggest the connection to that specific broker is fine from the network point of view. I can also see that the next ProduceRequest to this broker that gets a successful response has CorrId 660.
A few questions on top of that: my working assumption is that the broker closed the idle connection after connections.max.idle.ms expired, and the timed-out ProduceRequest was sent on that dead connection. Is that a valid assumption? Thank you
I recommend that you upgrade to v1.4.0; there have been plenty of bug fixes since v1.0.0. Or wait a couple of days and upgrade to v1.4.2, which we'll hopefully release tomorrow.
If it is reproducible on 1.4.x we'll troubleshoot this further.
Hi edenhill
I'm facing a similar issue with a metadata request timeout "Timed out MetadataRequest in flight".
In my setup I'm publishing events to two topics (topic1 and topic_deadletterqueue). On topic_deadletterqueue I rarely publish any events (maybe once a day), and this is where I experience the error. I'm using the confluent-kafka library in Python to interact with Azure Event Hubs, using this recommended config: https://github.com/Azure/azure-event-hubs-for-kafka/blob/master/CONFIGURATION.md#producer-configurations-only.
Do you have any suggestions?
LOGS
%7|1588245108.945|SASLMECHS|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Broker supported SASL mechanisms: PLAIN,OAUTHBEARER
%7|1588245108.945|SEND|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Sent SaslAuthenticateRequest (v0, 190 bytes @ 0, CorrId 4)
%7|1588245108.947|RECV|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Received SaslAuthenticateResponse (v0, 8 bytes, CorrId 4, rtt 1.70ms)
%7|1588245108.947|SEND|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Sent MetadataRequest (v2, 30 bytes @ 0, CorrId 5)
%7|1588245109.091|RECV|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/bootstrap: Received MetadataResponse (v2, 728 bytes, CorrId 5, rtt 144.39ms)
%7|1588245408.894|SEND|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Sent MetadataRequest (v2, 30 bytes @ 0, CorrId 6)
%5|1588245469.421|REQTMOUT|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Timed out MetadataRequest in flight (after 60526ms, timeout #0)
%4|1588245469.421|REQTMOUT|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%7|1588245469.421|BROKERFAIL|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: failed: err: Local: Timed out: (errno: Connection timed out)
%7|1588245469.421|RETRY|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Retrying MetadataRequest (v2, 30 bytes, retry 1/5, prev CorrId 6) in 100ms
%7|1588245470.490|CONNECTED|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Connected (#2)
%7|1588245470.490|SEND|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Sent ApiVersionRequest (v3, 71 bytes @ 0, CorrId 7)
%7|1588245470.492|RECV|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Received ApiVersionResponse (v3, 10 bytes, CorrId 7, rtt 1.42ms)
%7|1588245470.492|PROTOERR|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: Protocol parse failure for ApiVersion v3 at 3/10 (rd_kafka_handle_ApiVersion:1911) (incorrect broker.version.fallback?)
%7|1588245470.492|PROTOERR|client.id#producer-1| [thrd:sasl_ssl://bootstrap.servers/bootstrap]: sasl_ssl://bootstrap.servers/0: ApiArrayCnt -1 out of range
CHECKLIST
librdkafka version (release number or git tag): v1.4.1
Apache Kafka version: None. Using Event Hubs in Microsoft Azure.
librdkafka client configuration:
Operating system: ubuntu:19.10
Provide logs (with debug=.. as necessary) from librdkafka - Check!
Provide broker log excerpts - nothing reasonable on broker side
Critical issue - yes
Best, Joakim
Joakim, we've seen a couple of similar reports from users on Azure and we can't really provide an explanation: something stalls the request/response until it times out. We believe this to be outside the client, so most likely something in Azure. I recommend opening an issue with the Azure folks.
Thanks for a swift response. I'm talking to Azure already. So far they have not been able to identify the problem. But I will let you know what they come up with.
Thank you for the comments @edenhill. I will be updating to v1.4.2 when it's available and I will try to debug the issue again. I will raise a new issue if needed.
@edenhill I got an answer from Azure:
The service closes idle connections (idle here means no request received for 10 minutes). This happens even when TCP keep-alive is enabled, because that setting only keeps the connection open but does not generate protocol requests. In your case, you have only observed the error on the idle event hub, so it may be related to the idle connection being closed and the Kafka client SDK not handling that correctly.
So they asked me to reach out to understand how librdkafka keeps the connection alive if no events are sent to the event hub for 10 minutes or more.
Thanks
Apart from the periodic metadata refresh (topic.metadata.refresh.interval.ms), which will use a random connection, there is no Kafka protocol-level "keepalive".
When sending the periodic metadata refreshes (which is what is timing out here) we could try to weigh the available connections by idle time and use the connection which was most recently used. This will hopefully avoid using a connection that has timed out on the load balancer, and also let truly idle connections time out, as they won't get used for the metadata refreshes.
As for the idle connection closing, I suspect their load balancers are silently dropping the connection without sending an RST or FIN to the client, so there is no way for the client TCP stack to know the connection was closed until a request (or TCP keepalive) is sent. If the connection was properly TCP-closed, the client would be notified of the close instantly.
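The "weigh connections by idle time" idea above can be sketched as follows (illustrative Python, not librdkafka's actual implementation): track the last-used timestamp per connection and pick the one with the smallest idle time for "any broker" requests like metadata refreshes.

```python
import time

def pick_least_idle(connections, now=None):
    """Pick the broker whose connection was most recently used.

    connections: dict mapping broker name -> last-used timestamp (seconds).
    The most recently used connection is the least likely to have been
    silently reaped by a load balancer, and truly idle connections are
    left alone so they can time out naturally.
    """
    now = time.time() if now is None else now
    return min(connections, key=lambda broker: now - connections[broker])

# Usage: broker2 was used 60s ago, the others are far more idle.
conns = {"broker1": 100.0, "broker2": 940.0, "broker3": 500.0}
assert pick_least_idle(conns, now=1000.0) == "broker2"
```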
We have been following this thread as well. We are operating Kafka on Azure and we are having trouble producing to confluent cloud occasionally. I wanted to confirm that we have upgraded our client to v1.4.2 and the issue still exists. We are seeing in our logs that the client is timing out. I have logs if you're interested.
Is there anything in your logs that contradicts the idle connection reaper hypothesis?
No, in fact I believe that makes a lot of sense here. How far off are we from a solution?
The fix will go into v1.5.0 which is scheduled for early June.
These errors should not have much effect though, it is just a periodic metadata refresh that gets dropped.
Part of our problem is that it freezes the client for 60 seconds, limiting our ability to respond to the original request in a timely manner. The only way around this would be to produce inside another thread. Is this the suggested route?
What exactly is freezing? A missed metadata refresh and a closed idle connection shouldn't cause any freezes.
We're using the .net client and the call to Produce() does not return for 60 seconds.
Can you provide your producer config, reproduce this with Debug: "topic,msg,metadata,broker", and provide the logs?
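For anyone following along with the Python client, the requested debug setting is a plain configuration property (a sketch; the broker address is a placeholder, and the .NET client exposes the same thing as the `Debug` configuration value):

```python
# Sketch: enabling the requested debug contexts in a confluent-kafka
# (Python) configuration dict.
conf = {
    "bootstrap.servers": "broker:9092",    # placeholder
    "debug": "topic,msg,metadata,broker",  # the contexts requested above
}
```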
We were able to successfully cause a producer to freeze. Please note that this is an intermittent issue and took about 10 produce calls to make it happen. This froze the call for about 40 seconds. I included all debugging information from the start until a bit after the problem occurred. You'll find the relevant portion from line 4964 to line 5099 of the attached file. Our producer config has all defaults except for the connection settings and the debug setting you instructed us to insert above.
"Debug": "topic,msg,metadata,broker",
"BootstrapServers": "....eastus.azure.confluent.cloud:9092",
"SaslMechanism": "Plain",
"SecurityProtocol": "SaslSsl",
"SaslUsername": "...",
"SaslPassword": "..."
logs.txt
Thanks for your help.
Thank you!
So I need to revise my previous statement. The client is mostly idle and the Azure load balancer is silently (without RST/FIN) killing off the client's idle connections, but the client won't know about it until it attempts to use the connection, at which point the requests will time out due to not getting any TCP ACKs (or RSTs, for that matter) back. While it is a MetadataRequest that is timing out, which in itself doesn't have much adverse effect, it is head-of-line blocking the subsequent produce request on the same connection, and this is why your synchronous produce is blocking/freezing.
There isn't that much the client can do about it short of lowering the socket.timeout.ms/request.timeout.ms. For these things to function properly it would be useful if the Azure load balancer either actively closed idle connections through FIN/RST or at least actively refused packets to unknown TCP sessions so the client could identify the socket as closed much quicker than the Kafka protocol timeout.
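As a hedged illustration of the timeout-lowering workaround mentioned above (the values are examples, not recommendations, and the broker address is a placeholder):

```python
# Sketch: shortening socket.timeout.ms so a request stuck on a
# silently-dropped connection fails - and the connection gets torn down
# and re-established - sooner than the default 60s.
conf = {
    "bootstrap.servers": "broker:9092",  # placeholder
    "socket.timeout.ms": 10000,          # librdkafka default is 60000
    # With enable.idempotence=true a timed-out ProduceRequest can be
    # retried safely, so a shorter timeout mainly trades long stalls
    # for extra retries.
    "enable.idempotence": True,
}
```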
Thanks for the assistance. Unbeknownst to me, we were using an Azure load balancer that was not configured to send TCP reset on idle. We're working on this change and we expect that this will remedy the issue.
Let us know how it goes :+1:
Having the same issue with librdkafka 1.4.2 connecting to Azure Event Hubs for Kafka.
I have an application creating a Producer and a Consumer. The Producer is not producing any messages. The Consumer is consuming a message with C# Confluent.Kafka IConsumer.Consume(cancellationToken), and no messages are arriving in the topic.
The producer goes down after 5 minutes: https://gist.github.com/algattik/f9926b1d58a2c9b12a7a99a717bcf8b0 (with Debug: "topic,msg,metadata,broker")
When setting topic.metadata.refresh.interval.ms to 60000 (1 minute instead of the default 10 minutes) that error disappears.
However the consumer goes down after 30 minutes: https://gist.github.com/algattik/c16375baeeb824e8b5ed948cdafbe22c
I'm writing back with an intermediate result. We didn't have much luck configuring the Azure load balancer. Even with the configurable version of their load balancer (the standard version, vs the basic one), the send TCP reset setting on, and the idle timeout set to 30 minutes, we were still seeing metadata timeouts 4-6 minutes after starting the application. This indicates that the load balancer settings were not working for us. After finding recommended configuration settings for Azure Event Hubs (Microsoft's own Kafka cloud implementation) here, we added the following settings to consumers and producers and are seeing success.
Producers:
socket.keepalive.enable: true
metadata.max.age.ms: 180000
connections.max.idle.ms: 180000
max.request.size: 1000000
metadata.max.idle.ms: 180000

Consumers:
socket.keepalive.enable: true
metadata.max.age.ms: 180000
connections.max.idle.ms: 180000
heartbeat.interval.ms: 3000
session.timeout.ms: 30000
This is less than ideal, but really isn't the end of the world. I'm just glad it's working.
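For reference, a hedged sketch of those workaround settings as Python-client config dicts. Caveat: `connections.max.idle.ms` and `metadata.max.idle.ms` are Java-client settings that librdkafka 1.4.x does not recognize (and it rejects unknown properties), and `max.request.size` is the Java name for what librdkafka calls `message.max.bytes`; only librdkafka-valid keys are kept below.

```python
# Sketch of the Event Hubs-style workaround settings for the Python client,
# restricted to properties librdkafka actually accepts.
producer_conf = {
    "socket.keepalive.enable": True,   # TCP keepalive on broker sockets
    "metadata.max.age.ms": 180000,     # refresh metadata inside the LB idle window
    "message.max.bytes": 1000000,      # librdkafka analog of max.request.size
}
consumer_conf = {
    "socket.keepalive.enable": True,
    "metadata.max.age.ms": 180000,
    "heartbeat.interval.ms": 3000,
    "session.timeout.ms": 30000,
}
```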
Unfortunately, that attempted fix did not end up working for producers. The referenced document did not end up helping much because the settings we were hoping would help are actually targeted at the Java client and are not available in librdkafka, nor in the C# library we are using. I'm still trying to solve this issue by going through Azure support to see if we configured our load balancer incorrectly, but I'm not sure fixing the load balancer on our side will even correct the issue, because the other side (Confluent Cloud) is on Azure as well (and they may not have addressed this timeout problem on their load balancers). What would a recommended producer configuration look like that would ensure the connection transmits data before Azure drops the connection after its default 4 minute idle timeout expires?
The fix in https://github.com/edenhill/librdkafka/commit/052745727ce86c2c36c7cc3c4aefc181c5da88d8 will prefer to use the least idle connection when selecting "any" broker connection for metadata refresh, etc. This in combination with a topic.metadata.refresh.interval.ms < the LB/Kafka idle connection timeout should hopefully minimize the number of these timeout-due-to-killed-connection errors.
@edenhill, I have upgraded to 1.4.2. After the upgrade I have received similar error once:
05:56:11.042931042 [Warn ] [KAFKA_PRODUCER] [anonymous] [7] [SEND] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: Sent MetadataRequest (v2, 516 bytes @ 0, CorrId 178040)
...
05:57:11.072364910 [Warn ] [KAFKA_PRODUCER] [anonymous] [5] [REQTMOUT] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: Timed out MetadataRequest in flight (after 60029ms, timeout #0)
05:57:11.072370765 [Warn ] [KAFKA_PRODUCER] [anonymous] [4] [REQTMOUT] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
05:57:11.072649158 [Warn ] [KAFKA_PRODUCER] [anonymous] [7] [BROKERFAIL] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: failed: err: Local: Timed out: (errno: Connection timed out)
05:57:11.073120106 [Warn ] [KAFKA_PRODUCER] [anonymous] [7] [RETRY] [thrd:sasl_plaintext://kafka1:9093/2]: sasl_plaintext://kafka1:9093/2: Retrying MetadataRequest (v2, 516 bytes, retry 1/2147483647, prev CorrId 178040) in 500ms
Between the time this metadata request was sent and the retry, I haven't noticed any other errors. In fact I can see plenty of successful FetchRequests to the same broker. I have checked the server.log for this broker and haven't noticed any warnings or errors around that timeframe.
Now, that leads me to think it was some intermittent network issue. The request seems to be retried, and that is fine. However, I am not sure what the best way to handle such a situation is. I am registering an error_cb and my alerting is based on that. Can I easily distinguish intermittent errors that will be retried by librdkafka?
In general, I would like to be able to react to errors that librdkafka was not able to handle internally.
Since the client will automatically retry and recover there is nothing you need to do from the application's perspective.
note: this issue will be fixed in v1.5.0, not v1.4.2.
Thank you @edenhill. Once v1.5.0 is available I will retry.
Regarding the error handling: I understand and can confirm that the client automatically retries. My concern is how to distinguish errors that the client retries from those that can't be retried. It seems that the error callback is called in both cases.
We're looking to improve the error propagation by indicating severity, retriability, etc. As of now you should consider all error_cb errors as informational and not terminal, with the exception of ERR__AUTH which indicates an authentication error that should probably be handled as fatal by the application.
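A minimal sketch of that guidance for an error_cb follows. The constant below is a stand-in for this sketch; with the confluent-kafka Python client you would compare err.code() against the client's authentication error code instead.

```python
# Sketch: classify error_cb errors per the guidance above - informational
# unless authentication-related. ERR_AUTH is a stand-in for this sketch,
# not librdkafka's actual error constant.
ERR_AUTH = "authentication"

def classify(error_kind):
    """Return 'fatal' for auth errors, 'informational' for everything else."""
    if error_kind == ERR_AUTH:
        # Likely terminal: credentials are wrong, so retrying won't help.
        return "fatal"
    # Timeouts, transport errors, etc. are retried internally by
    # librdkafka, so the application only needs to log them.
    return "informational"
```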
Thank you, looking forward to the improvements :)
Does anyone still face this problem with later releases? For me on 1.9.2, "1 request(s) timed out: disconnect..." and "Failed to acquire metadata: Local: Timed out" still occur when the producer is at high load (~3k msgs/s); each msg is about 20 kB and network throughput is about 10 Gbps.
Description
I am observing the following warnings in logs:
which effectively results in the error callback being called with code -185 and reason:
sasl_plaintext://kafka1:9093/1: 1 request(s) timed out: disconnect (after 907686ms in state UP)
I see similar errors for ProduceRequest as well.
I have modified the configuration to 'debug=protocol' to get more detailed logs and be able to trace particular requests that time out.
I have a few questions around that:
Is debug=protocol the right setting to get more information about this issue?
If this happens to a ProduceRequest, will I lose data?
Is changing socket.timeout.ms safe to do here as a remediation?
Let me know if providing more details would help to answer the questions. Thank you.
How to reproduce
No way of consistent reproduction found.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
librdkafka version (release number or git tag): v1.0.0
Apache Kafka version: 2.1
Operating system: CentOS 7 (x64)
Provide logs (with debug=.. as necessary) from librdkafka - waiting for reproduction with 'debug=protocol'