Closed — shweta-fourkites closed this issue 2 years ago
@dnwe @bai @slaunay
@shweta-fourkites if you are using Sarama 1.31.1, this is likely a duplicate of #2129 (producer deadlock) that was fixed in #2133. The fix is available in 1.32.0, which was released recently but is not yet listed in the CHANGELOG.md.
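For anyone following along, upgrading should just be a matter of bumping the module version, e.g.:

```sh
go get github.com/Shopify/sarama@v1.32.0
```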
There is also another, similar issue, #2150, that has been difficult to reproduce, where the producer hangs and a record with a nil key and a nil value is produced to the target topic.
@slaunay Thank you for checking this. I will update the version to 1.32.0 and see if that helps.
@slaunay, with version v1.32.0 I am still seeing the issue. Here are the logs:
2022/03/14 15:54:42.498472 async_producer.go:611: producer/leader/testtopic/4 state change to [retrying-1]
2022/03/14 15:54:42.498480 async_producer.go:621: producer/leader/testtopic/4 abandoning broker 47
2022/03/14 15:54:42.498651 async_producer.go:809: producer/broker/47 input chan closed
2022/03/14 15:54:42.498691 async_producer.go:899: producer/broker/47 shut down
2022/03/14 15:54:42.599000 sarama.go:122: client/metadata fetching metadata for [testtopic] from broker testendpoint
2022/03/14 15:57:39.017057 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint
2022/03/14 15:57:39.315335 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint1
2022/03/14 15:57:39.623175 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint
2022/03/14 16:07:39.017423 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint
2022/03/14 16:07:39.315637 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint1
2022/03/14 16:07:39.623326 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint
The connection is not re-established unless I restart my application.
@slaunay @dnwe is this expected even after the fix in v1.32.0? After upgrading to 1.32.0 the frequency has reduced, but it is still happening for a few consumers in our cluster. If this is expected, is there a plan to fix it in the near future?
@shweta-fourkites I was able to reproduce what looks like #2150 yesterday.
There seems to be a race condition during retries where:
- a brokerProducer is shutting down after getting an I/O error and the retry logic is triggered
- from the partitionProducer, a message with the fin flag is sent to the "unhealthy" brokerProducer and that brokerProducer is abandoned (closed)
- a new brokerProducer is then created
- the old brokerProducer (being closed) fails again, so the retry logic is triggered again
- from the partitionProducer, a message with the fin flag is sent again, but this time to the "healthy" brokerProducer
- that brokerProducer treats the fin flag as a regular message and uses it in the next ProduceRequest
- the fin message therefore never bubbles back up to the partitionProducer, which keeps buffering new messages until it sees that fin message
- because the fin message escaped to the target Kafka topic, the partitionProducer is not producing anymore (just buffering)

It's not clear to me if this was also possible before honouring produce request pipelining.
Can you confirm that when the producer stops producing to that particular topic partition, a Kafka record with a null key and a null value was produced to the target Kafka topic?
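For reference, here is a minimal sketch of one way to check for such a record; the broker address, topic name, and partition are placeholders, not taken from this thread:

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Read the partition from the beginning and report any record with a nil key and nil value.
	pc, err := consumer.ConsumePartition("testtopic", 0, sarama.OffsetOldest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	// This loop tails the partition indefinitely; stop it manually once the check is done.
	for msg := range pc.Messages() {
		if msg.Key == nil && msg.Value == nil {
			fmt.Printf("nil key/value record at offset %d\n", msg.Offset)
		}
	}
}
```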
I was going to submit a simple test case to reproduce it along with a fix, but there seems to be a fix already in #2182.
@dnwe Should we add another retract directive for v1.32.0 referencing #2150 (and marking that issue as fixed) when v1.32.1 is released, if #2182 gets merged?
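For illustration only, such a retract directive in Sarama's go.mod might look roughly like this; the rationale comment is a placeholder:

```
retract (
	v1.32.0 // producer can hang during retries, see #2150
)
```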
Hi @slaunay @niamster @dnwe , I patched the changes in https://github.com/Shopify/sarama/pull/2182 and I am no longer facing the issue! Are we planning to get this merged and release a new version with the fix anytime soon?
Thanks a lot for testing the fix @shweta-fourkites and confirming that the issue is fixed.
I'm not a maintainer for Sarama but I believe the goal is indeed to have the fix merged, add a retract for v1.32.0
and release a new version.
For now we have reverted to 1.30.1 until this has been released.
Decided to downgrade to 1.30.1 too, as I have stumbled upon the same bug.
We are also still seeing the issue in 1.32.0; it happens less often than in 1.31.0 but still from time to time. Are there any plans for releasing the fix?
@eafzali we are basically ready to cut a release, but we were waiting for more confirmation that github.com/Shopify/sarama@main fixed the issue for people before cutting the release. Have you managed to test with main?
Not yet, we have switched back to 1.30.1 and so far we haven't had a problem. We can try out main, but it's really hard for us to "confirm" anything as the issue only happens for us once or twice a week, at random. I guess that is related to the stability of our Kafka cluster :)
@dnwe any idea when the changes from https://github.com/Shopify/sarama/pull/2182 can be released to a stable branch for use in production?
Experiencing the same issue with sarama v1.37.2 connecting to Confluent Cloud Kafka instances: write: broken pipe
Same error in v1.37.2, seen in the producer error callback: kafka: Failed to produce message to topic apm-aidc-aggs-sync: write tcp ip:port->kafka-ip:9092: write: broken pipe
issue continues with sarama v1.38.1... write tcp ... write: broken pipe
Perhaps with the backing of IBM this issue can finally be addressed? Using v1.38.2-0.20230327141928-9127f1c432c0.
Versions
Please specify real version numbers or git SHAs, not just "Latest" since that changes fairly regularly.
| Sarama | Go | Kafka |
|---------|--------|-------|
| v1.31.1 | 1.16.6 | 2.2.1 |
Configuration
What configuration values are you using for Sarama and Kafka?
config := sarama.NewConfig()
config.Producer.Return.Successes = true
config.Version = sarama.V2_2_1_0
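For context, a minimal sketch of how such a configuration is typically used with a SyncProducer; the broker address and topic name are placeholders, not taken from the report:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	config.Producer.Return.Successes = true // required by SyncProducer
	config.Version = sarama.V2_2_1_0

	producer, err := sarama.NewSyncProducer([]string{"broker:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// SendMessage blocks until the broker acknowledges (or returns an error).
	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "testtopic",
		Value: sarama.StringEncoder("hello"),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stored at partition %d, offset %d", partition, offset)
}
```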
Logs
When filing an issue please provide logs from Sarama and Kafka if at all possible. You can set sarama.Logger to a log.Logger to capture Sarama debug output.
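For example, a minimal sketch of wiring up the debug logger before constructing any clients; the flag choice is just one reasonable assumption:

```go
package main

import (
	"log"
	"os"

	"github.com/Shopify/sarama"
)

func main() {
	// Route Sarama's internal debug output to stderr with microsecond timestamps,
	// matching the granularity of the log excerpt below.
	sarama.Logger = log.New(os.Stderr, "", log.LstdFlags|log.Lmicroseconds)

	// ... create the SyncProducer and produce as usual ...
}
```

The debug logs captured for this issue follow.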
2022/03/09 21:04:40.888730 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint1:9092
2022/03/09 21:04:41.189656 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:04:41.295781 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:04:41.295838 client.go:929: client/metadata got error from broker 77 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:04:41.295914 sarama.go:122: Closed connection to broker test-broker-endpoint:9092
2022/03/09 21:04:41.295927 sarama.go:122: client/brokers deregistered broker #77 at test-broker-endpoint:9092
2022/03/09 21:04:41.295944 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint2:9092
2022/03/09 21:04:41.489721 sarama.go:122: client/brokers registered new broker #77 at test-broker-endpoint:9092
2022/03/09 21:14:40.889372 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092
2022/03/09 21:14:41.189707 config.go:559: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2022/03/09 21:14:41.189739 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:14:41.192789 sarama.go:122: Connected to broker at test-broker-endpoint:9092 (registered as #77)
2022/03/09 21:14:41.295503 config.go:559: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2022/03/09 21:14:41.295542 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:14:41.300616 sarama.go:122: Connected to broker at test-broker-endpoint3:9092.compute-1.amazonaws.com:9092 (registered as #74)
2022/03/09 21:24:40.888707 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092
2022/03/09 21:24:41.189268 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:24:41.189316 client.go:929: client/metadata got error from broker 74 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:24:41.189360 sarama.go:122: Closed connection to broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:24:41.189370 sarama.go:122: client/brokers deregistered broker #74 at test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:24:41.189376 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint2:9092
2022/03/09 21:24:41.189396 client.go:929: client/metadata got error from broker 47 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:24:41.189427 sarama.go:122: Closed connection to broker test-broker-endpoint2:9092
2022/03/09 21:24:41.189436 sarama.go:122: client/brokers deregistered broker #47 at test-broker-endpoint2:9092
2022/03/09 21:24:41.189441 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:24:41.232417 sarama.go:122: client/brokers registered new broker #47 at test-broker-endpoint2:9092
2022/03/09 21:24:41.232435 sarama.go:122: client/brokers registered new broker #74 at test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:24:41.295667 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:34:40.888928 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092
2022/03/09 21:34:41.189283 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint4.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.189340 client.go:929: client/metadata got error from broker 48 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:34:41.189405 sarama.go:122: Closed connection to broker test-broker-endpoint4.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.189414 sarama.go:122: client/brokers deregistered broker #48 at test-broker-endpoint4.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.189420 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:34:41.235015 sarama.go:122: client/brokers registered new broker #48 at test-broker-endpoint4.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.295952 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint5.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.296034 client.go:929: client/metadata got error from broker 75 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:34:41.296111 sarama.go:122: Closed connection to broker test-broker-endpoint5.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.296129 sarama.go:122: client/brokers deregistered broker #75 at test-broker-endpoint5.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.296143 config.go:559: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2022/03/09 21:34:41.296194 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:34:41.299010 sarama.go:122: Connected to broker at test-broker-endpoint:9092 (registered as #77)
2022/03/09 21:34:41.346014 sarama.go:122: client/brokers registered new broker #75 at test-broker-endpoint5.compute-1.amazonaws.com:9092
2022/03/09 21:44:40.889425 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092
Problem Description
We are using a SyncProducer to push messages to a destination Kafka topic. Sometimes we see the error write tcp host:47562->host:9092: write: broken pipe in the logs, printed after the timeout. Once this happens, the metadata fetch happens every 10 minutes (the default) and nothing else happens between the fetches. The producer is never able to send messages after this; it is simply stuck.
On restarting the application, the producer is able to obtain a connection again and starts sending messages (until this happens again!).
Has anybody faced this? Does it have to do with any producer configuration that we missed overriding? The expectation is for the producer to reconnect to the broker and resume sending messages automatically, without requiring an application restart.
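Not an answer to the underlying bug, but as a stopgap, one possible workaround is to rebuild the SyncProducer after repeated send failures, on the assumption (from the report above) that recreating the producer has the same effect as restarting the application. A sketch only; the helper name and failure threshold are hypothetical, not part of Sarama:

```go
package kafkautil

import "github.com/Shopify/sarama"

// sendWithRecreate is a hypothetical helper: it retries SendMessage and, after
// maxFailures consecutive errors, closes the (possibly stuck) SyncProducer and
// builds a fresh one, mimicking an application restart.
func sendWithRecreate(brokers []string, config *sarama.Config, producer sarama.SyncProducer,
	msg *sarama.ProducerMessage, maxFailures int) (sarama.SyncProducer, error) {

	var lastErr error
	for i := 0; i < maxFailures; i++ {
		if _, _, lastErr = producer.SendMessage(msg); lastErr == nil {
			return producer, nil
		}
	}

	// Too many consecutive failures: drop the old producer and start over.
	_ = producer.Close()
	fresh, err := sarama.NewSyncProducer(brokers, config)
	if err != nil {
		return nil, err // caller must create a new producer before retrying
	}
	_, _, err = fresh.SendMessage(msg)
	return fresh, err
}
```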