IBM / sarama

Sarama is a Go library for Apache Kafka.
MIT License
11.57k stars 1.76k forks source link

Sync Producer stalled - Unable to send messages to Kafka after a write tcp: broken pipe error #2173

Closed shweta-fourkites closed 2 years ago

shweta-fourkites commented 2 years ago
Versions

Please specify real version numbers or git SHAs, not just "Latest" since that changes fairly regularly.

Sarama | Go | Kafka v1.31.1 | 1.16.6 | 2.2.1

Configuration

What configuration values are you using for Sarama and Kafka?

config := sarama.NewConfig() config.Producer.Return.Successes = true config.Version = sarama.V2_2_1_0

Logs

When filing an issue please provide logs from Sarama and Kafka if at all possible. You can set sarama.Logger to a log.Logger to capture Sarama debug output.

logs: CLICK ME

2022/03/09 21:04:40.888730 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint1:9092 2022/03/09 21:04:41.189656 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092 2022/03/09 21:04:41.295781 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092 2022/03/09 21:04:41.295838 client.go:929: client/metadata got error from broker 77 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe 2022/03/09 21:04:41.295914 sarama.go:122: Closed connection to broker test-broker-endpoint:9092 2022/03/09 21:04:41.295927 sarama.go:122: client/brokers deregistered broker #77 at test-broker-endpoint:9092 2022/03/09 21:04:41.295944 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint2:9092 2022/03/09 21:04:41.489721 sarama.go:122: client/brokers registered new broker #77 at test-broker-endpoint:9092 2022/03/09 21:14:40.889372 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092 2022/03/09 21:14:41.189707 config.go:559: ClientID is the default of 'sarama', you should consider setting it to something application-specific. 2022/03/09 21:14:41.189739 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092 2022/03/09 21:14:41.192789 sarama.go:122: Connected to broker at test-broker-endpoint:9092 (registered as #77) 2022/03/09 21:14:41.295503 config.go:559: ClientID is the default of 'sarama', you should consider setting it to something application-specific. 2022/03/09 21:14:41.295542 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092 2022/03/09 21:14:41.300616 sarama.go:122: Connected to broker at test-broker-endpoint3:9092.compute-1.amazonaws.com:9092 (registered as #74) 2022/03/09 21:24:40.888707 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092 2022/03/09 21:24:41.189268 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092 2022/03/09 21:24:41.189316 client.go:929: client/metadata got error from broker 74 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe 2022/03/09 21:24:41.189360 sarama.go:122: Closed connection to broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092 2022/03/09 21:24:41.189370 sarama.go:122: client/brokers deregistered broker #74 at test-broker-endpoint3:9092.compute-1.amazonaws.com:9092 2022/03/09 21:24:41.189376 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint2:9092 2022/03/09 21:24:41.189396 client.go:929: client/metadata got error from broker 47 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe 2022/03/09 21:24:41.189427 sarama.go:122: Closed connection to broker test-broker-endpoint2:9092 2022/03/09 21:24:41.189436 sarama.go:122: client/brokers deregistered broker #47 at test-broker-endpoint2:9092 2022/03/09 21:24:41.189441 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092 2022/03/09 21:24:41.232417 sarama.go:122: client/brokers registered new broker #47 at test-broker-endpoint2:9092 2022/03/09 21:24:41.232435 sarama.go:122: client/brokers registered new broker #74 at test-broker-endpoint3:9092.compute-1.amazonaws.com:9092 2022/03/09 21:24:41.295667 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092 2022/03/09 21:34:40.888928 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092 2022/03/09 21:34:41.189283 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint4.compute-1.amazonaws.com:9092 2022/03/09 21:34:41.189340 client.go:929: client/metadata got error from broker 48 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe 2022/03/09 21:34:41.189405 sarama.go:122: Closed connection to broker test-broker-endpoint4.compute-1.amazonaws.com:9092 2022/03/09 21:34:41.189414 sarama.go:122: client/brokers deregistered broker #48 at test-broker-endpoint4.compute-1.amazonaws.com:9092 2022/03/09 21:34:41.189420 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092 2022/03/09 21:34:41.235015 sarama.go:122: client/brokers registered new broker #48 at test-broker-endpoint4.compute-1.amazonaws.com:9092 2022/03/09 21:34:41.295952 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint5.compute-1.amazonaws.com:9092 2022/03/09 21:34:41.296034 client.go:929: client/metadata got error from broker 75 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe 2022/03/09 21:34:41.296111 sarama.go:122: Closed connection to broker test-broker-endpoint5.compute-1.amazonaws.com:9092 2022/03/09 21:34:41.296129 sarama.go:122: client/brokers deregistered broker #75 at test-broker-endpoint5.compute-1.amazonaws.com:9092 2022/03/09 21:34:41.296143 config.go:559: ClientID is the default of 'sarama', you should consider setting it to something application-specific. 2022/03/09 21:34:41.296194 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092 2022/03/09 21:34:41.299010 sarama.go:122: Connected to broker at test-broker-endpoint:9092 (registered as #77) 2022/03/09 21:34:41.346014 sarama.go:122: client/brokers registered new broker #75 at test-broker-endpoint5.compute-1.amazonaws.com:9092 2022/03/09 21:44:40.889425 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092

``` ```

Problem Description

We are using SyncProducer to push messages to a destination kafka topic. Sometimes, we see the write tcp host:47562->host:9092: write: broken pipe error in logs which is printed after the timeout. Once this happens, the metadata fetch happens every 10 mins (default), nothing else is happening between the fetches. The producer is never able to send messages after this. The producer is just struck.

On restarting the application, the producer is able to obtain connection again and start sending the messages (Until this happens again!)

Has anybody faced this? Does it have to do with any producer configuration that we missed overriding? The expectation is for the producer to connect back to broker and start sending messages automatically without requiring an app restart.

shweta-fourkites commented 2 years ago

@dnwe @bai @slaunay

slaunay commented 2 years ago

@shweta-fourkites if you are using Samara 1.31.1, this is likely a duplicate of #2129 (producer deadlock) that was fixed in #2133. The fix is available in 1.32.0 released recently but not yet listed in the CHANGELOG.md.

There is also another similar issue #2150 that has been difficult to reproduce where the producer would hang and a record with a nil key and a nil value is produced to the target topic.

shweta-fourkites commented 2 years ago

@slaunay Thank you for checking this. I will update the version to 1.32.0 and see if that helps..

shweta-fourkites commented 2 years ago

@slaunay , with version v1.32.0

2022/03/14 15:54:42.498472 async_producer.go:611: producer/leader/testtopic/4 state change to [retrying-1] 2022/03/14 15:54:42.498480 async_producer.go:621: producer/leader/testtopic/4 abandoning broker 47 2022/03/14 15:54:42.498651 async_producer.go:809: producer/broker/47 input chan closed 2022/03/14 15:54:42.498691 async_producer.go:899: producer/broker/47 shut down 2022/03/14 15:54:42.599000 sarama.go:122: client/metadata fetching metadata for [testtopic] from broker testendpoint 2022/03/14 15:57:39.017057 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint 2022/03/14 15:57:39.315335 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint1 2022/03/14 15:57:39.623175 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint 2022/03/14 16:07:39.017423 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint 2022/03/14 16:07:39.315637 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint1 2022/03/14 16:07:39.623326 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint

The connection is not re-established unless i restart my application

shweta-fourkites commented 2 years ago

@slaunay @dnwe is this expected even after the fix in V1.32.0? After upgrading to 1.32.0 I see the frequency of this has reduced but still it is happening for few consumers in our cluster. If this is expected, is there a plan to fix this in the near future?

slaunay commented 2 years ago

@shweta-fourkites I was able to reproduce what looks like #2150 yesterday.

There seem to be a race condition during retries where:

It's not clear to me if this was also possible before honouring produce request pipelining.

Can you confirm that when the producer stops producing to that particular topic partition, a Kafka record with a null key and value was produced in the target Kafka topic?

I was going to submit a simple test case to reproduce it with a fix but there seem to be a fix already in #2182.

@dnwe Should we add another retract directive for v1.32.0 referencing #2150 (and marking that issue as fixed) when v1.32.1 is released and if #2182 gets merged?

shweta-fourkites commented 2 years ago

Hi @slaunay @niamster @dnwe , I patched the changes in https://github.com/Shopify/sarama/pull/2182 and I am no longer facing the issue! Are we planning to get this merged and release a new version with the fix anytime soon?

slaunay commented 2 years ago

Thanks a lot for testing the fix @shweta-fourkites and confirming that the issue is fixed.

I'm not a maintainer for Sarama but I believe the goal is indeed to have the fix merged, add a retract for v1.32.0 and release a new version.

softwarespot commented 2 years ago

For now we have reverted back to 1.30.1 until this has been released

mkevac commented 2 years ago

Decided to downgrade to 1.30.1 too as I have stumbled upon same bug.

eafzali commented 2 years ago

We are also still seeing the issue happening in 1.32.0, it's happening less often than 1.31.0 but still happens from time to time. Is there any plans for releasing the fix?

dnwe commented 2 years ago

@eafzali we are basically ready to cut a release, but we were waiting for more confirmation that github.com/Shopify/Sarama@main fixed the issue for people before cutting the release. Have you managed to test with main?

eafzali commented 2 years ago

Not yet, we have switched back to 1.30.1 and so far we didn't have a problem. We can try out main but it's really hard for us to "confirm" anything as it randomly happens for us once or twice a week. I guess related to the stability of our Kafka Cluster :)

shweta-fourkites commented 2 years ago

@dnwe any idea when the changes from https://github.com/Shopify/sarama/pull/2182 can be released to a stable branch for use in production?

david-bergman commented 2 years ago

experiencing the same issue with sarama v1.37.2 connecting to Confluent cloud kafka instances. write: broken pipe

shmilyoo commented 1 year ago

same error in v1.37.2 producer on error callback : kafka: Failed to produce message to topic apm-aidc-aggs-sync: write tcp ip:port->kafka-ip:9092: write: broken pipe

david-bergman commented 1 year ago

issue continues with sarama v1.38.1... write tcp ... write: broken pipe

david-bergman commented 1 year ago

perhaps with the backing of IBM this issue can finally be addressed ? using v1.38.2-0.20230327141928-9127f1c432c0