IBM / sarama

Sarama is a Go library for Apache Kafka.

ErrUnknownProducerID should be retried #1737

Closed cweerasooriya closed 1 year ago

cweerasooriya commented 4 years ago
Versions

Please specify real version numbers or git SHAs, not just "Latest" since that changes fairly regularly.

| Sarama  | Kafka | Go     |
|---------|-------|--------|
| v1.24.1 | 2.1.0 | 1.13.1 |
Configuration

What configuration values are you using for Sarama and Kafka?

    saramaConfig := sarama.NewConfig()
    saramaConfig.Version = sarama.V2_1_0_0
    saramaConfig.Net.TLS.Enable = true
    saramaConfig.Net.TLS.Config = tls // a *tls.Config built elsewhere

    // The idempotent producer requires a single in-flight request per broker.
    saramaConfig.Net.MaxOpenRequests = 1
    saramaConfig.Net.KeepAlive = 30 * time.Second

    saramaConfig.Producer.Partitioner = sarama.NewReferenceHashPartitioner
    saramaConfig.Producer.Compression = sarama.CompressionSnappy
    // The idempotent producer also requires acks from all in-sync replicas.
    saramaConfig.Producer.RequiredAcks = sarama.WaitForAll

    saramaConfig.Producer.Idempotent = true

    saramaConfig.Producer.Return.Successes = true
    saramaConfig.Producer.Return.Errors = true

    saramaConfig.Producer.Retry.Max = 300
    saramaConfig.Producer.Retry.Backoff = 100 * time.Millisecond

    saramaConfig.Producer.Flush.Bytes = 100000
    saramaConfig.Producer.Flush.Frequency = 100 * time.Millisecond

    // Let the caller override any of the defaults above.
    if config.SaramaConfigCallback != nil {
        saramaConfig = config.SaramaConfigCallback(saramaConfig)
    }
Logs

When filing an issue please provide logs from Sarama and Kafka if at all possible. You can set sarama.Logger to a log.Logger to capture Sarama debug output.

```
kafka server: The broker could not locate the producer metadata associated with the Producer ID.
```

Problem Description

We are publishing to a low-traffic topic, and our producer encountered this error yesterday. According to https://issues.apache.org/jira/browse/KAFKA-7190, this error can occur on low-traffic topics, and the producer should retry when it sees it.

We observed that Sarama does not retry on this error, which causes the producer to fail.

twmb commented 4 years ago

Unless I'm mistaken, KIP-360 (mentioned in that ticket) implies that it's fundamentally unsafe to retry on UnknownProducerID. UnknownProducerID was originally intended to allow clients to recover sending records, but the way to retry when a client sees UnknownProducerID is for the client to reset the sequence numbers of records. This effectively makes the client act as if it is publishing to the partition anew. KIP-360 explains that this is fundamentally unsafe and can lead to mistaken duplicates. The key line: "For the idempotent producer, the user can choose to fail or they can continue (with the possibility of duplication or reordering). If the user continues, the epoch will be bumped locally and the sequence number will be reset."
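To make the duplication risk concrete, here is a minimal toy model in Go. It is only a sketch under loud assumptions: `partitionLog`, `appendRecord`, and `expire` are hypothetical names, the model tracks nothing but the last sequence number per producer ID, and it is not how a Kafka broker or Sarama is actually implemented. It shows a record being accepted, the producer state disappearing, and the same record being accepted a second time once the client resets its sequence number to 0.

```go
package main

import "fmt"

// partitionLog is a hypothetical, heavily simplified stand-in for one
// partition on a broker. It only models per-producer sequence tracking.
type partitionLog struct {
	// accepted holds every record the toy broker ever accepted, i.e. what a
	// consumer tailing the partition would have seen.
	accepted []string
	// lastSeq maps a producer ID to the last sequence number accepted from it.
	lastSeq map[int64]int32
}

func newPartitionLog() *partitionLog {
	return &partitionLog{lastSeq: make(map[int64]int32)}
}

// expire forgets a producer, modelling the broker dropping its producer
// metadata (the situation behind UnknownProducerID).
func (p *partitionLog) expire(pid int64) { delete(p.lastSeq, pid) }

// appendRecord returns "appended", "duplicate", or "unknown producer id".
func (p *partitionLog) appendRecord(pid int64, seq int32, value string) string {
	last, known := p.lastSeq[pid]
	switch {
	case !known && seq != 0:
		// No state for this producer, yet it claims to be mid-stream.
		return "unknown producer id"
	case known && seq <= last:
		// Already seen: this is the deduplication that idempotence relies on.
		return "duplicate"
	}
	p.accepted = append(p.accepted, value)
	p.lastSeq[pid] = seq
	return "appended"
}

func main() {
	part := newPartitionLog()
	const pid = 42

	fmt.Println(part.appendRecord(pid, 0, "a")) // appended
	fmt.Println(part.appendRecord(pid, 1, "b")) // appended; assume the ack is lost
	part.expire(pid)                            // broker forgets the producer
	fmt.Println(part.appendRecord(pid, 1, "b")) // unknown producer id
	fmt.Println(part.appendRecord(pid, 0, "b")) // appended again after the reset
	fmt.Println(part.accepted)                  // [a b b]
}
```

In this model the second "b" gets through because the reset discards the only information the broker could have used to deduplicate it.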

I think a fix for this in Sarama would be a config knob where users can opt in to unsafe recovery (this is similar to the path I chose on my own client, where instead users need to opt out of automatically continuing on potential data loss).
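As a rough illustration of what such an opt-in knob could look like, here is a small Go sketch. Everything in it is an assumption made for illustration: `allowUnsafeRecovery` and `shouldResetSequence` are hypothetical names, not Sarama options, and `errUnknownProducerID` is a local stand-in for `sarama.ErrUnknownProducerID` so that the example has no external dependency.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnknownProducerID is a local stand-in for sarama.ErrUnknownProducerID.
var errUnknownProducerID = errors.New("unknown producer id")

// shouldResetSequence reports whether a producer should reset its sequence
// numbers and keep sending after err. allowUnsafeRecovery is a hypothetical
// opt-in flag: when it is false the producer fails instead, which is the
// safe default because a reset may duplicate or reorder records.
func shouldResetSequence(err error, allowUnsafeRecovery bool) bool {
	return allowUnsafeRecovery && errors.Is(err, errUnknownProducerID)
}

func main() {
	wrapped := fmt.Errorf("produce failed: %w", errUnknownProducerID)

	fmt.Println(shouldResetSequence(wrapped, false))                // false
	fmt.Println(shouldResetSequence(wrapped, true))                 // true
	fmt.Println(shouldResetSequence(errors.New("other"), true))     // false
}
```

Defaulting the flag to false keeps the current fail-fast behaviour, so only users who have accepted the risk of duplicates would see the producer continue.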

varun06 commented 4 years ago

I agree with @twmb above. It is definitely unsafe to retry on UnknownProducerID.

If we feel the fix is to provide a config option that lets users acknowledge the consequences and keep going, let's discuss that further. Pinging @d1egoaz @bai

ghost commented 3 years ago

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the master branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.