IBM / sarama

Sarama is a Go library for Apache Kafka.
MIT License

Sarama Async Producer Encounters 'Out of Order' Error: what are the reasons? #2803

Open nevillus opened 4 months ago

nevillus commented 4 months ago
Description

We are encountering an error (once every few weeks) while using the async producer in our Kafka setup. The error message encountered is as follows:

assertion failed: message out of sequence added to a batch

This error seems to originate from the following line in the Sarama library: produce_set.go#L89

The occurrence of this error is sporadic, and we are struggling to understand the underlying cause or identify any corrective measures. It appears that, occasionally, messages are being added to the batch in an incorrect order.

We are seeking insights or suggestions on what might be triggering this error. Our investigations have considered network issues as a potential cause; however, we have not found any corresponding logs or indicators to substantiate this theory when the error occurs.

Versions
| Sarama  | Kafka | Go     |
|---------|-------|--------|
| v1.42.1 | 2.6.2 | 1.20.6 |
Configuration
    config := sarama.NewConfig()
    config.Version = version
    config.Consumer.Group.Rebalance.Strategy = sarama.NewBalanceStrategySticky()
    config.Producer.RequiredAcks = sarama.WaitForAll
    config.Producer.Idempotent = true
    config.Net.MaxOpenRequests = 1
    config.Producer.Retry.Max = 100000
    config.Producer.Retry.Backoff = 100 * time.Millisecond
    config.Producer.Return.Successes = true
    config.Producer.Return.Errors = true
    config.Producer.Partitioner = sarama.NewHashPartitioner
Logs

We are facing the error detailed at the following location: produce_set.go#L89

Additional Context

All messages are dispatched using an asynchronous producer, configured with a high retry count to ensure message delivery even in the event of transient Kafka broker failures. Despite this, we observe that occasionally a message fails to be added to the batch, rendering it ineligible for any retry mechanism in Sarama.

nevillus commented 3 months ago

With the setup described above, we have also recently encountered a few instances of the "The broker received an out of order sequence number" error. These occurrences are likewise very rare, but we are wondering whether they indicate a problem with how the messages are being pushed that causes them to be ordered incorrectly.

dnwe commented 3 months ago

So this appears to be an ordering issue / race condition between new batches being produced and batches being retried in the idempotent producer:

https://github.com/IBM/sarama/blob/f21c5125746f9d10fd731dfdff54a494098626d1/async_producer.go#L1144-L1148

This shouldn't occur with config.Net.MaxOpenRequests = 1, but we have had other reports (e.g., https://github.com/IBM/sarama/issues/2619) suggesting that when request pipelining was introduced, it inadvertently changed the behaviour of the producer such that it lost some of its ordering guarantees.

nevillus commented 3 months ago

Thank you, @dnwe. Is there currently someone addressing this issue? If not, we're willing to assist and contribute to a solution. Could you provide some guidance on where we might start or what to look into?

prestona commented 2 months ago

I was able to reproduce with a simple async producer that sets:

    config.Net.MaxOpenRequests = 1
    config.Producer.Idempotent = true

In my case, the trigger for the "assertion failed: message out of sequence added to a batch" error is interrupting network connectivity between the Sarama client and the brokers (connecting to / disconnecting from a VPN).

I don't see the same problem if I switch to using the sync producer in a loop (keeping the same configuration). I suspect this is because my test program will block until Kafka acks each message - effectively preventing the possibility of there being more than one request in flight at any time.
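The suspicion about the sync loop can be illustrated with a small model that uses only the standard library. It is a sketch under stated assumptions, not Sarama code: `flakyLink` and `produceSync` are hypothetical names, and the "network interruption" is simulated by failing the first few attempts for every message.

```go
package main

import "fmt"

// flakyLink is a hypothetical transport that fails the first `failures`
// attempts for each sequence number, then records a successful delivery.
type flakyLink struct {
	failures  int
	attempts  map[int]int
	delivered []int
}

// send reports whether the message was acknowledged on this attempt.
func (l *flakyLink) send(seq int) bool {
	l.attempts[seq]++
	if l.attempts[seq] <= l.failures {
		return false // simulated network interruption
	}
	l.delivered = append(l.delivered, seq)
	return true
}

// produceSync blocks on each message until it is acknowledged, so every
// retry of message n completes before message n+1 is first attempted.
func produceSync(l *flakyLink, n int) {
	for seq := 0; seq < n; seq++ {
		for !l.send(seq) {
			// Retry the same message; nothing newer is in flight.
		}
	}
}

func main() {
	l := &flakyLink{failures: 2, attempts: map[int]int{}}
	produceSync(l, 4)
	fmt.Println(l.delivered) // prints [0 1 2 3]
}
```

Because the loop never starts message n+1 until message n has been acknowledged, a retry can never be overtaken by a newer message in this model, regardless of how many attempts fail.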