Open nevillus opened 4 months ago
With the setup described above, we have also recently encountered some instances of "The broker received an out of order sequence number" errors. These occurrences are very rare, but we are wondering whether they indicate an issue with how the messages are being pushed, leading to them being ordered incorrectly.
So this appears to be an ordering issue / race condition between new batches being produced and batches being retried in the idempotent producer. This shouldn't occur with config.Net.MaxOpenRequests = 1, but we have had other reports (e.g., https://github.com/IBM/sarama/issues/2619) suggesting that when request pipelining was introduced, it inadvertently changed the behaviour of the producer such that it lost some of its ordering guarantees.
Thank you, @dnwe. Is there currently someone addressing this issue? If not, we're willing to assist and contribute to a solution. Could you provide some guidance on where we might start or what to look into?
I was able to reproduce with a simple async producer that sets:
config.Net.MaxOpenRequests = 1
config.Producer.Idempotent = true
In my case, the trigger that causes the "assertion failed: message out of sequence added to a batch" message is to interrupt network connectivity between the Sarama client and brokers (connecting to / disconnecting from a VPN).
I don't see the same problem if I switch to using the sync producer in a loop (keeping the same configuration). I suspect this is because my test program blocks until Kafka acks each message, which effectively prevents more than one request from being in flight at any time.
Description
We are encountering an error (once every few weeks) while using the async producer in our Kafka setup. The error message encountered is as follows:
This error seems to originate from the following line in the Sarama library: produce_set.go#L89
The occurrence of this error is sporadic, and we are struggling to understand the underlying cause or identify any corrective measures. It appears that, occasionally, messages are being added to the batch in an incorrect order.
We are seeking insights or suggestions on what might be triggering this error. Our investigations have considered network issues as a potential cause; however, we have not found any corresponding logs or indicators to substantiate this theory when the error occurs.
Versions
Configuration
Logs
We are facing the error detailed at the following location: produce_set.go#L89
Additional Context
All messages are dispatched using an asynchronous producer, configured with a high retry count to ensure message delivery even in the event of transient Kafka broker failures. Despite this, we observe that occasionally a message fails to be added to the batch, which means it never becomes eligible for any of Sarama's retry mechanisms.