Description

We were testing a network outage scenario in which one of three data centers became unavailable. After the data center came back online, we noticed strange fluctuations and resets in committed offsets. I've observed some anomalies in the logs that might be related.

Temporary errors in host resolution that result in the resetting of offsets

[thrd:main]: test-topic [2]: offset reset (at offset INVALID (leader epoch 6059), broker 1014) to offset BEGINNING (leader epoch -1): Unable to validate offset and epoch: Local: Host resolution failure: Local: Partition log truncation detected

[thrd:main]: test-topic [19]: offset reset (at offset BEGINNING (leader epoch -1), broker 1014) to offset BEGINNING (leader epoch -1): failed to query logical offset: Local: Host resolution failure
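To make it easier to spot such resets in a larger log, here is a minimal Python sketch that extracts "offset reset" lines and keeps only those whose reason mentions a host resolution failure. The regular expression is inferred solely from the two log lines above, and `resets_after_resolution_failure` is a hypothetical helper name, not part of librdkafka; other librdkafka versions may format these messages differently.

```python
import re

# Pattern inferred from the two log lines quoted above (an assumption,
# not an official librdkafka log format specification).
RESET_RE = re.compile(
    r"(?P<topic>\S+) \[(?P<partition>\d+)\]: offset reset "
    r"\(at offset (?P<from_offset>\S+) \(leader epoch (?P<from_epoch>-?\d+)\), "
    r"broker (?P<broker>\d+)\) "
    r"to offset (?P<to_offset>\S+) \(leader epoch (?P<to_epoch>-?\d+)\): "
    r"(?P<reason>.*)"
)


def resets_after_resolution_failure(lines):
    """Return one dict per offset reset whose reason mentions a host resolution failure."""
    found = []
    for line in lines:
        match = RESET_RE.search(line)
        if match and "Host resolution failure" in match["reason"]:
            found.append({
                "topic": match["topic"],
                "partition": int(match["partition"]),
                "from_offset": match["from_offset"],
                "to_offset": match["to_offset"],
                "reason": match["reason"],
            })
    return found
```

Feeding the two lines above into this helper should report partitions 2 and 19, both reset to `BEGINNING`.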

Suspicious updates of committed offsets

2023-09-08T00:13:46.246Z // Started at correct offset

Consumer in the group "testbench-1000": "[thrd:main]: Partition test-topic [0] start fetching at offset 57842809 (leader epoch 7007)" Code: "FETCH"; SysLevel: Debug;

2023-09-08T00:13:50.051Z // Race condition? The committed offset and leader epoch reported for partition 0 are those of partition 7 (see the log below)

Consumer in the group "testbench-1000": "[thrd:main]: Topic test-topic [0]: stored offset INVALID (leader epoch -1), committed offset 55171745 (leader epoch 6476): not including in commit" Code: "OFFSET"; SysLevel: Debug;

2023-09-08T00:13:50.058Z

Consumer in the group "testbench-1000": "[thrd:main]: Topic test-topic [7]: stored offset 55181424 (leader epoch 6476), committed offset 55171745 (leader epoch 6476): setting stored offset 55181424 (leader epoch 6476) for commit" Code: "OFFSET"; SysLevel: Debug;

2023-09-08T00:13:55.056Z // Back to normal

Consumer in the group "testbench-1000": "[thrd:main]: Topic test-topic [0]: stored offset 57842809 (leader epoch 7007), committed offset 57842809 (leader epoch 7007): not including in commit" Code: "OFFSET";
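The mix-up above (partition 0 briefly reporting the committed offset and leader epoch of partition 7) can be searched for mechanically. Below is a minimal Python sketch that scans `OFFSET` debug lines and reports any committed offset and leader epoch pair that appears under more than one partition of the same topic. The regular expression is inferred only from the log lines quoted in this report, and `shared_committed_offsets` is a hypothetical helper name. Two partitions could in principle hold the same committed offset legitimately, so a hit is a hint rather than proof.

```python
import re
from collections import defaultdict

# Pattern inferred from the OFFSET debug lines quoted above (an assumption).
COMMITTED_RE = re.compile(
    r"Topic (?P<topic>\S+) \[(?P<partition>\d+)\]: .*?"
    r"committed offset (?P<offset>\d+) \(leader epoch (?P<epoch>-?\d+)\)"
)


def shared_committed_offsets(lines):
    """Map (topic, offset, epoch) to the sorted partitions that reported it,
    keeping only pairs reported by more than one partition."""
    seen = defaultdict(set)
    for line in lines:
        match = COMMITTED_RE.search(line)
        if match is None:
            continue  # not an OFFSET line, or the committed offset is not numeric
        key = (match["topic"], int(match["offset"]), int(match["epoch"]))
        seen[key].add(int(match["partition"]))
    return {key: sorted(parts) for key, parts in seen.items() if len(parts) > 1}
```

Run against the three `OFFSET` lines above, this should flag offset 55171745 (leader epoch 6476) as reported by both partition 0 and partition 7.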

Here is the graph displaying the committed offsets by partitions for that consumer group:

Please note that a race condition is more likely to occur on our test bench, because the Kubernetes pods running the consumer are constantly being throttled.

Checklist

Please provide the following information:

[x] librdkafka version (release number or git tag): v2.2.0
[x] Apache Kafka version: v2.7.2
[x] librdkafka client configuration:
```
auto.offset.reset: earliest
```
[x] Operating system: Debian 11.7
[x] Provide logs (with debug=.. as necessary) from librdkafka
[x] Provide broker log excerpts: N/A
[x] Critical issue

confluentinc / librdkafka

Fluctuations and resets in committed offsets due to temporal network outage #4427
