Closed by krasin-ga 1 year ago
The offset reset was triggered by RD_KAFKA_RESP_ERR__RESOLVE, which corresponds to a host resolution failure during offset validation:
```c
/* Permanent error */
rd_kafka_offset_reset(
    rktp, rd_kafka_broker_id(rkb),
    RD_KAFKA_FETCH_POS(RD_KAFKA_OFFSET_INVALID,
                       rktp->rktp_leader_epoch),
    RD_KAFKA_RESP_ERR__LOG_TRUNCATION,
    "Unable to validate offset and epoch: %s",
    rd_kafka_err2str(err));
```
We need to remove this reset and retry here, even in the case of a permanent error, as the Java client does.
Description
We were testing a network outage scenario in which one of three data centers became unavailable, and we noticed strange fluctuations and resets in committed offsets after the data center came back online. I've observed some anomalies in the logs that might be related:
- Temporary errors in host resolution that result in the resetting of offsets
- Suspicious updates of committed offsets
```
2023-09-08T00:13:46.246Z  // Started at correct offset
2023-09-08T00:13:50.051Z  // Race condition? Committed offset and leader epoch for partition 0 is from partition 7 (see log below)
2023-09-08T00:13:50.058Z
2023-09-08T00:13:55.056Z  // Back to normal
```

Here is the graph displaying the committed offsets by partition for that consumer group:

[graph: committed offsets by partition]
Please note that on our test bench the probability of encountering a race condition increases because the Kubernetes pods running the consumer are constantly being throttled.
Checklist
Please provide the following information:
- [ ] logs (with debug=.. as necessary) from librdkafka