IBM / sarama

Sarama is a Go library for Apache Kafka.
MIT License
11.42k stars 1.75k forks source link

Sarama has stuck on the batch with zero records length. #2053

Open pachmu opened 2 years ago

pachmu commented 2 years ago
Versions

Please specify real version numbers or git SHAs, not just "Latest" since that changes fairly regularly.

Sarama Kafka Go
Version 1.29.1 AWS Managed Kafka 1.17
Configuration

What configuration values are you using for Sarama and Kafka?

                 cfg.Net.MaxOpenRequests = 1
        // consume all messages
        cfg.Consumer.IsolationLevel = sarama.ReadUncommitted
        // require producer idempotency
        cfg.Producer.Idempotent = true
        // require consistency on Kafka side before considering the message produced
        cfg.Producer.RequiredAcks = sarama.WaitForAll
        // specify network keep-alive to reduce 'broken pipe' errors on non-KMS environments
        cfg.Net.KeepAlive = time.Second * 30
        // increase processing time from 1s (default) to 5s
        cfg.Consumer.MaxProcessingTime = time.Second * 5
Logs

When filing an issue please provide logs from Sarama and Kafka if at all possible. You can set sarama.Logger to a log.Logger to capture Sarama debug output.

logs: CLICK ME

``` 2021/10/22 10:38:26.384689 lvl=INF component=kafka records batch &{5805968 49 2 snappy 0 false false 0 2021-09-30 11:45:34.617 +0200 CEST 2021-09-30 11:45:34.617 +0200 CEST 82609 0 1563 [0xc00030e150] false false [] 1326} 2021/10/22 10:38:26.384705 lvl=INF component=kafka records batch &{5806083 49 2 snappy 0 false false 0 2021-09-30 12:00:01.188 +0200 CEST 2021-09-30 12:00:01.188 +0200 CEST 82609 0 1678 [0xc00030e1c0] false false [] 733} 2021/10/22 10:38:26.384728 lvl=INF component=kafka records batch &{5806313 49 2 snappy 0 false false 0 2021-09-30 12:29:55.542 +0200 CEST 2021-09-30 12:29:55.542 +0200 CEST 82609 0 1908 [0xc00030e230] false false [] 10546} 2021/10/22 10:38:26.384751 lvl=INF component=kafka records batch &{5806399 49 2 snappy 0 false false 0 2021-09-30 12:42:07.755 +0200 CEST 2021-09-30 12:42:07.755 +0200 CEST 82609 0 1994 [0xc00030e2a0] false false [] 1130} 2021/10/22 10:38:26.384772 lvl=INF component=kafka records batch &{5806689 49 2 snappy 0 false false 0 2021-09-30 13:25:25.119 +0200 CEST 2021-09-30 13:25:25.119 +0200 CEST 82609 0 2284 [0xc00030e380] false false [] 2934} 2021/10/22 10:38:26.384790 lvl=INF component=kafka records batch &{5807057 49 2 snappy 0 false false 0 2021-09-30 14:24:12.229 +0200 CEST 2021-09-30 14:24:12.229 +0200 CEST 82609 0 2652 [0xc00030e3f0] false false [] 3124} 2021/10/22 10:38:26.384807 lvl=INF component=kafka records batch &{5807106 49 2 snappy 0 false false 0 2021-09-30 14:31:33.337 +0200 CEST 2021-09-30 14:31:33.337 +0200 CEST 82609 0 2701 [0xc00030e460] false false [] 2335} 2021/10/22 10:38:26.384824 lvl=INF component=kafka records batch &{5807150 49 2 snappy 0 false false 0 2021-09-30 14:38:49.18 +0200 CEST 2021-09-30 14:38:49.18 +0200 CEST 82609 0 2745 [0xc00030e4d0] false false [] 2917} 2021/10/22 10:38:26.384858 lvl=INF component=kafka records batch &{5807374 49 2 snappy 0 false false 0 2021-09-30 15:12:07.587 +0200 CEST 2021-09-30 15:12:07.587 +0200 CEST 82609 0 2969 [0xc00030e540] false false [] 5744} 2021/10/22 10:38:26.384933 lvl=INF component=kafka records batch &{5807388 49 2 snappy 0 false false 0 2021-09-30 15:14:10.358 +0200 CEST 2021-09-30 15:14:10.358 +0200 CEST 82609 0 2983 [0xc00030e5b0] false false [] 31539} 2021/10/22 10:38:26.384958 lvl=INF component=kafka records batch &{5807391 49 2 snappy 0 false false 0 2021-09-30 15:14:44.664 +0200 CEST 2021-09-30 15:14:44.664 +0200 CEST 82609 0 2986 [0xc00030e620] false false [] 2135} 2021/10/22 10:38:26.385027 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.418552 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.450115 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.481958 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.522805 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.557499 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.593690 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.625917 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.657295 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.775601 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.806612 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.840360 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} 2021/10/22 10:38:26.871158 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0} ```

Problem Description

I'm using Goka (Sarama based library) and have some problems with reading from the one partition of 4. The consumer has stuck on the one particular batch and rereads it again and again.

Here is some info: This is batch struct:

type RecordBatch struct {
    FirstOffset           int64
    PartitionLeaderEpoch  int32
    Version               int8
    Codec                 CompressionCodec
    CompressionLevel      int
    Control               bool
    LogAppendTime         bool
    LastOffsetDelta       int32
    FirstTimestamp        time.Time
    MaxTimestamp          time.Time
    ProducerID            int64
    ProducerEpoch         int16
    FirstSequence         int32
    Records               []*Record
    PartialTrailingRecord bool
    IsTransactional       bool

    compressedRecords []byte
    recordsLen        int // uncompressed records size
}

I added some logs for all decoded batches and firstly I see this:

2021/10/22 10:38:26.384671 lvl=INF component=kafka records batch &{5805856 49 2 snappy 0 false false 0 2021-09-30 11:29:46.662 +0200 CEST 2021-09-30 11:29:46.662 +0200 CEST 82609 0 1451 [0xc00030e0e0] false false [] 1735}
2021/10/22 10:38:26.384689 lvl=INF component=kafka records batch &{5805968 49 2 snappy 0 false false 0 2021-09-30 11:45:34.617 +0200 CEST 2021-09-30 11:45:34.617 +0200 CEST 82609 0 1563 [0xc00030e150] false false [] 1326}
2021/10/22 10:38:26.384705 lvl=INF component=kafka records batch &{5806083 49 2 snappy 0 false false 0 2021-09-30 12:00:01.188 +0200 CEST 2021-09-30 12:00:01.188 +0200 CEST 82609 0 1678 [0xc00030e1c0] false false [] 733}
2021/10/22 10:38:26.384728 lvl=INF component=kafka records batch &{5806313 49 2 snappy 0 false false 0 2021-09-30 12:29:55.542 +0200 CEST 2021-09-30 12:29:55.542 +0200 CEST 82609 0 1908 [0xc00030e230] false false [] 10546}
2021/10/22 10:38:26.384751 lvl=INF component=kafka records batch &{5806399 49 2 snappy 0 false false 0 2021-09-30 12:42:07.755 +0200 CEST 2021-09-30 12:42:07.755 +0200 CEST 82609 0 1994 [0xc00030e2a0] false false [] 1130}
2021/10/22 10:38:26.384772 lvl=INF component=kafka records batch &{5806689 49 2 snappy 0 false false 0 2021-09-30 13:25:25.119 +0200 CEST 2021-09-30 13:25:25.119 +0200 CEST 82609 0 2284 [0xc00030e380] false false [] 2934}
2021/10/22 10:38:26.384790 lvl=INF component=kafka records batch &{5807057 49 2 snappy 0 false false 0 2021-09-30 14:24:12.229 +0200 CEST 2021-09-30 14:24:12.229 +0200 CEST 82609 0 2652 [0xc00030e3f0] false false [] 3124}
2021/10/22 10:38:26.384807 lvl=INF component=kafka records batch &{5807106 49 2 snappy 0 false false 0 2021-09-30 14:31:33.337 +0200 CEST 2021-09-30 14:31:33.337 +0200 CEST 82609 0 2701 [0xc00030e460] false false [] 2335}
2021/10/22 10:38:26.384824 lvl=INF component=kafka records batch &{5807150 49 2 snappy 0 false false 0 2021-09-30 14:38:49.18 +0200 CEST 2021-09-30 14:38:49.18 +0200 CEST 82609 0 2745 [0xc00030e4d0] false false [] 2917}
2021/10/22 10:38:26.384858 lvl=INF component=kafka records batch &{5807374 49 2 snappy 0 false false 0 2021-09-30 15:12:07.587 +0200 CEST 2021-09-30 15:12:07.587 +0200 CEST 82609 0 2969 [0xc00030e540] false false [] 5744}
2021/10/22 10:38:26.384933 lvl=INF component=kafka records batch &{5807388 49 2 snappy 0 false false 0 2021-09-30 15:14:10.358 +0200 CEST 2021-09-30 15:14:10.358 +0200 CEST 82609 0 2983 [0xc00030e5b0] false false [] 31539}
2021/10/22 10:38:26.384958 lvl=INF component=kafka records batch &{5807391 49 2 snappy 0 false false 0 2021-09-30 15:14:44.664 +0200 CEST 2021-09-30 15:14:44.664 +0200 CEST 82609 0 2986 [0xc00030e620] false false [] 2135}

Looks ok. But then I see:

2021/10/22 10:38:26.385027 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.418552 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.450115 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.481958 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.522805 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.557499 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.593690 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.625917 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.657295 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.775601 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.806612 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.840360 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.871158 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.901792 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}
2021/10/22 10:38:26.934249 lvl=INF component=kafka records batch &{5807565 49 2 none 0 false false 0 0001-01-01 00:00:00 +0000 UTC 2021-09-30 15:43:16.281 +0200 CEST 82609 0 3160 [] false false [] 0}

and it keeps going indefinitely. I dug into the code and figured out that there is no exit from such situation when the records len is 0. Sarama just reads this batch again and again. A high watermark was not reached. We use AWS Managed Kafka in this case. Please help. We are firing.

grongor commented 2 years ago

Hello! I just hit the same error as you, but your solution doesn't seem to work as the app is stuck in a loop and the offset isn't changing... Any idea how to fix this? :)

2021-12-31T10:32:39.030 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.031 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.031 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.032 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.033 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.034 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.035 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.036 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.037 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.038 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.039 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.040 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.041 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.042 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.042 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.043 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.044 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.045 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.046 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.046 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.047 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.048 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.049 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.050 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.051 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.051 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.052 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.053 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.054 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.055 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.055 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.056 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.057 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.058 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.059 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.060 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.061 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.062 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.062 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.063 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.064 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.065 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
2021-12-31T10:32:39.066 consumer/broker/1004 received batch with zero records but high watermark was not reached, topic ***, partition 34, offset 15138131
github-actions[bot] commented 1 year ago

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions[bot] commented 10 months ago

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

kgatv2 commented 10 months ago

I experienced the same issue with Sarama 1.42.1. Apparently we had some issues with one of our Kafka nodes in the clusters that triggered this behaviour in our Sarama consumers. We have Kafka consumers in both Scala and Go, and only the Sarama client got into this endless loop, the Scala client handled the issue just fine. We are not sure what the issue with Kafka itself was yet, but apparently there is an edge case here with Sarama that make it enter into an endless loop with the error above and stops consuming messages.

github-actions[bot] commented 5 months ago

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.