childe / healer

golang lib for kafka
22 stars 18 forks source link

slices out of bounds error #2

Closed sundy-li closed 5 years ago

sundy-li commented 5 years ago

Here is the panic log trace


E1205 17:16:26.100794   50137 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1205 17:16:26.143547   50137 simple_consumer.go:246] fetch error:read tcp4 10.68.9.150:63224->10.68.9.3:7560: use of closed network connection
panic: runtime error: slice bounds out of range

goroutine 18179 [running]:
github.com/childe/healer.NewProtocolMetadata(0xc4ffcee320, 0x1e, 0x1e, 0x12)
    /usr/local/gopath/src/github.com/childe/healer/group.go:198 +0x316
github.com/childe/healer.(*GroupConsumer).getTopicPartitionInfo(0xc42061c240)
    /usr/local/gopath/src/github.com/childe/healer/group_consumer.go:92 +0x168
github.com/childe/healer.(*GroupConsumer).sync(0xc42061c240, 0x0, 0x0)
    /usr/local/gopath/src/github.com/childe/healer/group_consumer.go:219 +0x541
github.com/childe/healer.(*GroupConsumer).joinAndSync(0xc42061c240, 0xc26580, 0xc4205b1c38)
    /usr/local/gopath/src/github.com/childe/healer/group_consumer.go:263 +0x279
github.com/childe/healer.(*GroupConsumer).consumeWithoutHeartBeat(0xc42061c240, 0x0, 0xc440127080, 0x0, 0x0, 0x0)
    /usr/local/gopath/src/github.com/childe/healer/group_consumer.go:443 +0xb9
github.com/childe/healer.(*GroupConsumer).Consume.func1(0xc44044e190, 0xc42061c240)
    /usr/local/gopath/src/github.com/childe/healer/group_consumer.go:400 +0x118
created by github.com/childe/healer.(*GroupConsumer).Consume
    /usr/local/gopath/src/github.com/childe/healer/group_consumer.go:393 +0xba
childe commented 5 years ago

提交了 41a2dca4c9400f546062e0d38baacb51698b541b .

是我代码写的不严谨.

我估计是这个原因引起的, a := binary.BigEndian.Uint32(XXX) , 这里的XXX是 []byte{255,255,255,255}, 如果把 a 转成int, 会得到4294967295, 导致接下来的操作panic, 如果转成 int32, 会得到 -1 .

我现在改成了转 int32, 并判断是不是等于 -1

我们之所以没遇到这个问题, 我想是用的库不一样的原因吧. 比如有的kafka client库, 提交 join request的时候, encode 这里的二进制数据, 用的是 []byte{255,255,255,255} . 我这边使用 healer 做 decode 的时候就 panic 了. 官方的库做encode, 应该是用的 []byte{0, 0, 0 , 0} , 所以我这边一直没碰到这个问题.

不管怎么说, 是我代码不严谨. 可能还有别的地方有这个问题.. 我都去测一下. 你们还用什么库做消费? 我想拿来做下测试.

多谢你的 issue !

sundy-li commented 5 years ago

@childe 没明白你说的库是什么意思,客户端就是healer的 group_consumer api消费 (类似你的gohangout用法), 服务端kafka版本是 2.0, 另外就是在rebalance的过程中经常有 send to closed channel 错误,但不会导致panic, 可能是rebalance的过程中,将partition分给其他消费者关闭channel时机不太对

childe commented 5 years ago

@sundy-li 如果更新代码之后, 没有这个panic, 应该就是我上面的提到的问题.
就是你之前用其他的库(比如说以sarama基础的第三方库) 做group consumer消费, 这个group的信息encode成二进制数据存到了Kafka. 我们对一些数据的 encode/decode 方式不一样, 主要是对长度为0的 []byte/string 做编码的时候. 我这边用了 int(0), sarama应该是用的 int(-1). 所以我这边读取这些二进制信息做decode的时候出错了.

我看了一下文档, -1 也许更合理一些 , 代表后面的 []byte为nil .

kafka协议的官方文档这么说的

Variable Length Primitives

bytes, string - These types consist of a signed integer giving a length N followed by N bytes of content. A length of -1 indicates null. string uses an int16 for its size, and bytes uses an int32.

send to closed channel 这个的详细错误栈能不能发我看一下? 谢谢啦~

sundy-li commented 5 years ago

@childe 你好,send to closed channel 这个错误最新版本跑已经没有了

我咨询个问题,我发现rebalance的时间比sarama客户端长

比如,我有30个消费者 消费120个分区的大topic,稳定运行后,我并行重启这30个消费进程,发现rebalance的时间甚至到十几分钟或者半个小时,这方面可以优化吗?

E1207 10:13:03.568039   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:29938->10.68.33.108:7560: use of closed network connection
E1207 10:13:03.778461   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:62642->10.68.33.103:7560: use of closed network connection
E1207 10:13:33.139220   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:13:33.149546   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:24934->10.68.33.109:7560: use of closed network connection
E1207 10:13:33.202594   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:26278->10.68.33.104:7560: use of closed network connection
E1207 10:13:33.239131   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:31344->10.68.33.108:7560: use of closed network connection
E1207 10:13:33.383066   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:64048->10.68.33.103:7560: use of closed network connection
E1207 10:14:03.139777   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:14:03.157484   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:26298->10.68.33.109:7560: use of closed network connection
E1207 10:14:03.182668   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:27636->10.68.33.104:7560: use of closed network connection
E1207 10:14:03.215406   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:32708->10.68.33.108:7560: use of closed network connection
E1207 10:14:03.384824   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:65410->10.68.33.103:7560: use of closed network connection
E1207 10:14:33.144714   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:14:33.167065   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:27670->10.68.33.109:7560: use of closed network connection
E1207 10:14:33.181062   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:29016->10.68.33.104:7560: use of closed network connection
E1207 10:14:33.229054   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:34088->10.68.33.108:7560: use of closed network connection
E1207 10:14:33.355211   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:21254->10.68.33.103:7560: use of closed network connection
E1207 10:15:04.556569   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:15:04.572135   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:28980->10.68.33.109:7560: use of closed network connection
E1207 10:15:04.673974   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:30320->10.68.33.104:7560: use of closed network connection
E1207 10:15:04.734564   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:35396->10.68.33.108:7560: use of closed network connection
E1207 10:15:04.861249   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:22564->10.68.33.103:7560: use of closed network connection
E1207 10:15:36.158639   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:15:36.172482   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:30218->10.68.33.109:7560: use of closed network connection
E1207 10:15:36.307726   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:31558->10.68.33.104:7560: use of closed network connection
E1207 10:15:36.327458   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:36626->10.68.33.108:7560: use of closed network connection
E1207 10:15:36.411182   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:23794->10.68.33.103:7560: use of closed network connection
E1207 10:16:06.139296   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:16:06.164533   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:31534->10.68.33.109:7560: use of closed network connection
E1207 10:16:06.211382   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:32878->10.68.33.104:7560: use of closed network connection
E1207 10:16:06.226074   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:37952->10.68.33.108:7560: use of closed network connection
E1207 10:16:06.347554   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:25120->10.68.33.103:7560: use of closed network connection
E1207 10:16:36.139142   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:16:36.142353   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:32862->10.68.33.109:7560: use of closed network connection
E1207 10:16:36.207516   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:34206->10.68.33.104:7560: use of closed network connection
E1207 10:16:36.252442   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:39272->10.68.33.108:7560: use of closed network connection
E1207 10:16:36.387203   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:26440->10.68.33.103:7560: use of closed network connection
E1207 10:17:06.373766   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:17:06.431022   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:36046->10.68.33.104:7560: use of closed network connection
E1207 10:17:06.448750   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:34706->10.68.33.109:7560: use of closed network connection
E1207 10:17:06.498675   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:41114->10.68.33.108:7560: use of closed network connection
E1207 10:17:06.580894   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:28282->10.68.33.103:7560: use of closed network connection
E1207 10:17:39.139976   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:17:39.151675   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:36086->10.68.33.109:7560: use of closed network connection
E1207 10:17:39.192409   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:37424->10.68.33.104:7560: use of closed network connection
E1207 10:17:39.225551   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:42496->10.68.33.108:7560: use of closed network connection
E1207 10:17:39.352891   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:29662->10.68.33.103:7560: use of closed network connection
E1207 10:18:09.139683   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:18:09.151180   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:37488->10.68.33.109:7560: use of closed network connection
E1207 10:18:09.217101   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:43902->10.68.33.108:7560: use of closed network connection
E1207 10:18:09.293639   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:31074->10.68.33.103:7560: use of closed network connection
E1207 10:18:09.449402   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:38828->10.68.33.104:7560: use of closed network connection
E1207 10:18:39.146254   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:18:39.165343   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:38854->10.68.33.109:7560: use of closed network connection
E1207 10:18:39.221022   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:45260->10.68.33.108:7560: use of closed network connection
E1207 10:18:39.290007   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:32434->10.68.33.103:7560: use of closed network connection
E1207 10:18:39.294491   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:40192->10.68.33.104:7560: use of closed network connection
E1207 10:19:09.141735   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:19:09.146443   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:40134->10.68.33.109:7560: use of closed network connection
E1207 10:19:09.199092   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:41482->10.68.33.104:7560: use of closed network connection
E1207 10:19:09.258881   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:46548->10.68.33.108:7560: use of closed network connection
E1207 10:19:09.381222   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:33720->10.68.33.103:7560: use of closed network connection
E1207 10:19:42.149565   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:19:42.168604   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:41432->10.68.33.109:7560: use of closed network connection
E1207 10:19:42.191795   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:42776->10.68.33.104:7560: use of closed network connection
E1207 10:19:42.251914   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:47842->10.68.33.108:7560: use of closed network connection
E1207 10:19:42.328261   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:35010->10.68.33.103:7560: use of closed network connection
E1207 10:20:12.139289   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:20:12.143717   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:42826->10.68.33.109:7560: use of closed network connection
E1207 10:20:12.272556   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:44166->10.68.33.104:7560: use of closed network connection
E1207 10:20:12.338648   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:49240->10.68.33.108:7560: use of closed network connection
E1207 10:20:12.470534   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:36404->10.68.33.103:7560: use of closed network connection
E1207 10:20:42.170221   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:20:42.178809   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:44176->10.68.33.109:7560: use of closed network connection
E1207 10:20:42.255954   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:50592->10.68.33.108:7560: use of closed network connection
E1207 10:20:42.304489   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:37760->10.68.33.103:7560: use of closed network connection
E1207 10:20:42.458192   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:45516->10.68.33.104:7560: use of closed network connection
E1207 10:21:12.139576   46968 group_consumer.go:397] failed to send heartbeat:The group is rebalancing, so a rejoin is needed.
E1207 10:21:12.139676   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:45530->10.68.33.109:7560: use of closed network connection
E1207 10:21:12.202001   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:46866->10.68.33.104:7560: use of closed network connection
E1207 10:21:12.252845   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:51940->10.68.33.108:7560: use of closed network connection
E1207 10:21:12.367738   46968 simple_consumer.go:246] fetch error:read tcp4 10.68.9.147:39108->10.68.33.103:7560: use of closed network connection
sundy-li commented 5 years ago

image

监控界面也可以看到这个重启过程影响时间挺长

childe commented 5 years ago

rebalance时间太长是个挺麻烦的问题, 这一块我会花时间和sarama, 以及官方的库做对比. 我们也遇到过, 我们是 150 个 partition , 14 brokers. 十一期间以及十一前流量上来的时候, rebalance时间很长, 十一后就非常明显的好转, 所以我以为是 kafka 压力过大引起的. 而且当时我们也用官方的库测试过, 也会在 rebalance 阶段好久.

你的做对比的库是用的哪个?(sarama应该还不支持group consumer吧, 只有依赖他的第三方库) 我继续看一下这个问题.

sundy-li commented 5 years ago

最新版本的sarama 不用依赖其他库也支持 group consumer 了, 我们之前用的是 https://github.com/bsm/sarama-cluster 这个库,一天有3000亿流量走kafka, 但消费者压力很大,想尽可能去优化消费速度

sundy-li commented 5 years ago

https://github.com/Shopify/sarama/pull/1099

childe commented 5 years ago

https://github.com/childe/healer/issues/4 关于rebalance时间太长, 新开了一个 issue , 关掉这个了.

不过多说一句多于优化消费速度, 就我写这个库的时候感觉, 消费速度大家应该都差不多, 因为就是发送 fetch api, 然后流式的去解析.

之前 sarama 应该不是流式解析的, 所以当时早期版本消费速度特别慢. 现在我测试下来, 几乎一样的.

而且消费者压力大, 应该压力也不在消费这一块吧? 而是消费后去处理数据的压力? 我感觉是这样的.