CodisLabs / codis

Proxy based Redis cluster solution supporting pipeline and scaling dynamically
MIT License

Proxy dies after running for a while #429

Closed sorheart closed 9 years ago

sorheart commented 9 years ago

I have been testing codis in a test environment. No load was put on the proxy, but after running for about a week the proxy died. The log is as follows:

2015/09/10 16:32:12 topology.go:150: [WARN] topo event {Type:EventNodeDeleted State:StateSyncConnected Path:/zk/codis/db_test/proxy/proxy_1 Err:}
2015/09/10 16:32:12 topology.go:158: [WARN] {Type:EventNodeDeleted State:StateSyncConnected Path:/zk/codis/db_test/proxy/proxy_1 Err:}
2015/09/10 16:32:12 proxy.go:426: [INFO] got event proxy_1, {EventNodeDeleted StateSyncConnected /zk/codis/db_test/proxy/proxy_1 }, lastActionSeq -1
2015/09/10 16:32:12 proxy.go:358: [PANIC] get proxy info failed: proxy_1
[error]: zk: connection closed
5 /usr/local/codis/src/github.com/wandoulabs/codis/pkg/models/proxy.go:212 github.com/wandoulabs/codis/pkg/models.GetProxyInfo
4 /usr/local/codis/src/github.com/wandoulabs/codis/pkg/proxy/topology.go:105 github.com/wandoulabs/codis/pkg/proxy.(*Topology).GetProxyInfo
3 /usr/local/codis/src/github.com/wandoulabs/codis/pkg/proxy/proxy.go:356 github.com/wandoulabs/codis/pkg/proxy.(*Server).processAction
2 /usr/local/codis/src/github.com/wandoulabs/codis/pkg/proxy/proxy.go:438 github.com/wandoulabs/codis/pkg/proxy.(*Server).loopEvents
1 /usr/local/codis/src/github.com/wandoulabs/codis/pkg/proxy/proxy.go:105 github.com/wandoulabs/codis/pkg/proxy.(*Server).serve
0 /usr/local/codis/src/github.com/wandoulabs/codis/pkg/proxy/proxy.go:82 github.com/wandoulabs/codis/pkg/proxy.func·002
... ...
[stack]:
3 /usr/local/codis/src/github.com/wandoulabs/codis/pkg/proxy/proxy.go:358 github.com/wandoulabs/codis/pkg/proxy.(*Server).processAction

At first I suspected that zookeeper had gone down, but zkServer status showed zookeeper running normally. I then tried to restart the proxy, which failed with the following log:

2015/09/14 15:49:38 proxy.go:173: [PANIC] create fence node failed
[error]: zk: node already exists
[stack]:
2 /usr/local/codis/src/github.com/wandoulabs/codis/pkg/proxy/proxy.go:173 github.com/wandoulabs/codis/pkg/proxy.(*Server).register
1 /usr/local/codis/src/github.com/wandoulabs/codis/pkg/proxy/proxy.go:77 github.com/wandoulabs/codis/pkg/proxy.New
0 /usr/local/codis/src/github.com/wandoulabs/codis/cmd/proxy/main.go:183 main.main

How should these situations be handled?

yangzhe1991 commented 9 years ago

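Does "connection closed" mean the connection to zk was dropped? For the restart failure, see https://github.com/wandoulabs/codis/blob/master/doc/FAQ_zh.md#zk-node-already-exists and delete the node in zk. Alternatively, upgrade to the latest version on the master branch.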

sorheart commented 9 years ago

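Yes, my impression is also that the connection to zk was lost. The odd part is that zookeeper itself was running without problems. Is there a way to confirm what caused the disconnection from zk?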

sorheart commented 9 years ago

I went through the zk logs and found the following:

2015-08-28 20:13:59,981 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14f7426a97e0001, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2015-08-28 20:13:59,983 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /168.168.207.50:63450 which had sessionid 0x14f7426a97e0001

I am not sure whether this helps.

yangzhe1991 commented 9 years ago

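Not sure. The long-lived connection may have been cut off from outside, or there may be some other cause. The proxy in the current version depends fairly heavily on the stability of its long-lived connection to zk.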

sorheart commented 9 years ago

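So as I understand it, an unstable zk connection makes some codis modules unavailable. Could the fault tolerance of this part of the code be improved? Or could an alerting feature be added, so that maintainers are notified by SMS as soon as this happens and can step in promptly? Or is there already an HA solution for the proxy, so that a single proxy going down does not affect the whole system?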

yangzhe1991 commented 9 years ago

The proxy is registered in zk to begin with, and a client can watch zk to update its list of available proxies automatically. See https://github.com/wandoulabs/codis/blob/master/doc/tutorial_zh.md#ha
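To make the idea of a client tracking the registered proxies concrete, here is a minimal Go sketch. It is an illustration only, not the client described in the linked tutorial: the zk watch is abstracted as a channel of address snapshots, and `ProxyPool`, `Update`, `Pick`, `Watch`, and the addresses are all hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// ProxyPool (hypothetical) holds the currently registered proxy addresses
// and hands them out round-robin.
type ProxyPool struct {
	mu    sync.Mutex
	addrs []string
	next  int
}

// Update replaces the address list with a fresh snapshot, for example
// after a watch on the proxy directory in zk has fired.
func (p *ProxyPool) Update(addrs []string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.addrs = append([]string(nil), addrs...) // copy so callers cannot mutate it
	p.next = 0
}

// Pick returns the next address in round-robin order, or false if no
// proxy is currently registered.
func (p *ProxyPool) Pick() (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.addrs) == 0 {
		return "", false
	}
	addr := p.addrs[p.next%len(p.addrs)]
	p.next++
	return addr, true
}

// Watch applies every snapshot received on updates until the channel is
// closed. In a real client the snapshots would come from a zk children watch.
func (p *ProxyPool) Watch(updates <-chan []string) {
	for addrs := range updates {
		p.Update(addrs)
	}
}

func main() {
	pool := &ProxyPool{}
	updates := make(chan []string)
	done := make(chan struct{})
	go func() {
		pool.Watch(updates)
		close(done)
	}()

	updates <- []string{"10.0.0.1:19000", "10.0.0.2:19000"} // made-up addresses
	updates <- []string{"10.0.0.2:19000"}                   // one proxy node was deleted
	close(updates)
	<-done

	addr, ok := pool.Pick()
	fmt.Println(addr, ok)
}
```

With this shape, a proxy that dies simply drops out of the next snapshot, and the client stops routing requests to it without any manual step.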

yangzhe1991 commented 9 years ago

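As for the proxy dying because zk is unstable: the earlier implementation took a simple, crude, easy-to-implement approach to guarantee data consistency. Later we will redesign the whole system so that it does not depend on the stability of the long-lived connection, or even on zk at all, which should improve stability. In the short term, if time permits, we may harden the proxy in the 2.0 line so that it does not exit so easily.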
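One common way to keep a process from exiting on a transient connection error is to retry the failing call with a growing delay before giving up. The Go sketch below shows that general pattern only. It is not how codis does or will handle this, and `retry` and `errConnClosed` are hypothetical names, with the error text borrowed from the log in this thread as a stand-in.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConnClosed is a stand-in for the "zk: connection closed" error seen
// in the proxy log; it is defined here only for the example.
var errConnClosed = errors.New("zk: connection closed")

// retry calls op up to attempts times (attempts should be at least 1),
// doubling the delay after each failure. It returns nil as soon as op
// succeeds, or the last error wrapped once all attempts are used up.
func retry(op func() error, attempts int, base time.Duration) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated lookup that fails twice before the connection recovers.
	err := retry(func() error {
		calls++
		if calls < 3 {
			return errConnClosed
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Println(calls, err)
}
```

Whether retrying is safe depends on the consistency guarantees the caller needs, which is the trade-off mentioned above; the sketch says nothing about that part.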

yangzhe1991 commented 9 years ago

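Also, starting with 2.0.5 the proxy brings itself online automatically and automatically handles the case where its zk node already exists. So it is in fact already possible to use a script that restarts the proxy automatically to deal with the proxy exiting for whatever reason.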
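A restart script of this kind can be very small. The following Go program is one possible sketch, assuming the proxy command line is passed as arguments; `supervise`, `runCommand`, the restart limit of 10, and the 5-second delay are choices made for this example, not anything codis ships.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// errNoCommand is returned when no command line was supplied.
var errNoCommand = errors.New("no command given")

// runCommand starts argv as a child process and blocks until it exits.
// A non-zero exit status is reported as an error.
func runCommand(argv []string) error {
	if len(argv) == 0 {
		return errNoCommand
	}
	cmd := exec.Command(argv[0], argv[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// supervise calls start, and calls it again after delay whenever it
// returns an error, allowing at most maxRestarts restarts. A nil return
// from start is treated as a clean shutdown. The result is the number of
// times start was invoked.
func supervise(start func() error, maxRestarts int, delay time.Duration) int {
	runs := 0
	for {
		err := start()
		runs++
		if err == nil || runs > maxRestarts {
			return runs
		}
		fmt.Fprintf(os.Stderr, "process exited: %v; restarting in %s\n", err, delay)
		time.Sleep(delay)
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: supervise <proxy command> [args...]")
		return
	}
	runs := supervise(func() error { return runCommand(os.Args[1:]) }, 10, 5*time.Second)
	fmt.Fprintf(os.Stderr, "supervisor stopped after %d runs\n", runs)
}
```

The restart cap keeps a proxy that fails immediately on every start from looping forever. A process manager already present on the host can play the same role.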