CodisLabs / codis

Proxy based Redis cluster solution supporting pipeline and scaling dynamically
MIT License
13.17k stars 2.69k forks source link

Fix the bug that healthy sentinel displays ERROR on the codis-fe #1730

Open yz1509 opened 3 years ago

yz1509 commented 3 years ago

现象

类似 #1345,查看 dashboard 状态,发现错误位于 InfomasterCommand 解析回包失败处:

错误信息在这两处切换变化

Inked001_LI Inked002_LI

查看 sentinel 客户端列表发现,dashboard 和 sentinel 之间的连接一直未关闭,不符合代码中出现 error 就 close 连接的逻辑:

# age=1247007
id=3499285 addr=127.0.0.227:43720 fd=53 name= age=1247007 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=info

通过抓取 sentinel 收到的 Redis 请求发现显示 ERROR 的 sentinel 意外收到了两次 INFO 命令:

正常运行情况下应该是 INFO 后接 SENTINEL masters

1612775925.970175 [127.0.0.227:43720]  INFO
1612775925.970205 [127.0.0.227:43720]  SENTINEL masters
1612775926.970092 [127.0.0.227:43720]  INFO
1612775927.966160 [127.0.0.227:43720]  INFO
1612775927.966172 [127.0.0.227:43720]  SENTINEL masters
1612775928.978100 [127.0.0.227:43720]  INFO
1612775929.978165 [127.0.0.227:43720]  INFO
1612775929.978183 [127.0.0.227:43720]  SENTINEL masters
1612775930.986163 [127.0.0.227:43720]  INFO
1612775931.978156 [127.0.0.227:43720]  INFO
1612775931.978178 [127.0.0.227:43720]  SENTINEL masters
1612775932.982088 [127.0.0.227:43720]  INFO
1612775933.998090 [127.0.0.227:43720]  INFO
1612775933.998105 [127.0.0.227:43720]  SENTINEL masters
1612775934.998154 [127.0.0.227:43720]  INFO
1612775935.998092 [127.0.0.227:43720]  INFO
1612775935.998111 [127.0.0.227:43720]  SENTINEL masters
1612775937.  6160 [127.0.0.227:43720]  INFO
1612775938.  6160 [127.0.0.227:43720]  INFO
1612775938.  6173 [127.0.0.227:43720]  SENTINEL masters
1612775938.998154 [127.0.0.227:43720]  INFO
1612775940.  6157 [127.0.0.227:43720]  INFO
1612775940.  6180 [127.0.0.227:43720]  SENTINEL masters
1612775941. 14246 [127.0.0.227:43720]  INFO
1612775942.  6342 [127.0.0.227:43720]  INFO
1612775942.  6349 [127.0.0.227:43720]  SENTINEL masters
1612775943. 22168 [127.0.0.227:43720]  INFO
1612775944. 22161 [127.0.0.227:43720]  INFO
1612775944. 22178 [127.0.0.227:43720]  SENTINEL masters
1612775945. 18160 [127.0.0.227:43720]  INFO
1612775946. 22093 [127.0.0.227:43720]  INFO
1612775946. 22118 [127.0.0.227:43720]  SENTINEL masters
1612775947. 34166 [127.0.0.227:43720]  INFO
1612775948. 34163 [127.0.0.227:43720]  INFO
1612775948. 34183 [127.0.0.227:43720]  SENTINEL masters
1612775949. 46191 [127.0.0.227:43720]  INFO
1612775950. 42159 [127.0.0.227:43720]  INFO
1612775950. 42172 [127.0.0.227:43720]  SENTINEL masters

长时间观察发现 dashboard 所维护的 redis 客户端的 Recv-Q 一直不为 0:

tcp     1276      0 127.0.0.227:43720   127.0.0.63:20379      ESTABLISHED 55618/xxx_xxx_xxx

经过 tcpdump 抓包发现 Redis 客户端和 Sentinel 交互没有问题,问题出在 dashboard 从 TCP 缓冲区拉取数据时每次恰好都是拿到上一条请求的回包,从而导致解析回包错误:

Round Dashboard 请求 Dashboard 拉取的 Sentinel 回包 说明
a INFO SENTINEL masters 解析错误
a + 1 INFO INFO 拉取的其实是上个 INFO 请求的回包
SENTINEL masters INFO 解析错误
a + 2 INFO SENTINEL masters 解析错误
a + 3 INFO INFO 拉取的其实是上个 INFO 请求的回包
SENTINEL masters INFO 解析错误
a + 4 INFO SENTINEL masters 解析错误
... ... ... ...

发生此现象的原因猜测

newRedisStats 某次超时时,并未关闭该客户端,而是利用 defer 将未从输入缓冲区拉取请求回包的 redis 客户端添加至连接池导致的。

hiqsociety commented 1 year ago

can you push your request here? https://github.com/cloudxaas/codis