cloudwu / skynet

A lightweight online game framework
MIT License

Using a Redis cluster: MOVED occasionally occurs, after which the connection fails and commands raise errors #2000

Open yfengworld opened 5 days ago

yfengworld commented 5 days ago

Error log:

    [:0000009a][ERROR][00:00:45.45][err_handle.lua:7] ../skynet/lualib/skynet/db/redis/cluster.lua:407: Too many Cluster redirections?,maybe node is disconnected (last error: " 15067 172.16.2.67:6379")
    stack traceback:
        ../src/share/libs/err_handle.lua:6: in function 'err_handle.error_handler'
        [C]: in function 'error'
        ../skynet/lualib/skynet/db/redis/cluster.lua:407: in function <../skynet/lualib/skynet/db/redis/cluster.lua:315>
        (...tail calls...)

Also, once the process is in this abnormal state, any connection to debug_console is closed immediately, so it cannot be used.

cloudwu commented 5 days ago

Read the code to find the problem.

If new connections cannot be established, check the maximum number of open files.

cc @sundream

yfengworld commented 5 days ago

Following the log, I traced the problem to the rediscluster:call(...) function in cluster.lua. A MOVED occurs and returns the correct IP and port, but the command is then executed again here:

    local result = {pcall(function ()
        -- TODO: use pipelining to send asking and save a rtt.
        if asking then
            conn:asking()
        end
        asking = false
        local func = conn[cmd]
        return func(conn,table.unpack(argv,2))
    end)}
    local ok = result[1]
    if not ok then
        err = table.unpack(result,2)
        err = tostring(err)
        syslog.error("rediscluster socket error %s", err)

The err printed here is ../skynet/lualib/skynet/socketchannel.lua:482: MOVED 1918 172.16.2.207:6379, and once the retries are exhausted the error is thrown. I checked, and the key really is on node 172.16.2.207.

I also checked the file descriptor limit:

    root@ybxz-obt-center:/data/ybxz-obt# ulimit -n
    102400
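For readers following along, here is a minimal sketch of how a redirection error string like the one quoted above could be picked apart. `parse_redirect` is a hypothetical helper written for this illustration; it is not taken from skynet's cluster.lua, and it assumes an IPv4 or hostname address in the `host:port` form.

```lua
-- Hypothetical helper (not part of skynet): pull the redirect kind, slot,
-- host and port out of a Redis Cluster redirection error string.
-- Tolerates a leading "file:line: " prefix such as the one in the log.
local function parse_redirect(err)
    local kind, slot, host, port =
        tostring(err):match("(%u+) (%d+) ([^%s:]+):(%d+)")
    if kind ~= "MOVED" and kind ~= "ASK" then
        return nil -- not a redirection this sketch recognizes
    end
    return {
        kind = kind,
        slot = tonumber(slot),
        host = host,
        port = tonumber(port),
    }
end

local r = parse_redirect(
    "../skynet/lualib/skynet/socketchannel.lua:482: MOVED 1918 172.16.2.207:6379")
print(r.kind, r.slot, r.host, r.port)
```

Note that the "last error" in the original log, `" 15067 172.16.2.67:6379"`, has no leading `MOVED` keyword, so this helper returns nil for it.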

firedtoad commented 5 days ago

You need to compute the node ID in advance.
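The suggestion is terse. One possible reading is that the client should work out which node owns a key before sending the command, and in Redis Cluster that starts from the key's hash slot: CRC16 (XMODEM variant) of the key, or of its `{hash tag}` if present, modulo 16384. The sketch below assumes Lua 5.3 or later for the bitwise operators; `crc16` and `key_slot` are names chosen for this illustration.

```lua
-- CRC16/XMODEM: polynomial 0x1021, initial value 0, no reflection.
local function crc16(s)
    local crc = 0
    for i = 1, #s do
        crc = crc ~ (s:byte(i) << 8)
        for _ = 1, 8 do
            if (crc & 0x8000) ~= 0 then
                crc = ((crc << 1) ~ 0x1021) & 0xFFFF
            else
                crc = (crc << 1) & 0xFFFF
            end
        end
    end
    return crc
end

-- Hash slot of a key. If the key contains "{...}" with at least one
-- character between the braces, only that part is hashed.
local function key_slot(key)
    local s = key:find("{", 1, true)
    if s then
        local e = key:find("}", s + 1, true)
        if e and e > s + 1 then
            key = key:sub(s + 1, e - 1)
        end
    end
    return crc16(key) % 16384
end

print(key_slot("foo"))
```

Keys that share a hash tag, such as `{user1000}.following` and `{user1000}.followers`, land in the same slot and therefore on the same node.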

yfengworld commented 4 days ago

What do you mean?

firedtoad commented 4 days ago

Most likely you are not connected to all of the nodes.
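One way to check this diagnosis is to compare the nodes that own slots against the nodes the client actually holds a connection to. The sketch below is only an illustration: the slot ranges and addresses are made up, the table shape is loosely modelled on a `CLUSTER SLOTS` reply, and `missing_nodes` is a hypothetical helper, not a skynet API.

```lua
-- Hypothetical topology; addresses and slot ranges are invented.
local slot_ranges = {
    { first = 0,     last = 5460,  addr = "10.0.0.1:6379" },
    { first = 5461,  last = 10922, addr = "10.0.0.2:6379" },
    { first = 10923, last = 16383, addr = "10.0.0.3:6379" },
}

-- Hypothetical set of nodes the client currently has connections to.
local connected = { ["10.0.0.1:6379"] = true }

-- Return, in order of first appearance, every slot-owning node
-- that has no entry in the connected set.
local function missing_nodes(ranges, conns)
    local missing, seen = {}, {}
    for _, r in ipairs(ranges) do
        if not conns[r.addr] and not seen[r.addr] then
            seen[r.addr] = true
            missing[#missing + 1] = r.addr
        end
    end
    return missing
end

for _, addr in ipairs(missing_nodes(slot_ranges, connected)) do
    print("no connection to", addr)
end
```

A key whose slot belongs to one of the listed nodes would be redirected with MOVED, and the command can only succeed if the client is then able to reach that node.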

yfengworld commented 3 days ago

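We run a number of guild services to handle guilds, and each service connects to the Redis cluster, so we suspected there were too many connections. After we reduced the number of guild services, the problem no longer occurred, but we are not sure of the exact cause.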