Closed. luoyetx closed this issue 9 months ago
I modified the source code to fix the admin restart coredump issue with the following patch:
Let me take a look first. In the meantime, you can submit a PR.
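The patch above only fixes the coredump when the catalog restarts. I'm not sure whether other parts of suez_admin_worker have problems on restart, so I'll wait for your official fix 😂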
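The coredump issue has already been fixed, but the fix has not been released yet. We will release it together with the fix for this issue.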
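After updating the code I tried several times. qrs basically no longer restarts, but searcher still restarts most of the time. The log is below. The needLaunch check finds that the process does not exist, which is strange, because on the next restart it is able to get the pid again and kill it.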
[2024-01-26 16:31:44.336930] [DEBUG] [18649,aios/hippo/src/sdk/default/DefaultProcessLauncher.cpp -- needLaunch():258] [need check slot status, slotId address:172.16.32.21, slotId id:0, currentTime:1706257904336929 lastCheckTime:1706257252400647]
[2024-01-26 16:31:44.336950] [INFO] [18649,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():236] [begin execute cmd:ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "pgrep ha_sql|xargs pwdx|grep /root/havenask-sql-remote_database.database_partition_0/ha_sql|cut -d: -f1|tr -d \"
\"" 2>&1']
[2024-01-26 16:32:14.341170] [INFO] [18649,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():243] [exec command:[ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "pgrep ha_sql|xargs pwdx|grep /root/havenask-sql-remote_database.database_partition_0/ha_sql|cut -d: -f1|tr -d \"
\"" 2>&1'] success, out[], code[0]]
[2024-01-26 16:32:14.341194] [ERROR] [18649,aios/hippo/src/sdk/default/CmdExecutor.cpp -- checkProcessExist():228] [get pid from msg[] failed]
[2024-01-26 16:32:14.341203] [DEBUG] [18649,aios/hippo/src/sdk/default/DefaultProcessLauncher.cpp -- needLaunch():265] [slot not running, need launch]
[2024-01-26 16:32:14.341216] [INFO] [18649,aios/hippo/src/sdk/default/DefaultProcessLauncher.cpp -- asyncLaunchOneSlot():145] [start process for role:[database.database_partition_0], slaveAddr:[172.16.32.21], slot:[0], declareTime:1706255194]
[2024-01-26 16:32:14.547638] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():243] [exec command:[ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "pgrep ha_sql|xargs pwdx|grep /root/havenask-sql-remote_database.database_partition_0/ha_sql|cut -d: -f1|tr -d \"
\"" 2>&1'] success, out[74], code[0]]
[2024-01-26 16:32:14.547664] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- checkProcessExist():231] [check process, cmd[docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "pgrep ha_sql|xargs pwdx|grep /root/havenask-sql-remote_database.database_partition_0/ha_sql|cut -d: -f1|tr -d \"
\""] msg[74]]
[2024-01-26 16:32:14.547677] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():236] [begin execute cmd:ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "kill -10 74" 2>&1']
[2024-01-26 16:32:14.743193] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():243] [exec command:[ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "kill -10 74" 2>&1'] success, out[], code[0]]
[2024-01-26 16:32:14.743237] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- stopProcess():200] [stop process, cmd[docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "kill -10 74"] msg[]]
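To see how long each process check in the log above took, the timestamps can simply be subtracted. Below is a minimal sketch in Python. The helper name `elapsed_seconds` is hypothetical and not part of hippo, and the timestamps are copied from the log lines above. The second duration is only an upper bound, because the log does not show the "begin execute" line for thread 18725, so it is measured from the asyncLaunchOneSlot line instead.

```python
from datetime import datetime

# Timestamp layout used inside the leading [...] of each log line.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(begin: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    start = datetime.strptime(begin, LOG_TIME_FORMAT)
    stop = datetime.strptime(end, LOG_TIME_FORMAT)
    return (stop - start).total_seconds()


# First check (thread 18649): "begin execute cmd" until "success, out[]".
first_check = elapsed_seconds(
    "2024-01-26 16:31:44.336950", "2024-01-26 16:32:14.341170"
)

# Second check (thread 18725): upper bound, from asyncLaunchOneSlot
# until "success, out[74]".
second_check_upper = elapsed_seconds(
    "2024-01-26 16:32:14.341216", "2024-01-26 16:32:14.547638"
)

print(f"first check: {first_check:.6f} s")
print(f"second check (upper bound): {second_check_upper:.6f} s")
```

The check that returned an empty `out[]` came back about 30 seconds after it began, while the check that returned pid 74 completed within roughly 0.2 seconds. This may mean the first command hung rather than that the process was really absent, but that is an assumption and the log does not confirm it.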
When suez_admin_worker is restarted, or admin leadership is transferred (by killing the current admin leader process), all searcher/qrs containers are recreated.