alibaba / havenask

Apache License 2.0

[Question] Why set restart=true by default in hippo/default/ProcessStartWorkItem.cpp #239

Closed luoyetx closed 9 months ago

luoyetx commented 10 months ago

code

When suez_admin_worker is restarted, or admin leadership is transferred (by killing the current admin leader process), all searcher/qrs containers are recreated.

luoyetx commented 10 months ago

I modified the source code to fix the admin restart coredump issue as follows:

[image]

xuxijie commented 10 months ago

Let me take a look first. In the meantime, you can submit a PR.

luoyetx commented 10 months ago

The patch above only fixes the coredump on catalog restart. I'm not sure whether other parts of suez_admin_worker have problems on restart, so I'll wait for your official fix 😂

xuxijie commented 10 months ago

The coredump issue has already been fixed but not yet released. We will release it together with the fix for this issue.

luoyetx commented 10 months ago

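After updating the code I tried several times. qrs basically no longer restarts, but searcher still restarts most of the time. The log is below: the needLaunch check finds that the process does not exist. This is strange, because during the restart that follows it can get the pid again and kill the process.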

[2024-01-26 16:31:44.336930] [DEBUG] [18649,aios/hippo/src/sdk/default/DefaultProcessLauncher.cpp -- needLaunch():258] [need check slot status, slotId address:172.16.32.21, slotId id:0, currentTime:1706257904336929 lastCheckTime:1706257252400647]
[2024-01-26 16:31:44.336950] [INFO] [18649,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():236] [begin execute cmd:ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "pgrep ha_sql|xargs pwdx|grep /root/havenask-sql-remote_database.database_partition_0/ha_sql|cut -d: -f1|tr -d \"
\"" 2>&1']
[2024-01-26 16:32:14.341170] [INFO] [18649,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():243] [exec command:[ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "pgrep ha_sql|xargs pwdx|grep /root/havenask-sql-remote_database.database_partition_0/ha_sql|cut -d: -f1|tr -d \"
\"" 2>&1'] success, out[], code[0]]
[2024-01-26 16:32:14.341194] [ERROR] [18649,aios/hippo/src/sdk/default/CmdExecutor.cpp -- checkProcessExist():228] [get pid from msg[] failed]
[2024-01-26 16:32:14.341203] [DEBUG] [18649,aios/hippo/src/sdk/default/DefaultProcessLauncher.cpp -- needLaunch():265] [slot not running, need launch]
[2024-01-26 16:32:14.341216] [INFO] [18649,aios/hippo/src/sdk/default/DefaultProcessLauncher.cpp -- asyncLaunchOneSlot():145] [start process for role:[database.database_partition_0], slaveAddr:[172.16.32.21], slot:[0], declareTime:1706255194]
[2024-01-26 16:32:14.547638] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():243] [exec command:[ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "pgrep ha_sql|xargs pwdx|grep /root/havenask-sql-remote_database.database_partition_0/ha_sql|cut -d: -f1|tr -d \"
\"" 2>&1'] success, out[74], code[0]]
[2024-01-26 16:32:14.547664] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- checkProcessExist():231] [check process, cmd[docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "pgrep ha_sql|xargs pwdx|grep /root/havenask-sql-remote_database.database_partition_0/ha_sql|cut -d: -f1|tr -d \"
\""] msg[74]]
[2024-01-26 16:32:14.547677] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():236] [begin execute cmd:ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "kill -10 74" 2>&1']
[2024-01-26 16:32:14.743193] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- execute():243] [exec command:[ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no 172.16.32.21 'docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "kill -10 74" 2>&1'] success, out[], code[0]]
[2024-01-26 16:32:14.743237] [INFO] [18725,aios/hippo/src/sdk/default/CmdExecutor.cpp -- stopProcess():200] [stop process, cmd[docker exec -u root -t havenask_container_havenask-sql-remote_0 /bin/bash -c "kill -10 74"] msg[]]
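
In the log above, the first pid check returns empty output with exit code 0 about 30 seconds after it began (16:31:44.336950 to 16:32:14.341170), while the same pipeline returns pid 74 roughly 0.2 seconds after the launch starts. As a minimal C++ sketch, and not hippo's actual implementation, the following shows one way a checker could separate "no pid found" from "result inconclusive". The names `parsePid`, `classifySlot`, and `SlotState`, the `timeoutSec` parameter, and the rule that a slow empty reply counts as inconclusive are all hypothetical and are not taken from the havenask source.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper (not from havenask): parse one pid from command output.
// Leading/trailing whitespace (spaces, '\r', '\n') is ignored; anything else
// that is not a decimal digit makes the parse fail.
std::optional<int> parsePid(const std::string& msg) {
    std::size_t begin = 0, end = msg.size();
    while (begin < end && std::isspace(static_cast<unsigned char>(msg[begin]))) ++begin;
    while (end > begin && std::isspace(static_cast<unsigned char>(msg[end - 1]))) --end;
    if (begin == end) return std::nullopt;  // empty output, like msg[] in the log

    int pid = 0;
    for (std::size_t i = begin; i < end; ++i) {
        unsigned char c = static_cast<unsigned char>(msg[i]);
        if (!std::isdigit(c)) return std::nullopt;
        if (pid > 100000000) return std::nullopt;  // guard against int overflow
        pid = pid * 10 + (c - '0');
    }
    if (pid <= 0) return std::nullopt;
    return pid;
}

enum class SlotState { Running, NotRunning, Unknown };

// Hypothetical policy (an assumption, not the project's behavior):
// - non-zero exit code                      -> Unknown
// - output parses as a pid                  -> Running
// - non-empty output that is not a pid      -> Unknown (e.g. an error message)
// - empty output that took >= timeoutSec    -> Unknown (check may have stalled)
// - empty output that came back quickly     -> NotRunning
SlotState classifySlot(const std::string& out, int exitCode,
                       double elapsedSec, double timeoutSec) {
    if (exitCode != 0) return SlotState::Unknown;
    if (parsePid(out)) return SlotState::Running;
    for (char ch : out) {
        if (!std::isspace(static_cast<unsigned char>(ch))) return SlotState::Unknown;
    }
    if (elapsedSec >= timeoutSec) return SlotState::Unknown;
    return SlotState::NotRunning;
}
```

Under this assumed policy, a caller would relaunch the slot only on `SlotState::NotRunning` and retry the check on `SlotState::Unknown`. Whether the 30-second empty reply in the log really is a stalled check is not established by the log alone.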