StarRocks / starrocks-kubernetes-operator

Kubernetes Operator for StarRocks
Apache License 2.0

"Unknown MySQL server host" during "helm uninstall" for StarRocks #558

Open bhaskarshashank99 opened 4 months ago

bhaskarshashank99 commented 4 months ago

Describe the bug

Running helm uninstall causes a segmentation fault. The logs are as follows:

Aborted at 1715973973 (unix time) try "date -d @1715973973" if you are using GNU date
PC: @ 0x0 (unknown)
SIGSEGV (@0x0) received by PID 30 (TID 0x7fdf48b98640) from PID 0; stack trace:
@ 0x83d2d9a google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fef4f499811 (unknown)
@ 0x7fef4f498e15 (unknown)
@ 0x7ff01e463520 (unknown)
@ 0x0 (unknown)
/opt/starrocks/cn_entrypoint.sh: line 185: 30 Segmentation fault $STARROCKS_HOME/bin/start_cn.sh $addition_args
[Fri May 17 19:26:13 UTC 2024] Receives signal to exit ...
[Fri May 17 19:26:13 UTC 2024] Can't find /opt/starrocks/cn/bin/cn.pid!
[Fri May 17 19:26:13 UTC 2024] try to drop myself(kube-starrocks-cn-0.kube-starrocks-cn-search.db-a72264f6b04249b.svc.cluster.local) from FE ...
ERROR 2005 (HY000): Unknown MySQL server host 'kube-starrocks-fe-service' (-3)
[Fri May 17 19:26:19 UTC 2024] Got error 1, sleep and retry ...
[Fri May 17 19:26:21 UTC 2024] try to drop myself(kube-starrocks-cn-0.kube-starrocks-cn-search.db-a72264f6b04249b.svc.cluster.local) from FE ...
ERROR 2005 (HY000): Unknown MySQL server host 'kube-starrocks-fe-service' (-3)
[Fri May 17 19:26:27 UTC 2024] Got error 1, sleep and retry ...
[Fri May 17 19:26:29 UTC 2024] try to drop myself(kube-starrocks-cn-0.kube-starrocks-cn-search.db-a72264f6b04249b.svc.cluster.local) from FE ...

The entrypoint then enters an infinite loop, printing the log above over and over. This delays the entire deletion flow: Kubernetes only kills the pod once it has stayed in this state for the maximum shutdown timeout (the pod's termination grace period).
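Judging from the log, the drop step appears to retry until it succeeds, with no other exit condition. A minimal sketch of a bounded alternative follows, assuming the drop step is a single command that exits non-zero on failure (for example, the mysql call the entrypoint already makes). The names retry_drop, MAX_ATTEMPTS and RETRY_INTERVAL are hypothetical and are not part of the real cn_entrypoint.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bounded "drop myself from FE" retry loop.
# retry_drop, MAX_ATTEMPTS and RETRY_INTERVAL are illustrative names only;
# they are NOT taken from the real cn_entrypoint.sh.

MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"     # give up after this many failed attempts
RETRY_INTERVAL="${RETRY_INTERVAL:-2}" # seconds to sleep between attempts

# retry_drop CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success, 1 if every attempt failed.
retry_drop() {
  local attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    if "$@"; then
      return 0
    fi
    echo "[$(date)] drop attempt $attempt/$MAX_ATTEMPTS failed" >&2
    # Only sleep if another attempt is still coming.
    if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
      sleep "$RETRY_INTERVAL"
    fi
    attempt=$((attempt + 1))
  done
  echo "[$(date)] giving up on dropping this node from FE" >&2
  return 1
}
```

The idea is to keep MAX_ATTEMPTS × RETRY_INTERVAL well below the pod's termination grace period, so the container exits on its own instead of waiting to be killed. The trade-off is that giving up may leave the CN registered in the FE, which matters less when the FE is being uninstalled at the same time.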

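Potential cause: the order in which the FE and CN services are deleted. The FE is deleted first and the CN afterwards, so by the time the CN tries to drop itself from the FE, the host name kube-starrocks-fe-service no longer resolves.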

Expected behavior

Running helm uninstall should terminate all involved services cleanly, without a segmentation fault or an infinite retry loop.


yandongxiao commented 4 months ago

This is the first time I have encountered this error. Could you share the exact helm uninstall command you used?

yandongxiao commented 4 months ago

I can partially reproduce your error. The kube-starrocks-cn-0 pod stays in the Terminating state and continuously logs the following errors:

[Wed Jun 26 18:52:42 CST 2024] Attempting to remove myself from the Frontend (FE) service as kube-starrocks-cn-0.kube-starrocks-cn-search.starrocks.svc.cluster.local...
ERROR 2005 (HY000): The MySQL server host 'kube-starrocks-fe-service' is unrecognized (-2)
[Wed Jun 26 18:52:42 CST 2024] Encountered error 1, will pause and then retry...
[Wed Jun 26 18:52:44 CST 2024] Attempting to remove myself from the Frontend (FE) service as kube-starrocks-cn-0.kube-starrocks-cn-search.starrocks.svc.cluster.local...
ERROR 2005 (HY000): The MySQL server host 'kube-starrocks-fe-service' is unrecognized (-2)
[Wed Jun 26 18:52:44 CST 2024] Encountered error 1, will pause and then retry...
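The two reports end their ERROR 2005 lines with different numbers: (-3) in the original log and (-2) here. Assuming the MySQL client passes through the getaddrinfo() return code and runs on glibc (where EAI_NONAME is -2 and EAI_AGAIN is -3), -2 means the name has no record at all, which a deleted Service would typically produce, while -3 means the resolver could not give an answer. The hypothetical helper classify_host_error below sketches how a retry loop could tell the two apart; it is not part of cn_entrypoint.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the trailing "(N)" code of an ERROR 2005 line.
# ASSUMPTION: N is the getaddrinfo() return code on glibc,
# where EAI_NONAME = -2 and EAI_AGAIN = -3.

# classify_host_error LINE
# Prints "name-not-known", "temporary-failure", or "other".
classify_host_error() {
  local line="$1"
  local code=""
  # Match a parenthesised integer at the very end of the line, e.g. "(-3)".
  local re='\((-?[0-9]+)\)[[:space:]]*$'
  if [[ $line =~ $re ]]; then
    code="${BASH_REMATCH[1]}"
  fi
  case "$code" in
    -2) echo "name-not-known" ;;    # EAI_NONAME: no DNS record for the name
    -3) echo "temporary-failure" ;; # EAI_AGAIN: resolver could not answer now
    *)  echo "other" ;;
  esac
}
```

Note that the original report shows -3, not -2, so a loop that stopped only on name-not-known would still spin in that case; a check like this can complement a bounded retry but not replace it.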

However, the helm uninstall command itself was not blocked, even when I ran helm uninstall --cascade foreground --debug starrocks. My current Helm version is:

helm version
version.BuildInfo{Version:"v3.12.1", GitCommit:"f32a527a060157990e2aa86bf45010dfb3cc8b8d", GitTreeState:"clean", GoVersion:"go1.20.5"}

yandongxiao commented 4 months ago

Will try to fix it.