FudanSELab / train-ticket

Train Ticket - A Benchmark Microservice System
http://139.196.152.44:32677
Apache License 2.0

deploy pod "nacos-0" Init:CrashLoopBackOff #252

Closed lingdie closed 3 months ago

lingdie commented 1 year ago

Summary

deploy pod "nacos-0" Init:CrashLoopBackOff

Expected behaviour

pod nacos-0 should run

Current behaviour

deploy pod "nacos-0" Init:CrashLoopBackOff

Steps to reproduce

make deploy

[root@ip-172-31-27-85 train-ticket]# make deploy
args num: 2
Parse DeployArgs
Start deployment Step <1/3>------------------------------------
Start to deploy mysql cluster for nacos.
NAME: nacosdb
LAST DEPLOYED: Sat Jan 28 06:21:26 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The cluster is comprised of 3 pods: 1 leader and 2 followers. Each instance is accessible within the cluster through:

    <pod-name>.nacosdb-mysql

To connect to your database:

1. Get mysql user `nacos`'s password:

    kubectl get secret -n default nacosdb-mysql -o jsonpath="{.data.mysql-password}" | base64 --decode; echo

2. Run an Ubuntu pod that you can use as a client:

    kubectl run ubuntu -n default --image=ubuntu:focal -it --rm --restart='Never' -- bash -il

3. Install the mysql client:

    apt-get update && apt-get install mysql-client -y

4. To connect to leader service in the Ubuntu pod:

    mysql -h nacosdb-mysql-leader -u nacos -p

5. To connect to follower service (read-only) in the Ubuntu pod:

    mysql -h nacosdb-mysql-follower -u nacos -p
Waiting for mysql cluster of nacos to be ready ......
Waiting for 3 pods to be ready...
Waiting for 2 pods to be ready...
Waiting for 1 pods to be ready...
partitioned roll out complete: 3 new pods have been updated...
Start to deploy nacos.
NAME: nacos
LAST DEPLOYED: Sat Jan 28 06:30:13 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Waiting for nacos to be ready ......
Waiting for 3 pods to be ready...

Your environment

OS (e.g. cat /etc/os-release): Ubuntu 22.04

Kubernetes version (use `kubectl version`):

[root@ip-172-31-27-85 ~]# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.5", GitCommit:"804d6167111f6858541cef440ccc53887fbbc96a", GitTreeState:"clean", BuildDate:"2022-12-08T10:15:02Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.5", GitCommit:"804d6167111f6858541cef440ccc53887fbbc96a", GitTreeState:"clean", BuildDate:"2022-12-08T10:08:09Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}

Additional context
[root@ip-172-31-27-85 ~]# kubectl get po -o wide
NAME              READY   STATUS                  RESTARTS      AGE     IP               NODE                                           NOMINATED NODE   READINESS GATES
nacos-0           0/1     Init:CrashLoopBackOff   4 (60s ago)   2m53s   100.85.89.134    ip-172-31-21-164.cn-north-1.compute.internal   <none>           <none>
nacosdb-mysql-0   3/3     Running                 0             11m     100.86.8.69      ip-172-31-26-198.cn-north-1.compute.internal   <none>           <none>
nacosdb-mysql-1   3/3     Running                 0             8m49s   100.85.89.133    ip-172-31-21-164.cn-north-1.compute.internal   <none>           <none>
nacosdb-mysql-2   3/3     Running                 0             5m42s   100.89.137.138   ip-172-31-18-50.cn-north-1.compute.internal    <none>           <none>
[root@ip-172-31-27-85 ~]# kubectl describe po nacos-0
Name:             nacos-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-172-31-21-164.cn-north-1.compute.internal/172.31.21.164
Start Time:       Sat, 28 Jan 2023 06:30:13 +0000
Labels:           app=nacos
                  controller-revision-hash=nacos-95879c94d
                  statefulset.kubernetes.io/pod-name=nacos-0
Annotations:      cni.projectcalico.org/containerID: 244a57d5d7b7eaa12bb99dc0845034b5d67fb2f88960f6f3e6e13a23f8c546e7
                  cni.projectcalico.org/podIP: 100.85.89.134/32
                  cni.projectcalico.org/podIPs: 100.85.89.134/32
Status:           Pending
IP:               100.85.89.134
IPs:
  IP:           100.85.89.134
Controlled By:  StatefulSet/nacos
Init Containers:
  initmysql:
    Container ID:   containerd://313c55fb9d23014b15e403b181f5ca352ba28901843002a6b415f797c85b89b5
    Image:          codewisdom/mysqlclient:0.1
    Image ID:       docker.io/codewisdom/mysqlclient@sha256:9201e8dfe5eb4e845259730a6046c7b905566119d760ed7d5aef535ace972216
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 28 Jan 2023 06:32:06 +0000
      Finished:     Sat, 28 Jan 2023 06:32:06 +0000
    Ready:          False
    Restart Count:  4
    Environment Variables from:
      nacos-mysql  Secret  Optional: false
    Environment:   <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c6xxs (ro)
Containers:
  k8snacos:
    Container ID:   
    Image:          nacos/nacos-server:2.0.1
    Image ID:       
    Ports:          8848/TCP, 7848/TCP, 9848/TCP, 9849/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     500m
      memory:  1Gi
    Environment Variables from:
      nacos-mysql  Secret  Optional: false
    Environment:
      NACOS_REPLICAS:          3
      NACOS_SERVER_PORT:       8848
      NACOS_APPLICATION_PORT:  8848
      PREFER_HOST_MODE:        hostname
      MODE:                    cluster
      NACOS_SERVERS:           nacos-0.nacos-headless.default.svc.cluster.local:8848 nacos-1.nacos-headless.default.svc.cluster.local:8848 nacos-2.nacos-headless.default.svc.cluster.local:8848
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c6xxs (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-c6xxs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  3m2s                 default-scheduler  Successfully assigned default/nacos-0 to ip-172-31-21-164.cn-north-1.compute.internal
  Normal   Pulling    3m1s                 kubelet            Pulling image "codewisdom/mysqlclient:0.1"
  Normal   Pulled     2m48s                kubelet            Successfully pulled image "codewisdom/mysqlclient:0.1" in 13.206324736s
  Normal   Created    69s (x5 over 2m48s)  kubelet            Created container initmysql
  Normal   Started    69s (x5 over 2m48s)  kubelet            Started container initmysql
  Normal   Pulled     69s (x4 over 2m47s)  kubelet            Container image "codewisdom/mysqlclient:0.1" already present on machine
  Warning  BackOff    69s (x9 over 2m46s)  kubelet            Back-off restarting failed container
[root@ip-172-31-27-85 ~]# kubectl logs nacos-0
Defaulted container "k8snacos" out of: k8snacos, initmysql (init)
Error from server (BadRequest): container "k8snacos" in pod "nacos-0" is waiting to start: PodInitializing
[root@ip-172-31-27-85 ~]# kubectl get po -o wide
NAME              READY   STATUS                  RESTARTS        AGE     IP               NODE                                           NOMINATED NODE   READINESS GATES
nacos-0           0/1     Init:CrashLoopBackOff   5 (2m24s ago)   5m39s   100.85.89.134    ip-172-31-21-164.cn-north-1.compute.internal   <none>           <none>
nacosdb-mysql-0   3/3     Running                 0               14m     100.86.8.69      ip-172-31-26-198.cn-north-1.compute.internal   <none>           <none>
nacosdb-mysql-1   3/3     Running                 0               11m     100.85.89.133    ip-172-31-21-164.cn-north-1.compute.internal   <none>           <none>
nacosdb-mysql-2   3/3     Running                 0               8m28s   100.89.137.138   ip-172-31-18-50.cn-north-1.compute.internal    <none>           <none>
[root@ip-172-31-27-85 ~]# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.5", GitCommit:"804d6167111f6858541cef440ccc53887fbbc96a", GitTreeState:"clean", BuildDate:"2022-12-08T10:15:02Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.5", GitCommit:"804d6167111f6858541cef440ccc53887fbbc96a", GitTreeState:"clean", BuildDate:"2022-12-08T10:08:09Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
Deep-Yellow commented 1 year ago

I suspect there is a problem with the persistence configuration. There is a deployment document that may help you: https://ttdoc.oss-cn-hongkong.aliyuncs.com/Steps.pdf

zyllee commented 1 year ago

I suspect there is a problem with the persistence configuration. There is a deployment document that may help you: https://ttdoc.oss-cn-hongkong.aliyuncs.com/Steps.pdf

I had the same problem and followed the instructions and it didn't work.

Deep-Yellow commented 1 year ago

I suspect there is a problem with the persistence configuration. There is a deployment document that may help you: https://ttdoc.oss-cn-hongkong.aliyuncs.com/Steps.pdf

I had the same problem and followed the instructions and it didn't work.

Use kubectl describe pod nacos-0 to view further details, and check whether the PV is correctly allocated. If not, there is something wrong with the OpenEBS configuration.

zyllee commented 1 year ago

Hi @Deep-Yellow. It seems this problem is the same as #246 and #263. When I run kubectl describe pod nacos-0, I get this result:

....
....
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m46s                  default-scheduler  Successfully assigned default/nacos-0 to master102
  Normal   Pulled     5m9s (x5 over 6m46s)   kubelet            Container image "codewisdom/mysqlclient:0.1" already present on machine
  Normal   Created    5m9s (x5 over 6m46s)   kubelet            Created container initmysql
  Normal   Started    5m9s (x5 over 6m46s)   kubelet            Started container initmysql
  Warning  BackOff    100s (x24 over 6m42s)  kubelet            Back-off restarting failed container initmysql in pod nacos-0_default(e77616fc-7854-4c8d-bbc4-35852061e6c5)

My OpenEBS status is as follows (from kubectl get pods -n openebs):

# kubectl get pods -n openebs
NAME                                           READY   STATUS    RESTARTS   AGE
openebs-localpv-provisioner-697c988cc5-6t5vp   1/1     Running   0          3h32m
openebs-ndm-cluster-exporter-87f764699-zl2gg   1/1     Running   0          3h32m
openebs-ndm-kkj6z                              1/1     Running   0          3h32m
openebs-ndm-node-exporter-n5rd8                1/1     Running   0          3h32m
openebs-ndm-operator-5b984f4966-xhx4m          1/1     Running   0          3h32m

I checked the logs for more information. Here are the results:

# kubectl logs nacos-0
Defaulted container "k8snacos" out of: k8snacos, initmysql (init)
Error from server (BadRequest): container "k8snacos" in pod "nacos-0" is waiting to start: PodInitializing

# kubectl logs nacos-0 -c initmysql
ERROR 2002 (HY000): Can't connect to MySQL server on 'nacosdb-mysql-leader' (115)

I guessed this problem was related to the PVCs or PVs, so I checked them:

# kubectl get pvc
NAME                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
data-nacosdb-mysql-0   Bound    pvc-e37d3b1c-2f5b-4acd-b62c-131be591c0e9   1Gi        RWO            openebs-hostpath   82m
data-nacosdb-mysql-1   Bound    pvc-a3c64b87-b307-4e31-9a4c-7496146e06d4   1Gi        RWO            openebs-hostpath   81m
data-nacosdb-mysql-2   Bound    pvc-5b4948cd-96a0-4d5b-b7e9-0a76e5493c82   1Gi        RWO            openebs-hostpath   80m

# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                          STORAGECLASS       REASON   AGE
pvc-5b4948cd-96a0-4d5b-b7e9-0a76e5493c82   1Gi        RWO            Delete           Bound    default/data-nacosdb-mysql-2   openebs-hostpath            80m
pvc-a3c64b87-b307-4e31-9a4c-7496146e06d4   1Gi        RWO            Delete           Bound    default/data-nacosdb-mysql-1   openebs-hostpath            81m
pvc-e37d3b1c-2f5b-4acd-b62c-131be591c0e9   1Gi        RWO            Delete           Bound    default/data-nacosdb-mysql-0   openebs-hostpath            82m

They seem to work fine.


My env: CentOS 7.9, Kubernetes 1.26.2, containerd v1.6.14, single node.

lingdie commented 1 year ago

nacos has an init container, which seems to be a MySQL client trying to connect to the MySQL cluster created earlier.

  initmysql:
    Container ID:   containerd://313c55fb9d23014b15e403b181f5ca352ba28901843002a6b415f797c85b89b5
    Image:          codewisdom/mysqlclient:0.1
    Image ID:       docker.io/codewisdom/mysqlclient@sha256:9201e8dfe5eb4e845259730a6046c7b905566119d760ed7d5aef535ace972216
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 28 Jan 2023 06:32:06 +0000
      Finished:     Sat, 28 Jan 2023 06:32:06 +0000
    Ready:          False
    Restart Count:  4
    Environment Variables from:
      nacos-mysql  Secret  Optional: false
    Environment:   <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c6xxs (ro)

Maybe the codewisdom/mysqlclient:0.1 image has been changed or is configured wrongly?
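
(For context: judging from the `Can't connect to MySQL server on 'nacosdb-mysql-leader'` error reported above, the init step presumably loops on a client check along these lines. This is a hypothetical reconstruction, not the actual contents of the codewisdom/mysqlclient:0.1 image, and the MYSQL_USER/MYSQL_PASSWORD variable names are assumptions.)

# Hypothetical sketch of the initmysql check; variable names are assumed
until mysql -h nacosdb-mysql-leader -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "SELECT 1"; do
  echo "nacosdb-mysql-leader not reachable yet, retrying..."
  sleep 5
done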

Deep-Yellow commented 1 year ago

Hi @Deep-Yellow. It seems this problem is the same as #246 and #263. When I run kubectl describe pod nacos-0, I get this result:

....
....
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m46s                  default-scheduler  Successfully assigned default/nacos-0 to master102
  Normal   Pulled     5m9s (x5 over 6m46s)   kubelet            Container image "codewisdom/mysqlclient:0.1" already present on machine
  Normal   Created    5m9s (x5 over 6m46s)   kubelet            Created container initmysql
  Normal   Started    5m9s (x5 over 6m46s)   kubelet            Started container initmysql
  Warning  BackOff    100s (x24 over 6m42s)  kubelet            Back-off restarting failed container initmysql in pod nacos-0_default(e77616fc-7854-4c8d-bbc4-35852061e6c5)

My OpenEBS status is as follows (from kubectl get pods -n openebs):

# kubectl get pods -n openebs
NAME                                           READY   STATUS    RESTARTS   AGE
openebs-localpv-provisioner-697c988cc5-6t5vp   1/1     Running   0          3h32m
openebs-ndm-cluster-exporter-87f764699-zl2gg   1/1     Running   0          3h32m
openebs-ndm-kkj6z                              1/1     Running   0          3h32m
openebs-ndm-node-exporter-n5rd8                1/1     Running   0          3h32m
openebs-ndm-operator-5b984f4966-xhx4m          1/1     Running   0          3h32m

I checked the logs for more information. Here are the results:

# kubectl logs nacos-0
Defaulted container "k8snacos" out of: k8snacos, initmysql (init)
Error from server (BadRequest): container "k8snacos" in pod "nacos-0" is waiting to start: PodInitializing

# kubectl logs nacos-0 -c initmysql
ERROR 2002 (HY000): Can't connect to MySQL server on 'nacosdb-mysql-leader' (115)

I guessed this problem was related to the PVCs or PVs, so I checked them:

# kubectl get pvc
NAME                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
data-nacosdb-mysql-0   Bound    pvc-e37d3b1c-2f5b-4acd-b62c-131be591c0e9   1Gi        RWO            openebs-hostpath   82m
data-nacosdb-mysql-1   Bound    pvc-a3c64b87-b307-4e31-9a4c-7496146e06d4   1Gi        RWO            openebs-hostpath   81m
data-nacosdb-mysql-2   Bound    pvc-5b4948cd-96a0-4d5b-b7e9-0a76e5493c82   1Gi        RWO            openebs-hostpath   80m

# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                          STORAGECLASS       REASON   AGE
pvc-5b4948cd-96a0-4d5b-b7e9-0a76e5493c82   1Gi        RWO            Delete           Bound    default/data-nacosdb-mysql-2   openebs-hostpath            80m
pvc-a3c64b87-b307-4e31-9a4c-7496146e06d4   1Gi        RWO            Delete           Bound    default/data-nacosdb-mysql-1   openebs-hostpath            81m
pvc-e37d3b1c-2f5b-4acd-b62c-131be591c0e9   1Gi        RWO            Delete           Bound    default/data-nacosdb-mysql-0   openebs-hostpath            82m

They seem to work fine.

My env: CentOS 7.9, Kubernetes 1.26.2, containerd v1.6.14, single node.

Can't connect to MySQL server on 'nacosdb-mysql-leader': it looks like Nacos couldn't connect to the database service. Are you deploying in the default namespace? You need to check whether the corresponding Kubernetes Service is working properly. One way to test: create a new pod and try a database connection from inside its container.
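
For example, a throwaway client pod can be used for that check (illustrative only; adjust the namespace and credentials to your deployment, the nacos user's password comes from the nacosdb-mysql secret shown in the deploy notes):

kubectl run mysql-test -n default --image=mysql:5.7 -it --rm --restart='Never' -- \
  mysql -h nacosdb-mysql-leader -u nacos -p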

lingdie commented 1 year ago

It seems that the nacosdb-mysql-leader service has no endpoints (the service has a label selector role=leader), and all nacosdb-mysql- pods have the role=follower label, causing the initialization to fail.

This is what I see:

# kubectl describe svc -n train-ticket nacosdb-mysql-leader                                             [23:00:55]
Name:              nacosdb-mysql-leader
Namespace:         train-ticket
Labels:            app=nacosdb-mysql
                   app.kubernetes.io/managed-by=Helm
                   chart=mysql-1.0.0
                   heritage=Helm
                   release=nacosdb
Annotations:       meta.helm.sh/release-name: nacosdb
                   meta.helm.sh/release-namespace: train-ticket
Selector:          app=nacosdb-mysql,release=nacosdb,role=leader
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.96.3.12
IPs:               10.96.3.12
Port:              mysql  3306/TCP
TargetPort:        mysql/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

Pods info:

# kubectl describe po nacosdb-mysql-0 nacosdb-mysql-1 nacosdb-mysql-2 | grep role=follower --a 5 --before-context=5                       [23:10:29]
Node:             yy-worker/172.30.16.2
Start Time:       Mon, 17 Apr 2023 22:47:34 +0800
Labels:           app=nacosdb-mysql
                  controller-revision-hash=nacosdb-mysql-8655cb47b9
                  release=nacosdb
                  role=follower
                  statefulset.kubernetes.io/pod-name=nacosdb-mysql-0
Annotations:      checksum/config: b90c58b7645b67552901e093109469a55f3033e6fb12ecaf096606a107f0be2f
                  cni.projectcalico.org/containerID: f5d57b7cf3cc89044a12393575b46c9d32eb6fcd0246a1be8cef13ec5e628320
                  cni.projectcalico.org/podIP: 100.70.117.139/32
                  cni.projectcalico.org/podIPs: 100.70.117.139/32
--
Node:             yy-worker/172.30.16.2
Start Time:       Mon, 17 Apr 2023 22:49:19 +0800
Labels:           app=nacosdb-mysql
                  controller-revision-hash=nacosdb-mysql-8655cb47b9
                  release=nacosdb
                  role=follower
                  statefulset.kubernetes.io/pod-name=nacosdb-mysql-1
Annotations:      checksum/config: b90c58b7645b67552901e093109469a55f3033e6fb12ecaf096606a107f0be2f
                  cni.projectcalico.org/containerID: 6584baec2760a57033dda68bfc29acb4a7c239639d969fe63923a406ae797df0
                  cni.projectcalico.org/podIP: 100.70.117.141/32
                  cni.projectcalico.org/podIPs: 100.70.117.141/32
--
Node:             yy-worker/172.30.16.2
Start Time:       Mon, 17 Apr 2023 22:50:59 +0800
Labels:           app=nacosdb-mysql
                  controller-revision-hash=nacosdb-mysql-8655cb47b9
                  release=nacosdb
                  role=follower
                  statefulset.kubernetes.io/pod-name=nacosdb-mysql-2
Annotations:      checksum/config: b90c58b7645b67552901e093109469a55f3033e6fb12ecaf096606a107f0be2f
                  cni.projectcalico.org/containerID: c390d650622a1ce6b24c11df15e44863643df66fdc2fc2f3f139e8304ed8e73d
                  cni.projectcalico.org/podIP: 100.70.117.143/32
                  cni.projectcalico.org/podIPs: 100.70.117.143/32

I manually edited pod nacosdb-mysql-0's label from role=follower to role=leader and it works.

Here is the patch command:

kubectl patch pod -n train-ticket nacosdb-mysql-0 --patch "{\"metadata\":{\"labels\":{\"role\":\"leader\"}}}"

After running it, nacos-0 initializes successfully. (nacos-2 is Pending because of limited resources.)

root@yy-master: ~/src/train-ticket master!
# kubectl get po -n train-ticket                                                                                                                          [23:28:41]
NAME              READY   STATUS    RESTARTS   AGE
nacos-0           1/1     Running   0          11m
nacos-1           1/1     Running   0          10m
nacos-2           0/1     Pending   0          10m
nacosdb-mysql-0   3/3     Running   0          41m
nacosdb-mysql-1   3/3     Running   0          39m
nacosdb-mysql-2   3/3     Running   0          37m

I'm not sure that this patch command works in all situations.


Update:

I find that the role=leader label on the nacosdb-mysql-* pods is supposed to be applied by the leader-start.sh script, which is defined in the ConfigMap template deployment/kubernetes-manifests/quickstart-k8s/charts/mysql/templates/configmap.yaml and mounted by deployment/kubernetes-manifests/quickstart-k8s/charts/mysql/templates/statefulset.yaml.

Script content:

#!/usr/bin/env bash
curl -X PATCH -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" -H "Content-Type: application/json-patch+json" --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/namespaces/{{ .Release.Namespace }}/pods/$HOSTNAME \
    -d '[{"op": "replace", "path": "/metadata/labels/role", "value": "leader"}]'

It seems that this script is the only thing that ever changes the role label's value to leader.
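
A quick way to see whether that patch ever landed on a pod, and whether the leader Service therefore has any endpoints, is (illustrative; adjust the namespace to where you deployed):

kubectl get pods -n train-ticket -l app=nacosdb-mysql --show-labels
kubectl get endpoints -n train-ticket nacosdb-mysql-leader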

@Deep-Yellow, I need your help to check and fix this 🤪.

luvkushp commented 1 year ago

Is there any new update in the opened case?

emilykmarx commented 1 year ago

I ran into this problem on one machine but not another. Patching the label and setting read-only was not enough for me -- several of the pods still failed to start with com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure. However, I found a different (hacky) fix that allows all pods to start.

On the failing machine, it looks like xenon fails to elect a leader in the mysql cluster, due to [ERROR] mysql[localhost:3306].ping.error[Error 1045: Access denied for user 'root'@'::1' (using password: NO)].downs:11,downslimits:3 I believe the root cause is the mysql host (localhost) resolves to ::1 on the failing machine but 127.0.0.1 on the succeeding machine, and the root user is created on 127.0.0.1 but not ::1. The failing machine logs [ERROR] mysql[localhost:3306].ping.error[dial tcp [::1]:3306: connect: connection refused].downs:0,downslimits:3 when it is trying to start up, whereas the succeeding machine has [ERROR] mysql[localhost:3306].ping.error[dial tcp 127.0.0.1:3306: connect: connection refused].downs:0,downslimits:3.

This seems to be a known issue in xenon: https://github.com/radondb/radondb-mysql-kubernetes/issues/289, but the fix there of setting mysqlOpts.rootHost doesn't work since train-ticket uses the helm version of xenon, not the operator. Instead, after starting the nacosdb-mysql cluster I manually added the root@::1 user to mysql and restarted xenon. That is: 

for pod in $(kubectl get pods --no-headers -o custom-columns=":metadata.name" | grep nacosdb-mysql); do 
  kubectl exec $pod -- mysql -uroot -e "CREATE USER 'root'@'::1' IDENTIFIED WITH mysql_native_password BY '' ; GRANT ALL ON *.* TO 'root'@'::1' WITH GRANT OPTION ;"
  kubectl exec $pod -c xenon -- /sbin/reboot
done

I do similarly for the tsdb-mysql pods before starting the services. After this, xenon elects a leader and all pods start successfully.
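
(Roughly the same loop with the grep pattern swapped, i.e. a sketch of the tsdb variant, assuming the tsdb pods follow the tsdb-mysql naming shown later in this thread:)

for pod in $(kubectl get pods --no-headers -o custom-columns=":metadata.name" | grep tsdb-mysql); do 
  kubectl exec $pod -- mysql -uroot -e "CREATE USER 'root'@'::1' IDENTIFIED WITH mysql_native_password BY '' ; GRANT ALL ON *.* TO 'root'@'::1' WITH GRANT OPTION ;"
  kubectl exec $pod -c xenon -- /sbin/reboot
done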

vincent-haoy commented 1 year ago

This is because OpenEBS wasn't properly installed using Helm. Don't use kubectl apply -f openebs.yaml.

emilykmarx commented 1 year ago

This is because OpenEBS wasn't properly installed using Helm. Don't use kubectl apply -f openebs.yaml.

That wasn't the issue for me -- I installed OpenEBS with Helm and still had this problem.

yinfangchen commented 1 year ago

I ran into this problem on one machine but not another. Patching the label and setting read-only was not enough for me -- several of the pods still failed to start with com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure. However, I found a different (hacky) fix that allows all pods to start.

On the failing machine, it looks like xenon fails to elect a leader in the mysql cluster, due to [ERROR] mysql[localhost:3306].ping.error[Error 1045: Access denied for user 'root'@'::1' (using password: NO)].downs:11,downslimits:3 I believe the root cause is the mysql host (localhost) resolves to ::1 on the failing machine but 127.0.0.1 on the succeeding machine, and the root user is created on 127.0.0.1 but not ::1. The failing machine logs [ERROR] mysql[localhost:3306].ping.error[dial tcp [::1]:3306: connect: connection refused].downs:0,downslimits:3 when it is trying to start up, whereas the succeeding machine has [ERROR] mysql[localhost:3306].ping.error[dial tcp 127.0.0.1:3306: connect: connection refused].downs:0,downslimits:3.

This seems to be a known issue in xenon: radondb/radondb-mysql-kubernetes#289, but the fix there of setting mysqlOpts.rootHost doesn't work since train-ticket uses the helm version of xenon, not the operator. Instead, after starting the nacosdb-mysql cluster I manually added the root@::1 user to mysql and restarted xenon. That is:

for pod in $(kubectl get pods --no-headers -o custom-columns=":metadata.name" | grep nacosdb-mysql); do 
  kubectl exec $pod -- mysql -uroot -e "CREATE USER 'root'@'::1' IDENTIFIED WITH mysql_native_password BY '' ; GRANT ALL ON *.* TO 'root'@'::1' WITH GRANT OPTION ;"
  kubectl exec $pod -c xenon -- /sbin/reboot
done

I do similarly for the tsdb-mysql pods before starting the services. After this, xenon elects a leader and all pods start successfully.

I tried this. Sometimes it works, but sometimes it does not. When should I run this shell script, @emilykmarx? What does "before starting the services" mean? Should I execute the script when all three pods are running? Thanks a lot :)

emilykmarx commented 1 year ago

@yinfangchen When using the "all_in_one" option for mysql, the shell script needs to be run twice: once after the nacosdb-mysql cluster starts (i.e. after kubectl rollout status statefulset/$nacosDBRelease-mysql -n $namespace succeeds), and once after the tsdb-mysql cluster starts (i.e. after gen_secret_for_services $tsUser $tsPassword $tsDB "${tsMysqlName}-mysql-leader succeeds).

yinfangchen commented 1 year ago

Thanks for your reply, @emilykmarx! Yes, I run make deploy directly, which should be the "all-in-one" option for the MySQL cluster.

I ran the script after all three nacosdb-mysql-X pods were ready (but not the nacos-X pods):

NAMESPACE   NAME              READY   STATUS
default     nacos-0           0/1     Init:Error
default     nacosdb-mysql-0   3/3     Running
default     nacosdb-mysql-1   3/3     Running
default     nacosdb-mysql-2   3/3     Running

However, after applying the script, the slaves of nacosdb-mysql generate these logs as shown below:

mysql 2023-10-18T05:27:34.017896+08:00 510 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'mysql-bin.000002' at position 154, relay log '/var/lib/mysql/mysql-r
mysql 2023-10-18T05:27:34.018153+08:00 510 [ERROR] Slave SQL for channel '': Error 'Operation CREATE USER failed for 'root'@'::1'' on query. Default database: ''. Query: 'CREATE USER 'root'@':
mysql 2023-10-18T05:27:34.018168+08:00 510 [Warning] Slave: Operation CREATE USER failed for 'root'@'::1' Error_code: 1396
mysql 2023-10-18T05:27:34.018171+08:00 510 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql
mysql 2023-10-18T05:27:40.094289+08:00 511 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'mysql-bin.000002' at position 154, relay log '/var/lib/mysql/mysql-r
mysql 2023-10-18T05:27:40.094933+08:00 511 [ERROR] Slave SQL for channel '': Error 'Operation CREATE USER failed for 'root'@'::1'' on query. Default database: ''. Query: 'CREATE USER 'root'@':
mysql 2023-10-18T05:27:40.094952+08:00 511 [Warning] Slave: Operation CREATE USER failed for 'root'@'::1' Error_code: 1396
...

The master looks right, but its logs show the slaves are running incorrectly:

mysql.master.status:&{mysql-bin.000002 11310 true true 704ba7ff-e808-449a-a470-ccf44941a661:1-16 0}
mysal.slave.status:&{ 0 false false }
mysql.master.status:&{mysql-bin.000002 11310 true true 704ba7ff-e808-449a-a470-ccf44941a661:1-16 0}
mysal.slave.status:&{ 0 false false }
...

Update: I solved this by creating "breakpoints" after the deployments of nacosdb-mysql and tsdb-mysql. Instead of using make deploy, I run the make steps one at a time: I stop after deploying nacosdb-mysql and wait for its pods to be ready; similarly, I continue and stop after deploying tsdb-mysql and wait for its pods to be ready, before doing the rest of the deployment.
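
In practice that means waiting for each MySQL StatefulSet to roll out before continuing, roughly like this (a sketch; the StatefulSet names are inferred from the pod names above):

kubectl rollout status statefulset/nacosdb-mysql -n default   # first breakpoint: apply the root@'::1' fix here
kubectl rollout status statefulset/tsdb-mysql -n default      # second breakpoint, before deploying the ts-* services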

emilykmarx commented 1 year ago

@yinfangchen

creating "breakpoints"

Yes, this is what I do as well. I don't really need mysql replication for my purposes so I didn't look carefully at the slave statuses, but I am seeing the same logs as you mention even when creating the "breakpoints", so I'm not sure that truly solves it. Maybe there is a race so it sometimes appears solved?

Anyway I'm not planning on investigating further since I don't need replication, but if you'd like to -- one thought is to look more closely at how xenon sets up the 'root'@'127.0.0.1' user, and make sure everything is set up the same way for the 'root'@'::1' user. I think this happens here (although it may be different in the version of xenon used in train-ticket), which is where I got the sql statements in the bash script (out of curiosity I did try adding FLUSH PRIVILEGES; and RESET SLAVE ALL; to the end of the script, but it didn't help). Let me know if you find a solution - I'm curious :)

AlessandroR1273 commented 5 months ago

Sorry, I'm having problems similar to yours deploying train-ticket on minikube using make deploy. After I run the patch

kubectl patch pod -n train-ticket nacosdb-mysql-0 --patch "{\"metadata\":{\"labels\":{\"role\":\"leader\"}}}"

the nacos-X pods start, but almost every other pod (except a few) is in CrashLoopBackOff status. I've tried the fixes suggested by @emilykmarx and @yinfangchen, but nothing worked. Probably, since I'm new to Kubernetes and train-ticket, I am missing some step, but I'm unable to figure out which one.

kubectl get pods -n train-ticket
NAME                                           READY   STATUS             RESTARTS         AGE
nacos-0                                        1/1     Running            0                9h
nacos-1                                        1/1     Running            0                9h
nacos-2                                        1/1     Running            0                9h
nacosdb-mysql-0                                2/3     Running            0                9h
nacosdb-mysql-1                                2/3     Running            0                9h
nacosdb-mysql-2                                2/3     Running            0                9h
rabbitmq-6b45f6b576-qjfdk                      1/1     Running            0                9h
ts-admin-basic-info-service-7c5fcd444d-jrk7r   0/1     CrashLoopBackOff   67 (4m59s ago)   9h
ts-admin-order-service-7bd9dcd58-fvgg8         0/1     Running            70 (5m36s ago)   9h
ts-admin-route-service-745bc4b8c4-pcst4        0/1     Running            68 (5m45s ago)   9h
ts-admin-travel-service-7959c77bc-v94cb        0/1     CrashLoopBackOff   67 (5m16s ago)   9h
ts-admin-user-service-7456587c5-dr2gj          0/1     CrashLoopBackOff   69 (4m29s ago)   9h
ts-assurance-service-8cb959cd4-n25dt           0/1     CrashLoopBackOff   69 (113s ago)    9h
ts-auth-service-8578d6fc48-ljrx8               0/1     CrashLoopBackOff   68 (3m51s ago)   9h
ts-avatar-service-77dbf84c6f-gjqxp             1/1     Running            0                9h
ts-basic-service-57bfc8c599-88kqf              0/1     CrashLoopBackOff   69 (5m7s ago)    9h
ts-cancel-service-6ccf6d9f47-nt282             0/1     Running            67 (5m19s ago)   9h
ts-config-service-64df858d44-8j7dd             0/1     Error              67 (7m34s ago)   9h
ts-consign-price-service-754f69d7f6-gkwkh      0/1     CrashLoopBackOff   69 (2m19s ago)   9h
ts-consign-service-664b76b857-g9rj6            0/1     CrashLoopBackOff   68 (2m18s ago)   9h
ts-contacts-service-5cbb865568-9vw9v           0/1     CrashLoopBackOff   70 (2m23s ago)   9h
ts-delivery-service-777dc5f7db-sx46z           0/1     Running            72 (6m53s ago)   9h
ts-execute-service-65ff45c5d6-gbjxf            0/1     CrashLoopBackOff   67 (95s ago)     9h
ts-food-delivery-service-98b4bfc6-ctlnr        0/1     CrashLoopBackOff   65 (2m6s ago)    9h
ts-food-service-8fd65b7db-c5dn5                0/1     CrashLoopBackOff   70 (2m26s ago)   9h
ts-gateway-service-9b7c5cfd7-9ncl4             0/1     Running            67 (6m10s ago)   9h
ts-inside-payment-service-55f4b7d9b4-vj4b4     0/1     Error              66 (7m53s ago)   9h
ts-news-service-7db9d5bc96-49s6q               1/1     Running            0                9h
ts-notification-service-75f9c77c8-9fffj        0/1     CrashLoopBackOff   66 (5m2s ago)    9h
ts-order-other-service-7bb8c4597-nstdz         0/1     Running            68 (5m36s ago)   9h
ts-order-service-6cdfdf7df9-mb7hc              0/1     Running            70 (7m9s ago)    9h
ts-payment-service-784f5d456b-dcfzt            0/1     CrashLoopBackOff   60 (117s ago)    9h
ts-preserve-other-service-655db949b4-jwlkl     0/1     CrashLoopBackOff   64 (3m13s ago)   9h
ts-preserve-service-5cc4689-6bf4c              0/1     CrashLoopBackOff   65 (4m3s ago)    9h
ts-price-service-74fc78d84d-q6jw7              0/1     CrashLoopBackOff   63 (113s ago)    9h
ts-rebook-service-5ff9548cf6-z5x47             0/1     CrashLoopBackOff   65 (3m37s ago)   9h
ts-route-plan-service-7d46d9887b-pjn2v         0/1     CrashLoopBackOff   65 (5m14s ago)   9h
ts-route-service-66fcd66cb8-676q2              0/1     Running            61 (7m29s ago)   9h
ts-seat-service-77db454874-6pvt2               0/1     CrashLoopBackOff   62 (3m11s ago)   9h
ts-security-service-65fc6f87fd-x7fxm           0/1     CrashLoopBackOff   65 (20s ago)     9h
ts-station-food-service-b655bfb56-d97d9        0/1     CrashLoopBackOff   64 (114s ago)    9h
ts-station-service-8557c6c9b7-547b4            0/1     CrashLoopBackOff   65 (118s ago)    9h
ts-ticket-office-service-748997dd88-q9snt      1/1     Running            21 (7h2m ago)    9h
ts-train-food-service-7d98bc469b-kp9fk         0/1     Running            65 (7m5s ago)    9h
ts-train-service-6478f56b9d-2tktw              0/1     CrashLoopBackOff   62 (2m9s ago)    9h
ts-travel-plan-service-86ccb87d5c-q9zn7        0/1     Running            62 (7m18s ago)   9h
ts-travel-service-7878bcb89b-bv2b6             0/1     CrashLoopBackOff   63 (4m58s ago)   9h
ts-travel2-service-58c65f66b6-vbldj            0/1     CrashLoopBackOff   66 (55s ago)     9h
ts-ui-dashboard-6fb976bc5c-x6h2f               1/1     Running            0                9h
ts-user-service-6d878cf5fc-tdkxx               0/1     CrashLoopBackOff   63 (109s ago)    9h
ts-verification-code-service-94b945f6b-bqkzt   0/1     Running            67 (5m23s ago)   9h
ts-voucher-service-86b967d98-j5jgh             1/1     Running            21 (7h2m ago)    9h
ts-wait-order-service-59895d98cf-9g6dz         0/1     CrashLoopBackOff   67 (4m53s ago)   9h
tsdb-mysql-0                                   2/3     Running            0                9h
tsdb-mysql-1                                   3/3     Running            0                9h
tsdb-mysql-2                                   3/3     Running            0                9h

Can anyone help me with this? Thanks a lot <3

datawine commented 5 months ago

I investigated a bit and modified train-ticket/deployment/kubernetes-manifests/quickstart-k8s/charts/mysql/values.yaml to add bind-address (previously absent) as 0.0.0.0:

  configFiles:
    node.cnf: |
      [mysqld]
      bind-address = 0.0.0.0
      default_storage_engine=InnoDB
      max_connections=65535

This works for me :)

Also note that if the log outputs [ERROR] InnoDB: io_setup() failed with EAGAIN after 5 attempts., you might want to set ulimit and aio-max-nr to bigger numbers.
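
For example, something along these lines on each node (illustrative values only):

# raise the kernel-wide async I/O limit; persist it in /etc/sysctl.conf if it helps
sudo sysctl -w fs.aio-max-nr=1048576
# raise the open-file limit for the shell/session running the containers
ulimit -n 65535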

@AlessandroR1273

micturkey commented 4 months ago

I investigated a bit and modified train-ticket/deployment/kubernetes-manifests/quickstart-k8s/charts/mysql/values.yaml to add bind-address (previously absent) as 0.0.0.0:

  configFiles:
    node.cnf: |
      [mysqld]
      bind-address = 0.0.0.0
      default_storage_engine=InnoDB
      max_connections=65535

This works for me :)

Also note that if the log outputs [ERROR] InnoDB: io_setup() failed with EAGAIN after 5 attempts., you might want to set ulimit and aio-max-nr to bigger numbers.

@AlessandroR1273

Thanks! This fix solves my issue.