apache / apisix-helm-chart

Apache APISIX Helm Chart
https://apisix.apache.org/
Apache License 2.0

Apisix-Etcd-0 CrashLoopBackOff #768

Open shayktrust opened 4 months ago

shayktrust commented 4 months ago

Name and Version

targetRevision: v2.7.0

EKS Version

1.29

What architecture are you using?

amd64

What steps will reproduce the bug?

Deploy the chart
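
For reference, a minimal reproduction sketch, assuming the chart is installed directly with Helm rather than through ArgoCD (the api-gw namespace and the 2.7.0 chart version are taken from this report; values.yaml stands in for the custom values mentioned below):

helm repo add apisix https://charts.apiseven.com
helm install apisix apisix/apisix --version 2.7.0 -n api-gw --create-namespace -f values.yaml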

Are you using any custom parameters or values?

Yes

What is the expected behavior?

apisix-etcd-0, apisix-etcd-1, and apisix-etcd-2 in the Running state

What do you see instead?

CrashLoopBackOff

{"level":"warn","ts":"2024-07-08T10:17:39.664Z","caller":"etcdserver/server.go:1127","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2024-07-08T10:17:39.664Z","caller":"etcdserver/server.go:1128","msg":"data-dir used by this member must be removed"}
{"level":"warn","ts":"2024-07-08T10:17:39.665Z","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"dca459b91c9da974","local-member-attributes":"{Name:apisix-etcd-0 ClientURLs:[http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2379 http://apisix-etcd.api-gw.svc.cluster.local:2379]}","request-path":"/0/members/dca459b91c9da974/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2024-07-08T10:17:39.665Z","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"dca459b91c9da974","local-member-attributes":"{Name:apisix-etcd-0 ClientURLs:[http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2379 http://apisix-etcd.api-gw.svc.cluster.local:2379]}","request-path":"/0/members/dca459b91c9da974/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2024-07-08T10:17:39.665Z","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"dca459b91c9da974","local-member-attributes":"{Name:apisix-etcd-0 ClientURLs:[http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2379 http://apisix-etcd.api-gw.svc.cluster.local:2379]}","request-path":"/0/members/dca459b91c9da974/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2024-07-08T10:17:39.665Z","caller":"etcdserver/server.go:2073","msg":"stopped publish because server is stopped","local-member-id":"dca459b91c9da974","local-member-attributes":"{Name:apisix-etcd-0 ClientURLs:[http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2379 http://apisix-etcd.api-gw.svc.cluster.local:2379]}","publish-timeout":"7s","error":"etcdserver: server stopped"}
kubectl get all -n api-gw 
NAME                                             READY   STATUS             RESTARTS         AGE
pod/apisix-6fdf6b9c66-64z2f                      1/1     Running            0                17h
pod/apisix-6fdf6b9c66-brsjk                      1/1     Running            0                17h
pod/apisix-6fdf6b9c66-jz2rk                      1/1     Running            0                17h
pod/apisix-etcd-0                                0/1     CrashLoopBackOff   203 (38s ago)    17h
pod/apisix-etcd-1                                0/1     CrashLoopBackOff   202 (5m2s ago)   17h
pod/apisix-etcd-2                                1/1     Running            0                17h
pod/apisix-ingress-controller-844c65bfdf-5v799   1/1     Running            0                17h
pod/apisix-ingress-controller-844c65bfdf-fzhvg   1/1     Running            0                22h

NAME                                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/apisix-admin                               ClusterIP   172.20.187.18    <none>        9180/TCP                     22h
service/apisix-etcd                                ClusterIP   172.20.177.186   <none>        2379/TCP,2380/TCP            22h
service/apisix-etcd-headless                       ClusterIP   None             <none>        2379/TCP,2380/TCP            22h
service/apisix-gateway                             NodePort    172.20.111.3     <none>        80:31196/TCP,443:31359/TCP   22h
service/apisix-ingress-controller                  ClusterIP   172.20.246.190   <none>        80/TCP                       22h
service/apisix-ingress-controller-apisix-gateway   NodePort    172.20.219.206   <none>        80:30461/TCP,443:31371/TCP   22h

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/apisix                      3/3     3            3           22h
deployment.apps/apisix-ingress-controller   2/2     2            2           22h

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/apisix-6fdf6b9c66                      3         3         3       22h
replicaset.apps/apisix-ingress-controller-844c65bfdf   2         2         2       22h

NAME                           READY   AGE
statefulset.apps/apisix-etcd   1/3     22h

NAME                                         REFERENCE           TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/apisix   Deployment/apisix   7%/80%, 61%/80%   3         6         3          22h
kubectl describe pods -n api-gw apisix-etcd-0
Name:             apisix-etcd-0
Namespace:        api-gw
Priority:         0
Service Account:  default
Node:             ip-10-0-18-102.eu-north-1.compute.internal/10.0.18.102
Start Time:       Sun, 07 Jul 2024 20:15:28 +0300
Labels:           app.kubernetes.io/instance=apisix
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=etcd
                  apps.kubernetes.io/pod-index=0
                  controller-revision-hash=apisix-etcd-5d9864fd68
                  helm.sh/chart=etcd-8.7.7
                  statefulset.kubernetes.io/pod-name=apisix-etcd-0
Annotations:      checksum/token-secret: 622d20823882c1300c1be66970c8a4304a57e6d674f4c7da8a29e8e8062bb7c1
Status:           Running
IP:               10.0.18.31
IPs:
  IP:           10.0.18.31
Controlled By:  StatefulSet/apisix-etcd
Containers:
  etcd:
    Container ID:   containerd://3e7d388fe249ab387b0f2af890addeffc3fe592b8ee8f4e47362d2e6dd33f13a
    Image:          docker.io/bitnami/etcd:3.5.7-debian-11-r14
    Image ID:       docker.io/bitnami/etcd@sha256:0825cafa1c5f0c97d86009f3af8c0f5a9d4279fcfdeb0a2a09b84a1eb7893a13
    Ports:          2379/TCP, 2380/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 08 Jul 2024 13:27:59 +0300
      Finished:     Mon, 08 Jul 2024 13:28:04 +0300
    Ready:          False
    Restart Count:  203
    Liveness:       exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=30s #success=1 #failure=5
    Readiness:      exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=10s #success=1 #failure=5
    Environment:
      BITNAMI_DEBUG:                                       false
      MY_POD_IP:                                            (v1:status.podIP)
      MY_POD_NAME:                                         apisix-etcd-0 (v1:metadata.name)
      MY_STS_NAME:                                         apisix-etcd
      ETCDCTL_API:                                         3
      ETCD_ON_K8S:                                         yes
      ETCD_START_FROM_SNAPSHOT:                            no
      ETCD_DISASTER_RECOVERY:                              no
      ETCD_NAME:                                           $(MY_POD_NAME)
      ETCD_DATA_DIR:                                       /bitnami/etcd/data
      ETCD_LOG_LEVEL:                                      info
      ALLOW_NONE_AUTHENTICATION:                           yes
      ETCD_AUTH_TOKEN:                                     jwt,priv-key=/opt/bitnami/etcd/certs/token/jwt-token.pem,sign-method=RS256,ttl=10m
      ETCD_ADVERTISE_CLIENT_URLS:                          http://$(MY_POD_NAME).apisix-etcd-headless.api-gw.svc.cluster.local:2379,http://apisix-etcd.api-gw.svc.cluster.local:2379
      ETCD_LISTEN_CLIENT_URLS:                             http://0.0.0.0:2379
      ETCD_INITIAL_ADVERTISE_PEER_URLS:                    http://$(MY_POD_NAME).apisix-etcd-headless.api-gw.svc.cluster.local:2380
      ETCD_LISTEN_PEER_URLS:                               http://0.0.0.0:2380
      ETCD_INITIAL_CLUSTER_TOKEN:                          etcd-cluster-k8s
      ETCD_INITIAL_CLUSTER_STATE:                          new
      ETCD_INITIAL_CLUSTER:                                apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.api-gw.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.api-gw.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.api-gw.svc.cluster.local:2380
      ETCD_CLUSTER_DOMAIN:                                 apisix-etcd-headless.api-gw.svc.cluster.local
      NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME:          dev
      NEW_RELIC_METADATA_KUBERNETES_NODE_NAME:              (v1:spec.nodeName)
      NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME:        api-gw (v1:metadata.namespace)
      NEW_RELIC_METADATA_KUBERNETES_POD_NAME:              apisix-etcd-0 (v1:metadata.name)
      NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME:        etcd
      NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME:  docker.io/bitnami/etcd:3.5.7-debian-11-r14
    Mounts:
      /bitnami/etcd from data (rw)
      /opt/bitnami/etcd/certs/token/ from etcd-jwt-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h8wk9 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-apisix-etcd-0
    ReadOnly:   false
  etcd-jwt-token:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  apisix-etcd-jwt-token
    Optional:    false
  kube-api-access-h8wk9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Warning  BackOff  2s (x4811 over 17h)  kubelet  Back-off restarting failed container etcd in pod apisix-etcd-0_api-gw(607771a9-2674-40b5-a6d5-4be0971f0706)
kworkbee commented 4 months ago

I also deploy and operate APISIX on EKS, and I experienced the same phenomenon for a long time. In my case, I traced it to etcd losing quorum while Karpenter, which manages the nodes in our EKS cluster, rescheduled the etcd pods onto different nodes. I don't know Karpenter in detail, so I can't advise you there, but to reduce the chance of losing quorum during node rearrangement I increased etcd's replicaCount (as sketched below), and the cluster has recently stabilized.
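
For reference, a minimal values.yaml sketch of that change, assuming the chart exposes the Bitnami etcd subchart's settings under the etcd key (an odd count of 5 tolerates the loss of two members before quorum is broken):

etcd:
  replicaCount: 5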

shayktrust commented 4 months ago

Hi @kworkbee ,

I am using Cluster Autoscaler, not Karpenter. Additionally, since I need high availability, I have configured etcd to mount EFS via a StorageClass, which keeps the volumes accessible across all of my availability zones.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-apisix
provisioner: efs.csi.aws.com
allowVolumeExpansion: true
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-xxxxxxxx
  directoryPerms: "777"
  gidRangeStart: "1000"
  gidRangeEnd: "2000"
reclaimPolicy: Retain
mountOptions:
  - tls
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-apisix
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-apisix
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxxxxx
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: eks/nodeGroupSize
          operator: In
          values:
          - BIG
        - key: eks/efs
          operator: In
          values:
          - indeed
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-apisix-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-apisix
  resources:
    requests:
      storage: 10Gi
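
For context, a sketch of how a StorageClass like this could be wired into the chart's etcd volume claims, assuming the Bitnami etcd subchart's persistence settings are exposed under the etcd key:

etcd:
  persistence:
    enabled: true
    storageClass: efs-apisix
    size: 10Gi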

Using the following commands temporarily resolved my issue:

kubectl delete pvc -l app.kubernetes.io/name=etcd -n <namespace>
kubectl delete statefulset apisix-etcd -n <namespace>
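
After that, re-syncing the release recreates the StatefulSet and fresh PVCs, e.g. (assuming the release is named apisix and values.yaml holds the custom values):

helm upgrade --install apisix apisix/apisix -n <namespace> -f values.yaml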