Open shayktrust opened 4 months ago
I also deploy and operate APISIX on EKS, and I have seen the same behavior for a long time.
In my case, I traced it to etcd losing quorum while its pods were being rescheduled onto other nodes, which was triggered by Karpenter running in EKS.
I don't know Karpenter in detail, so I can't give you any advice on that side, but to reduce the chance of quorum loss during node rescheduling I increased the etcd replicaCount,
and the cluster has been stable since.
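As a rough illustration of why a larger replicaCount helps: etcd is built on Raft, so a cluster of N members needs a majority (N/2 rounded down, plus 1) to stay writable. The sketch below only does that arithmetic. The function names `quorum` and `tolerated` are made up for this example and are not part of etcd or the Helm chart.

```shell
#!/usr/bin/env bash
# quorum N: members that must be healthy for an N-member etcd cluster to keep quorum.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated N: members that can be unavailable at once without losing quorum.
tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the numbers for a few common replicaCount values.
for n in 1 3 5; do
  echo "replicaCount=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated "$n")"
done
```

With 3 replicas only one etcd pod can be down at a time, so a node drain that evicts two pods together breaks quorum; with 5 replicas two can be down. An even count adds no extra headroom (4 members still tolerate only one failure), which is why odd values are the usual choice.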
Hi @kworkbee ,
I am using Cluster Autoscaler, not Karpenter. Additionally, since I am implementing high availability, I have configured etcd to mount to EFS using a StorageClass. This setup ensures compatibility across all my availability zones.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-apisix
provisioner: efs.csi.aws.com
allowVolumeExpansion: true
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-xxxxxxxx
  directoryPerms: "777"
  gidRangeStart: "1000"
  gidRangeEnd: "2000"
reclaimPolicy: Retain
mountOptions:
  - tls
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-apisix
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-apisix
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxxxxx
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks/nodeGroupSize
              operator: In
              values:
                - BIG
            - key: eks/efs
              operator: In
              values:
                - indeed
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-apisix-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-apisix
  resources:
    requests:
      storage: 10Gi
Running the following commands temporarily resolved my issue:
kubectl delete pvc -l app.kubernetes.io/name=etcd -n <namespace>
kubectl delete statefulset apisix-etcd -n <namespace>
Name and Version
targetRevision: v2.7.0
EKS Version
1.29
What architecture are you using?
amd64
What steps will reproduce the bug?
Deploy the chart
Are you using any custom parameters or values?
Yes
What is the expected behavior?
apisix-etcd-0, apisix-etcd-1, and apisix-etcd-2 are all in the Running state
What do you see instead?
CrashLoopBackOff