berops / claudie

Cloud-agnostic managed Kubernetes
https://docs.claudie.io/
Apache License 2.0

Bug: Kuber fails to delete node. #1570

Open Despire opened 1 week ago

Despire commented 1 week ago

Kuber fails to delete a node because it is unable to ensure replication of the PVC.

2024-11-11T10:09:40Z INF Deleting nodes - control nodes [0], compute nodes[1] cluster=e2e-684iw35 module=kuber
e2e-684iw35     node/azr-auto-cmpt-aln243g-02 already cordoned
e2e-684iw35     volume.longhorn.io/pvc-07b9c964-95c6-469e-9569-b97aefa6176f patched
2024-11-11T10:09:41Z INF Waiting 10 seconds for new replicas to be scheduled if possible for node azr-auto-cmpt-aln243g-02 of cluster cluster=e2e-684iw35 module=kuber
2024-11-11T10:14:51Z ERR Error while deleting nodes error="error while making sure storage is replicated before deletion on cluster e2e-684iw35 : error while checking if all longhorn replicas for volume pvc-07b9c964-95c6-469e-9569-b97aefa6176f are running : error while checking the status of volume pvc-07b9c964-95c6-469e-9569-b97aefa6176f replication : context deadline exceeded" cluster=e2e-684iw35 module=kuber
kubectl get pvc --kubeconfig ./e2e -n claudie-6da472a-3109
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
data-minio-0   Bound    pvc-6e037dfe-bbe1-48f9-8459-4240c03e1fe8   1Gi        RWO            longhorn       <unset>                 110m
data-minio-1   Bound    pvc-07b9c964-95c6-469e-9569-b97aefa6176f   1Gi        RWO            longhorn       <unset>                 110m
data-minio-2   Bound    pvc-df215457-c4d6-41eb-819f-45a1bc3dcff1   1Gi        RWO            longhorn       <unset>                 110m
data-minio-3   Bound    pvc-46bd9d2b-8ac5-4174-9c9f-3e5c9896d440   1Gi        RWO            longhorn       <unset>                 110m
dynamo-pvc     Bound    pvc-e9306039-5da5-49bc-bb1e-0070edc85a52   1Gi        RWO            longhorn       <unset>                 110m
mongo-pvc      Bound    pvc-b61b07fa-3141-4aec-b7e5-9d5de3253c00   10Gi       RWO            longhorn       <unset>                 110m
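
The log above shows Kuber waiting for new replicas to be scheduled and then giving up with `context deadline exceeded` roughly five minutes later (10:09:41 to 10:14:51). Below is a minimal sketch of that kind of poll-until-deadline loop. It is not Kuber's actual code; `wait_until`, its parameters, and the injectable `sleep`/`clock` hooks are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll check() until it returns True, or raise once timeout_s has elapsed.

    check is a hypothetical callback, e.g. "are all Longhorn replicas of this
    volume running?". sleep and clock are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return
        if clock() >= deadline:
            # Analogous to the "context deadline exceeded" error in the log.
            raise TimeoutError("deadline exceeded while waiting for replicas")
        sleep(interval_s)
```

If the condition can never become true, for example because no node is left that can host the missing replica, a loop like this always runs into the deadline, which would match the error above.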

Steps To Reproduce

  1. Create a Cluster
    
    nodePools:
      dynamic:
      - name: gcp-ctrl-nodes
        providerSpec:
          name: gcp-1
          region: europe-west1
          zone: europe-west1-c
        count: 1
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50
      - name: gcp-cmpt-nodes
        providerSpec:
          name: gcp-2
          region: europe-west2
          zone: europe-west2-a
        count: 3
        serverType: e2-small
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50

kubernetes: clusters:

  2. Deploy Claudie on the cluster.
    
    kubectl get pods -A --kubeconfig ./test
    NAMESPACE         NAME                                                READY   STATUS      RESTARTS        AGE
    cert-manager      cert-manager-5bd57786d4-hpnqj                       1/1     Running     0               29m
    cert-manager      cert-manager-cainjector-57657d5754-8q6kc            1/1     Running     0               29m
    cert-manager      cert-manager-webhook-7d9f8748d4-nm89r               1/1     Running     0               29m
    claudie           ansibler-547d5d4477-g4mds                           1/1     Running     0               4m54s
    claudie           builder-74c6c4bc6d-grvdn                            1/1     Running     0               4m54s
    claudie           claudie-operator-779c58f857-6wsz6                   1/1     Running     0               4m53s
    claudie           create-table-job-k5m9m                              0/1     Completed   1               4m52s
    claudie           dynamodb-d764d9d9d-zmfnt                            1/1     Running     0               4m53s
    claudie           kube-eleven-6c449847c-h694k                         1/1     Running     1 (2m41s ago)   4m53s
    claudie           kuber-757f496d76-wjjcm                              1/1     Running     0               4m53s
    claudie           make-bucket-job-vdmwj                               0/1     Completed   0               4m52s
    claudie           manager-7c86c5dff6-wrwsv                            1/1     Running     2 (3m39s ago)   4m53s
    claudie           minio-0                                             1/1     Running     0               4m52s
    claudie           minio-1                                             1/1     Running     0               4m52s
    claudie           minio-2                                             1/1     Running     0               4m52s
    claudie           minio-3                                             1/1     Running     0               4m51s
    claudie           mongodb-b79df96d5-d4wjd                             1/1     Running     0               4m53s
    claudie           terraformer-5767b65455-nhchr                        1/1     Running     0               4m52s
    kube-system       cilium-bf2gk                                        1/1     Running     0               64m
    kube-system       cilium-hdvq9                                        1/1     Running     0               64m
    kube-system       cilium-operator-555d4c4d76-thnpv                    1/1     Running     0               65m
    kube-system       cilium-sfzsv                                        1/1     Running     0               63m
    kube-system       cilium-xsp6c                                        1/1     Running     0               63m
    kube-system       coredns-646d7c4457-mfw4s                            1/1     Running     0               65m
    kube-system       coredns-646d7c4457-nj7vb                            1/1     Running     0               65m
    kube-system       etcd-gcp-ctrl-nodes-m6kpu0e-01                      1/1     Running     0               67m
    kube-system       hubble-relay-d65ffb68f-qtrzs                        1/1     Running     0               65m
    kube-system       hubble-ui-86f6cd444-5w55q                           2/2     Running     0               65m
    kube-system       kube-apiserver-gcp-ctrl-nodes-m6kpu0e-01            1/1     Running     0               67m
    kube-system       kube-controller-manager-gcp-ctrl-nodes-m6kpu0e-01   1/1     Running     0               66m
    kube-system       kube-proxy-5vqg9                                    1/1     Running     0               65m
    kube-system       kube-proxy-cxkxj                                    1/1     Running     0               66m
    kube-system       kube-proxy-gqktw                                    1/1     Running     0               65m
    kube-system       kube-proxy-zc2mf                                    1/1     Running     0               65m
    kube-system       kube-scheduler-gcp-ctrl-nodes-m6kpu0e-01            1/1     Running     0               67m
    kube-system       metrics-server-b65cdc569-dt5mr                      1/1     Running     0               66m
    longhorn-system   csi-attacher-6c4495498-mtk7j                        1/1     Running     0               62m
    longhorn-system   csi-attacher-6c4495498-svb2k                        1/1     Running     0               62m
    longhorn-system   csi-attacher-6c4495498-xpwr9                        1/1     Running     0               62m
    longhorn-system   csi-provisioner-7d8cf4f58f-4rjxk                    1/1     Running     0               62m
    longhorn-system   csi-provisioner-7d8cf4f58f-fv5qn                    1/1     Running     0               62m
    longhorn-system   csi-provisioner-7d8cf4f58f-wvv74                    1/1     Running     0               62m
    longhorn-system   csi-resizer-77b968dfcd-d75cd                        1/1     Running     0               62m
    longhorn-system   csi-resizer-77b968dfcd-hxvxm                        1/1     Running     0               62m
    longhorn-system   csi-resizer-77b968dfcd-wmp4q                        1/1     Running     0               62m
    longhorn-system   csi-snapshotter-77699d78fb-7bp22                    1/1     Running     0               62m
    longhorn-system   csi-snapshotter-77699d78fb-jzrmm                    1/1     Running     0               62m
    longhorn-system   csi-snapshotter-77699d78fb-xszk8                    1/1     Running     0               62m
    longhorn-system   engine-image-ei-04c05bf8-4zvc2                      1/1     Running     0               63m
    longhorn-system   engine-image-ei-04c05bf8-hprbx                      1/1     Running     0               63m
    longhorn-system   engine-image-ei-04c05bf8-qgq6x                      1/1     Running     0               63m
    longhorn-system   instance-manager-68e54022cb63a38fa27117f456d8ec6b   1/1     Running     0               62m
    longhorn-system   instance-manager-be12e3a749d94492623a35549fdfbf8b   1/1     Running     0               62m
    longhorn-system   instance-manager-c602a5b2595317e1ceb34084760c4288   1/1     Running     0               63m
    longhorn-system   longhorn-csi-plugin-57d6n                           3/3     Running     0               62m
    longhorn-system   longhorn-csi-plugin-blxc5                           3/3     Running     0               62m
    longhorn-system   longhorn-csi-plugin-kw5gj                           3/3     Running     0               62m
    longhorn-system   longhorn-driver-deployer-55b7b5c7b4-wckhj           1/1     Running     2 (63m ago)     64m
    longhorn-system   longhorn-manager-6dwjw                              2/2     Running     0               64m
    longhorn-system   longhorn-manager-jx4q7                              2/2     Running     1 (63m ago)     64m
    longhorn-system   longhorn-manager-r6lbd                              2/2     Running     1 (63m ago)     63m
    longhorn-system   longhorn-ui-786c6ff-tcql6                           1/1     Running     0               64m
    longhorn-system   longhorn-ui-786c6ff-xg4b5                           1/1     Running     0               64m
![1](https://github.com/user-attachments/assets/6bc66e74-e7bb-4fee-8571-cf34940b2abf)

3. Decrease the count in the nodepool by 1.

    kubectl get pods -A --kubeconfig ./test
    NAMESPACE         NAME                                                READY   STATUS      RESTARTS        AGE
    cert-manager      cert-manager-5bd57786d4-hpnqj                       1/1     Running     0               41m
    cert-manager      cert-manager-cainjector-57657d5754-8q6kc            1/1     Running     0               41m
    cert-manager      cert-manager-webhook-7d9f8748d4-nm89r               1/1     Running     1 (4m18s ago)   41m
    claudie           ansibler-547d5d4477-dq7rt                           1/1     Running     1 (3m37s ago)   7m48s
    claudie           builder-74c6c4bc6d-4ps6r                            1/1     Running     0               7m47s
    claudie           claudie-operator-779c58f857-6wsz6                   1/1     Running     0               17m
    claudie           create-table-job-k5m9m                              0/1     Completed   1               17m
    claudie           dynamodb-d764d9d9d-zmfnt                            1/1     Running     0               17m
    claudie           kube-eleven-6c449847c-h694k                         1/1     Running     2 (4m3s ago)    17m
    claudie           kuber-757f496d76-2vggg                              1/1     Running     1 (3m35s ago)   7m47s
    claudie           make-bucket-job-vdmwj                               0/1     Completed   0               17m
    claudie           manager-7c86c5dff6-wrwsv                            1/1     Running     2 (16m ago)     17m
    claudie           minio-0                                             1/1     Running     0               7m43s
    claudie           minio-1                                             1/1     Running     0               17m
    claudie           minio-2                                             1/1     Running     0               17m
    claudie           minio-3                                             1/1     Running     0               7m43s
    claudie           mongodb-b79df96d5-d4wjd                             1/1     Running     0               17m
    claudie           terraformer-5767b65455-nhchr                        1/1     Running     0               17m
    kube-system       cilium-2b22k                                        1/1     Running     0               88s
    kube-system       cilium-5zvjp                                        1/1     Running     0               109s
    kube-system       cilium-ldbpq                                        1/1     Running     0               110s
    kube-system       cilium-operator-555d4c4d76-thnpv                    1/1     Running     0               78m
    kube-system       coredns-778c49ccf5-62pfm                            1/1     Running     0               2m28s
    kube-system       coredns-778c49ccf5-8xkjn                            1/1     Running     0               2m28s
    kube-system       etcd-gcp-ctrl-nodes-m6kpu0e-01                      1/1     Running     0               79m
    kube-system       hubble-generate-certs-mqsmm                         0/1     Completed   0               2m16s
    kube-system       hubble-relay-d65ffb68f-5xwdr                        1/1     Running     1 (4m12s ago)   7m47s
    kube-system       hubble-ui-86f6cd444-t75hp                           2/2     Running     0               7m47s
    kube-system       kube-apiserver-gcp-ctrl-nodes-m6kpu0e-01            1/1     Running     0               79m
    kube-system       kube-controller-manager-gcp-ctrl-nodes-m6kpu0e-01   1/1     Running     0               79m
    kube-system       kube-proxy-5vqg9                                    1/1     Running     0               78m
    kube-system       kube-proxy-cxkxj                                    1/1     Running     0               79m
    kube-system       kube-proxy-zc2mf                                    1/1     Running     0               78m
    kube-system       kube-scheduler-gcp-ctrl-nodes-m6kpu0e-01            1/1     Running     0               79m
    kube-system       metrics-server-b65cdc569-dt5mr                      1/1     Running     0               78m
    longhorn-system   csi-attacher-6c4495498-svb2k                        1/1     Running     1 (102s ago)    75m
    longhorn-system   csi-attacher-6c4495498-tcg5t                        1/1     Running     1 (7m11s ago)   7m47s
    longhorn-system   csi-attacher-6c4495498-xpwr9                        1/1     Running     0               75m
    longhorn-system   csi-provisioner-7d8cf4f58f-4rjxk                    1/1     Running     0               75m
    longhorn-system   csi-provisioner-7d8cf4f58f-g7gk2                    1/1     Running     0               7m47s
    longhorn-system   csi-provisioner-7d8cf4f58f-wvv74                    1/1     Running     0               75m
    longhorn-system   csi-resizer-77b968dfcd-d75cd                        1/1     Running     0               75m
    longhorn-system   csi-resizer-77b968dfcd-wmp4q                        1/1     Running     0               75m
    longhorn-system   csi-resizer-77b968dfcd-z8pzf                        1/1     Running     0               7m47s
    longhorn-system   csi-snapshotter-77699d78fb-7bp22                    1/1     Running     0               75m
    longhorn-system   csi-snapshotter-77699d78fb-jzrmm                    1/1     Running     0               75m
    longhorn-system   csi-snapshotter-77699d78fb-z9z8f                    1/1     Running     0               7m48s
    longhorn-system   engine-image-ei-04c05bf8-4zvc2                      1/1     Running     0               76m
    longhorn-system   engine-image-ei-04c05bf8-qgq6x                      1/1     Running     0               76m
    longhorn-system   instance-manager-be12e3a749d94492623a35549fdfbf8b   1/1     Running     0               75m
    longhorn-system   instance-manager-c602a5b2595317e1ceb34084760c4288   1/1     Running     0               75m
    longhorn-system   longhorn-csi-plugin-57d6n                           3/3     Running     0               75m
    longhorn-system   longhorn-csi-plugin-kw5gj                           3/3     Running     2 (3m49s ago)   75m
    longhorn-system   longhorn-driver-deployer-55b7b5c7b4-wckhj           1/1     Running     2 (76m ago)     76m
    longhorn-system   longhorn-manager-6dwjw                              2/2     Running     0               76m
    longhorn-system   longhorn-manager-jx4q7                              2/2     Running     1 (76m ago)     76m
    longhorn-system   longhorn-ui-786c6ff-tcql6                           1/1     Running     0               76m
    longhorn-system   longhorn-ui-786c6ff-xg4b5                           1/1     Running     0               76m

If we look at the volumes, they are now in a degraded state because a replica cannot be scheduled:

    Scheduling Failure
    Replica Scheduling Failure
    Error Message: precheck new replica failed

![2](https://github.com/user-attachments/assets/eb5f37f9-1e5c-49cc-871b-6361c42df4f3)

4. Decrease the node count by 1 again.
Now some of the volumes are completely detached, and some are degraded because their replicas cannot be scheduled.

![3](https://github.com/user-attachments/assets/291c7597-6f64-451f-a76a-3e6acc629210)

    kubectl get pods -A --kubeconfig ./test
    NAMESPACE         NAME                                                READY   STATUS      RESTARTS        AGE
    cert-manager      cert-manager-5bd57786d4-hpnqj                       1/1     Running     0               51m
    cert-manager      cert-manager-cainjector-57657d5754-jqzxr            1/1     Running     0               5m29s
    cert-manager      cert-manager-webhook-7d9f8748d4-x5xl2               1/1     Running     0               5m29s
    claudie           ansibler-547d5d4477-t28xr                           0/1     Pending     0               5m29s
    claudie           builder-74c6c4bc6d-4rsjr                            1/1     Running     0               5m29s
    claudie           claudie-operator-779c58f857-9mmtz                   0/1     Pending     0               5m29s
    claudie           dynamodb-d764d9d9d-njgkv                            0/1     Pending     0               5m28s
    claudie           kube-eleven-6c449847c-7stz6                         0/1     Pending     0               5m28s
    claudie           kuber-757f496d76-rxbqv                              0/1     Pending     0               5m27s
    claudie           manager-7c86c5dff6-wrwsv                            0/1     Running     2 (26m ago)     27m
    claudie           minio-0                                             0/1     Pending     0               5m13s
    claudie           minio-1                                             0/1     Pending     0               5m13s
    claudie           minio-2                                             1/1     Running     0               27m
    claudie           minio-3                                             1/1     Running     0               17m
    claudie           mongodb-b79df96d5-p7nnb                             0/1     Pending     0               5m30s
    claudie           terraformer-5767b65455-nhchr                        0/1     Running     0               27m
    kube-system       cilium-jpskg                                        1/1     Running     0               58s
    kube-system       cilium-kq9h4                                        1/1     Running     0               57s
    kube-system       cilium-operator-555d4c4d76-thnpv                    1/1     Running     0               88m
    kube-system       coredns-76c4f7868f-knqtc                            1/1     Running     0               92s
    kube-system       coredns-76c4f7868f-w7fp9                            1/1     Running     0               92s
    kube-system       etcd-gcp-ctrl-nodes-m6kpu0e-01                      1/1     Running     0               90m
    kube-system       hubble-relay-d65ffb68f-xhpvp                        1/1     Running     0               5m29s
    kube-system       hubble-ui-86f6cd444-96pl7                           2/2     Running     0               5m27s
    kube-system       kube-apiserver-gcp-ctrl-nodes-m6kpu0e-01            1/1     Running     0               90m
    kube-system       kube-controller-manager-gcp-ctrl-nodes-m6kpu0e-01   1/1     Running     0               89m
    kube-system       kube-proxy-5vqg9                                    1/1     Running     0               88m
    kube-system       kube-proxy-cxkxj                                    1/1     Running     0               89m
    kube-system       kube-scheduler-gcp-ctrl-nodes-m6kpu0e-01            1/1     Running     0               90m
    kube-system       metrics-server-b65cdc569-dt5mr                      1/1     Running     0               88m
    longhorn-system   csi-attacher-6c4495498-xcqdz                        1/1     Running     0               5m27s
    longhorn-system   csi-attacher-6c4495498-xpwr9                        1/1     Running     0               85m
    longhorn-system   csi-attacher-6c4495498-zpcrk                        1/1     Running     0               5m26s
    longhorn-system   csi-provisioner-7d8cf4f58f-d6l55                    1/1     Running     0               5m27s
    longhorn-system   csi-provisioner-7d8cf4f58f-kkz9v                    1/1     Running     0               5m27s
    longhorn-system   csi-provisioner-7d8cf4f58f-wvv74                    1/1     Running     0               85m
    longhorn-system   csi-resizer-77b968dfcd-4trb2                        1/1     Running     0               5m30s
    longhorn-system   csi-resizer-77b968dfcd-t6g2t                        1/1     Running     0               5m30s
    longhorn-system   csi-resizer-77b968dfcd-wmp4q                        1/1     Running     0               85m
    longhorn-system   csi-snapshotter-77699d78fb-4njkk                    1/1     Running     0               5m30s
    longhorn-system   csi-snapshotter-77699d78fb-4zzvm                    1/1     Running     0               5m30s
    longhorn-system   csi-snapshotter-77699d78fb-jzrmm                    1/1     Running     0               85m
    longhorn-system   engine-image-ei-04c05bf8-qgq6x                      1/1     Running     0               86m
    longhorn-system   instance-manager-be12e3a749d94492623a35549fdfbf8b   1/1     Running     0               85m
    longhorn-system   longhorn-csi-plugin-57d6n                           3/3     Running     0               85m
    longhorn-system   longhorn-driver-deployer-55b7b5c7b4-wckhj           1/1     Running     2 (86m ago)     86m
    longhorn-system   longhorn-manager-jx4q7                              2/2     Running     1 (86m ago)     86m
    longhorn-system   longhorn-ui-786c6ff-jqkqn                           1/1     Running     0               5m30s
    longhorn-system   longhorn-ui-786c6ff-tcql6                           1/1     Running     0               86m


These steps simulate the scale-down that happened in the e2e cluster. If we were to repeat this many times, i.e. add nodes and then delete nodes, at some point we would hit the problem described above. I've managed to reproduce it only once.

The issue is that we deploy a storage class with numberOfReplicas: 3

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: {{ .StorageClassName }}
      labels:
        claudie.io/storage-class: {{ .StorageClassName }}
    provisioner: driver.longhorn.io
    parameters:
      fromBackup: ""
      nodeSelector: {{ .ZoneName }}
      fsType: xfs
      numberOfReplicas: "3"
      staleReplicaTimeout: "28800"
    reclaimPolicy: Delete
    allowVolumeExpansion: true
    volumeBindingMode: Immediate



This can become an issue, especially for autoscaled clusters that allow nodepools with fewer than 3 nodes.
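
To make the arithmetic concrete, here is a rough model of why `numberOfReplicas: "3"` cannot be satisfied on fewer than 3 worker nodes. It assumes Longhorn places at most one replica of a volume per node (its default when replica node-level soft anti-affinity is disabled). The function names are hypothetical. The model covers only replica counts, not the detached volumes seen in step 4.

```python
def schedulable_replicas(desired: int, worker_nodes: int) -> int:
    """Replicas that can be placed when each node hosts at most one replica
    of a given volume (assumed Longhorn default; see lead-in)."""
    return min(desired, max(worker_nodes, 0))


def volume_robustness(desired: int, worker_nodes: int) -> str:
    """Simplified robustness label based only on how many replicas fit."""
    placed = schedulable_replicas(desired, worker_nodes)
    if placed == 0:
        return "faulted"
    if placed < desired:
        return "degraded"
    return "healthy"


if __name__ == "__main__":
    # Worker counts from the reproduction steps: 3 nodes, then 2, then 1.
    for nodes in (3, 2, 1):
        print(nodes, "worker nodes:", volume_robustness(3, nodes))
```

Under this model, any scale-down below 3 workers leaves a replica that can never be rescheduled, so a wait for full replication before node deletion cannot succeed.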
Despire commented 3 days ago

Explore a volumeBindingMode other than Immediate, and add validation that prevents having fewer than 3 worker nodes.
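
A minimal sketch of what such a validation could look like, assuming the check compares the lowest possible worker count with the storage class replica count. The `NodePool` type, its fields (`count`, `autoscaler_min`, `is_control`), and the function names are hypothetical and do not reflect Claudie's actual manifest types. The sketch also ignores the per-zone `nodeSelector` of the storage class and counts all workers together.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodePool:
    # Hypothetical, simplified view of a nodepool for illustration only.
    name: str
    count: int = 0
    autoscaler_min: Optional[int] = None  # set only for autoscaled pools
    is_control: bool = False


def min_worker_nodes(pools: List[NodePool]) -> int:
    """Lowest worker count the cluster can reach: static pools contribute
    their count, autoscaled pools contribute their minimum."""
    total = 0
    for pool in pools:
        if pool.is_control:
            continue
        total += pool.count if pool.autoscaler_min is None else pool.autoscaler_min
    return total


def validate_worker_count(pools: List[NodePool], replicas: int = 3) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    workers = min_worker_nodes(pools)
    if workers < replicas:
        return [
            f"cluster can shrink to {workers} worker node(s), "
            f"but the storage class requires {replicas} replicas"
        ]
    return []
```

With the nodepools from the reproduction steps (1 control node, 3 compute nodes) the check passes; lowering the compute count to 2 would be rejected before any node is deleted.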