edgexr / edge-cloud-platform

Delete Azure-based AppInst pod state check fails #393

Open gainsley opened 15 hours ago

gainsley commented 15 hours ago

Deleting an AppInst on an Azure-based cloudlet reports a failure even though the AppInst is removed correctly from the cluster. The problem appears to be in the pod state check that runs while we wait for the pods to be removed.

The workaround is to delete the AppInst again; the second attempt correctly detects that the pods are no longer present and succeeds.

Here are some of the error messages and logs:

2024-11-26T06:36:50.322Z        INFO    4318ce34abc0fba2        k8smgmt/appinst.go:101  pod is running  {"podName": "us-jon-test-k8s10jondevorg10-deployment-5fdbd5b967-vfm9b"}
2024-11-26T06:36:51.322Z        INFO    4318ce34abc0fba2        k8smgmt/appinst.go:64   check pods status       {"namespace": "default", "selector": "mex-app=us-jon-test-k8s10jondevorg10-deployment"}
2024-11-26T06:36:52.128Z        INFO    4318ce34abc0fba2        crmutil/controller-data.go:577  can't delete app inst   {"error": "Delete App Inst failed: Run container failed, pod state: Failed - Name:
 rpc error: code = Unknown desc = Delete App Inst failed: DELETE
 https://console.cloud.edgexr.org/operatorplatform/federation/v1/63719bab-ce31-4e22-90d6-7d10bac92352/application/lcm/app/us-jon-test-k8s10jondevorg/instance/fedtest/zone/us-azure-westus
 failed: Delete App Inst failed: Run container failed, pod state: Failed - No
 resources found in default namespace.
 rpc error: code = Unknown desc = Delete App Inst failed: DELETE
 https://console.cloud.edgexr.org/operatorplatform/federation/v1/269df0a6-6287-4bbc-8d74-26cae09ea268/application/lcm/app/us-jon-test-k8s10jondevorg/instance/fedtestinst/zone/us-azure-westus
 failed: Delete App Inst failed: Run container failed, pod state: Failed -
 Name:                      us-jon-test-k8s10jondevorg10-deployment-5fdbd5b967-vfm9b

 Namespace:                 default

 Priority:                  0

 Service Account:           default

 Node:                      aks-agentpool-64407650-vmss000000/10.224.0.4

 Start Time:                Tue, 26 Nov 2024 06:28:23 +0000

 Labels:                    mex-app=us-jon-test-k8s10jondevorg10-deployment
                            mexAppInstName=fedtestinst
                            mexAppInstOrg=hostfed
                            mexDeployGen=kubernetes-basic
                            pod-template-hash=5fdbd5b967
                            run=us-jon-test-k8s10jondevorg1.0
 Annotations:               <none>

 Status:                    Terminating (lasts 1s)

 Termination Grace Period:  30s

 IP:                        10.244.0.11

 IPs:
   IP:           10.244.0.11
 Controlled By:  ReplicaSet/us-jon-test-k8s10jondevorg10-deployment-5fdbd5b967

 Containers:
   us-jon-test-k8s10jondevorg10:
     Container ID:  containerd://d9d09fd3fa61ee179496679a522b9f8df6322c3d694cb37fe9e1e0c7924d0f0f
     Image:         docker.io/hashicorp/http-echo:0.2.3
     Image ID:      docker.io/hashicorp/http-echo@sha256:ba27d460cd1f22a1a4331bdf74f4fccbc025552357e8a3249c40ae216275de96
     Port:          5678/TCP
     Host Port:     0/TCP
     Args:
       -text="hello to edgexr"
     State:          Terminated
       Reason:       Error
       Exit Code:    137
       Started:      Tue, 26 Nov 2024 06:28:25 +0000
       Finished:     Tue, 26 Nov 2024 06:36:51 +0000
     Ready:          False
     Restart Count:  0
     Environment:    <none>
     Mounts:
       /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nrlgr (ro)
 Conditions:
   Type                        Status
   PodReadyToStartContainers   False 
   Initialized                 True 
   Ready                       False 
   ContainersReady             False 
   PodScheduled                True 
 Volumes:
   kube-api-access-nrlgr:
     Type:                    Projected (a volume that contains injected data from multiple sources)
     TokenExpirationSeconds:  3607
     ConfigMapName:           kube-root-ca.crt
     ConfigMapOptional:       <nil>
     DownwardAPI:             true
 QoS Class:                   BestEffort

 Node-Selectors:              <none>

 Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
 Events:
   Type     Reason                           Age                  From               Message
   ----     ------                           ----                 ----               -------
   Normal   Scheduled                        8m30s                default-scheduler  Successfully assigned default/us-jon-test-k8s10jondevorg10-deployment-5fdbd5b967-vfm9b to aks-agentpool-64407650-vmss000000
   Normal   Pulling                          8m29s                kubelet            Pulling image "docker.io/hashicorp/http-echo:0.2.3"
   Normal   Pulled                           8m27s                kubelet            Successfully pulled image "docker.io/hashicorp/http-echo:0.2.3" in 1.813s (1.813s including waiting)
   Normal   Created                          8m27s                kubelet            Created container us-jon-test-k8s10jondevorg10
   Normal   Started                          8m27s                kubelet            Started container us-jon-test-k8s10jondevorg10
   Warning  FailedToRetrieveImagePullSecret  38s (x9 over 8m29s)  kubelet            Unable to retrieve some image pull secrets (docker.io); attempting to pull the image may not succeed.
   Normal   Killing                          31s                  kubelet            Stopping container us-jon-test-k8s10jondevorg10
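
One possible reading of the logs above: the wait loop treats a pod in the `Failed` phase as an error. That is reasonable while waiting for an AppInst to start, but not while waiting for it to be deleted. In the describe output the pod is already `Terminating` and its container exited with code 137 (128 + 9, i.e. SIGKILL), so a `Failed` phase is expected at that point. The sketch below illustrates separating the two cases. The names `waitMode`, `evalPodPhases`, and `errPodFailed` are hypothetical and are not taken from `k8smgmt/appinst.go`; the phase strings are the standard Kubernetes pod phases.

```go
package main

import (
	"errors"
	"fmt"
)

// waitMode is a hypothetical selector for what the caller is waiting for.
type waitMode int

const (
	waitForRunning waitMode = iota // create path: wait until all pods are Running
	waitForDeleted                 // delete path: wait until no pods remain
)

// errPodFailed is a hypothetical error value used only in this sketch.
var errPodFailed = errors.New("pod state: Failed")

// evalPodPhases takes the phases of the pods that currently match the
// AppInst selector and reports whether the wait is done.
//
// In delete mode no phase is treated as an error: a pod that is Failed,
// Running, or Pending is simply "not gone yet", and an empty list means
// the delete has completed.
func evalPodPhases(phases []string, mode waitMode) (bool, error) {
	if mode == waitForDeleted {
		return len(phases) == 0, nil
	}

	// Create mode: any Failed pod is an error; otherwise we are done
	// only when at least one pod exists and all of them are Running.
	allRunning := len(phases) > 0
	for _, p := range phases {
		if p == "Failed" {
			return false, errPodFailed
		}
		if p != "Running" {
			allRunning = false
		}
	}
	return allRunning, nil
}

func main() {
	// Pod was killed during delete and is still listed: keep waiting.
	done, err := evalPodPhases([]string{"Failed"}, waitForDeleted)
	fmt.Println(done, err)

	// Nothing matches the selector any more: delete is complete.
	done, err = evalPodPhases(nil, waitForDeleted)
	fmt.Println(done, err)
}
```

With a check along these lines, the first delete attempt would keep polling until the pod list is empty instead of failing, which matches what the second delete attempt already does in the workaround.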