Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Virtual nodes: CrashLoopBackOff for aci-connector-linux pod #3344

Open joelnotified opened 1 year ago

joelnotified commented 1 year ago

Describe the bug: We have enabled the virtual node plugin to run some Kubernetes jobs in ACI. The jobs are triggered and run just fine (from what I can tell), but the aci-connector-linux pod keeps restarting over and over. As a result, succeeded jobs are never cleaned up from ACI.

It seems to be exiting with exit code 2. See below for logs.

To Reproduce: Not sure. We installed the plugin via the Azure CLI.

Expected behavior: The aci-connector-linux pod doesn't crash or restart, and it cleans up succeeded jobs.

Environment (please complete the following information):

Additional context: Logs from aci-connector-linux:

WARNING: Package "github.com/golang/protobuf/protoc-gen-go/generator" is deprecated.
    A future release of golang/protobuf will delete this package,
    which has long been excluded from the compatibility promise.

time="2022-11-17T09:14:53Z" level=info msg="Using user identity for Authentication"
time="2022-11-17T09:15:10Z" level=info msg=Initialized node=virtual-node-aci-linux operatingSystem=Linux provider=azure watchedNamespace=
time="2022-11-17T09:15:10Z" level=info msg="Pod cache in-sync" node=virtual-node-aci-linux operatingSystem=Linux provider=azure watchedNamespace=
time="2022-11-17T09:15:13Z" level=info msg="starting workers" node=virtual-node-aci-linux operatingSystem=Linux provider=azure watchedNamespace=
time="2022-11-17T09:15:13Z" level=info msg="started workers" node=virtual-node-aci-linux operatingSystem=Linux provider=azure watchedNamespace=
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x12e7971]

goroutine 241 [running]:
github.com/virtual-kubelet/azure-aci/pkg/provider.podStatusFromContainerGroup(0xc0011b8c30)
    /workspace/pkg/provider/aci.go:1810 +0xb11
github.com/virtual-kubelet/azure-aci/pkg/provider.containerGroupToPod(0xc0011b8c30)
    /workspace/pkg/provider/aci.go:1763 +0xbde
github.com/virtual-kubelet/azure-aci/pkg/provider.(*ACIProvider).GetPod(0xc0003a8600, {0x1ad62e8?, 0xc00039db30?}, {0xc0006d1700, 0xf}, {0xc0006d16c0, 0xe})
    /workspace/pkg/provider/aci.go:757 +0x2bc
github.com/virtual-kubelet/virtual-kubelet/node.(*PodController).createOrUpdatePod(0xc000a17ad0, {0x1ad62e8?, 0xc00039d320?}, 0xc000467400)
    /go/pkg/mod/github.com/virtual-kubelet/virtual-kubelet@v1.6.0/node/pod.go:86 +0x30f
github.com/virtual-kubelet/virtual-kubelet/node.(*PodController).syncPodInProvider(0xc000a17ad0, {0x1ad62e8?, 0xc00039ccc0?}, 0xc000467400, {0xc00062a7a0, 0x1e})
    /go/pkg/mod/github.com/virtual-kubelet/virtual-kubelet@v1.6.0/node/podcontroller.go:568 +0x4e5
github.com/virtual-kubelet/virtual-kubelet/node.(*PodController).syncPodFromKubernetesHandler(0xc000a17ad0, {0x1ad62e8?, 0xc00039c840?}, {0xc00062a7a0, 0x1e})
    /go/pkg/mod/github.com/virtual-kubelet/virtual-kubelet@v1.6.0/node/podcontroller.go:504 +0x8b2
github.com/virtual-kubelet/virtual-kubelet/internal/queue.(*Queue).handleQueueItemObject(0xc000824180, {0x1ad62e8?, 0xc00039c1b0?}, 0xc000a39100)
    /go/pkg/mod/github.com/virtual-kubelet/virtual-kubelet@v1.6.0/internal/queue/queue.go:436 +0x3eb
github.com/virtual-kubelet/virtual-kubelet/internal/queue.(*Queue).handleQueueItem(0x1ad6240?, {0x1ad62e8?, 0xc000817c20?})
    /go/pkg/mod/github.com/virtual-kubelet/virtual-kubelet@v1.6.0/internal/queue/queue.go:404 +0x1b7
github.com/virtual-kubelet/virtual-kubelet/internal/queue.(*Queue).worker(0xc000824180, {0x1ad6240, 0xc0007f6440}, 0x0?)
    /go/pkg/mod/github.com/virtual-kubelet/virtual-kubelet@v1.6.0/internal/queue/queue.go:332 +0x17d
github.com/virtual-kubelet/virtual-kubelet/internal/queue.(*Queue).Run.func2({0x1ad6240?, 0xc0007f6440?})
    /go/pkg/mod/github.com/virtual-kubelet/virtual-kubelet@v1.6.0/internal/queue/queue.go:320 +0x34
k8s.io/apimachinery/pkg/util/wait.(*Group).StartWithContext.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.19.10/pkg/util/wait/wait.go:64 +0x25
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
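
For readers following the trace: the panic is raised inside podStatusFromContainerGroup, called from GetPod, and an unrecovered Go panic terminates the process with exit status 2, which matches the exit code reported by kubectl. The sketch below is only an illustration of that failure mode. The types containerGroup and containerState and the helpers unsafeStatus, safeStatus and recoverToError are hypothetical and are not taken from the azure-aci source. It shows how dereferencing a nil optional field panics, and how a nil check (or a recover wrapper around one work item) would return an error instead of crashing the whole process.

```go
package main

import (
	"errors"
	"fmt"
)

// containerState and containerGroup are hypothetical stand-ins for an API
// response in which optional fields can arrive as nil pointers.
type containerState struct {
	State string
}

type containerGroup struct {
	Name         string
	CurrentState *containerState // assumed to be nil in some responses
}

// unsafeStatus dereferences CurrentState without checking it. With a nil
// CurrentState this panics with a nil pointer dereference.
func unsafeStatus(cg *containerGroup) string {
	return cg.CurrentState.State
}

// safeStatus returns an error instead of panicking when data is missing.
func safeStatus(cg *containerGroup) (string, error) {
	if cg == nil || cg.CurrentState == nil {
		return "", errors.New("container group has no current state")
	}
	return cg.CurrentState.State, nil
}

// recoverToError runs fn and converts a panic into an error, so that one
// bad item does not take down the whole process.
func recoverToError(fn func() string) (out string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return fn(), nil
}

func main() {
	cg := &containerGroup{Name: "job-1"} // CurrentState left nil on purpose

	if _, err := safeStatus(cg); err != nil {
		fmt.Println("safeStatus:", err)
	}
	if _, err := recoverToError(func() string { return unsafeStatus(cg) }); err != nil {
		fmt.Println("recoverToError:", err)
	}
}
```

Whether the actual fix in the provider took this form is not stated in this thread; the point is only that a missing nil check on one container group is enough to produce the restart loop shown below.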

Result from kubectl describe pod -n kube-system aci-connector-linux:

Name:                 aci-connector-linux-fcd878b5d-qqxnx
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      aci-connector-linux
Node:                 aks-system-37915096-vmss000002/10.20.3.194
Start Time:           Tue, 08 Nov 2022 11:15:46 +0100
Labels:               app=aci-connector-linux
                      kubernetes.azure.com/managedby=aks
                      pod-template-hash=fcd878b5d
Annotations:          checksum/cloud-provider-config: dfec74ae556b1607c2be7e6eb6d4a7163e0f354d193a5a02955302de65569380
                      cluster-autoscaler.kubernetes.io/safe-to-evict: true
Status:               Running
IP:                   10.20.3.213
IPs:
  IP:           10.20.3.213
Controlled By:  ReplicaSet/aci-connector-linux-fcd878b5d
Containers:
  aci-connector-linux:
    Container ID:  containerd://048c3ef8b8dd08ebc78003ad798ec51cb4ed78d9b44984d2bd176df81c51398d
    Image:         mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.6
    Image ID:      mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet@sha256:9a8b1b6addf18d54942bd8898eff463d9f992a4784e4e7ca9e45b9ea68da3d40
    Port:          <none>
    Host Port:     <none>
    Command:
      virtual-kubelet
    Args:
      --provider
      azure
      --nodename
      virtual-node-aci-linux
      --os
      Linux
      --authentication-token-webhook=true
      --no-verify-clients=false
      --client-verify-ca=/etc/kubernetes/certs/ca.crt
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 17 Nov 2022 09:59:35 +0100
      Finished:     Thu, 17 Nov 2022 09:59:35 +0100
    Ready:          False
    Restart Count:  65
    Environment:
      ACI_EXTRA_USER_AGENT:                add-on/aks
      VKUBELET_POD_IP:                      (v1:status.podIP)
      AZURE_CLIENT_SECRET:                 <set to the key 'clientSecret' in secret 'aci-connector-linux'>  Optional: false
      CLUSTER_CIDR:                        10.20.0.0/20
      USE_VK_VERSION_2:                    true
      AKS_CREDENTIAL_LOCATION:             /etc/acs/azure.json
      MASTER_URI:                          <redacted>
      KUBE_DNS_IP:                         10.0.0.10
      LOG_ANALYTICS_ID:                    <set to the key 'WSID' in secret 'ama-logs-secret'>  Optional: false
      CLUSTER_RESOURCE_ID:                 <redacted>
      ENABLE_REAL_TIME_METRICS:            true
      KUBERNETES_PORT_443_TCP_ADDR:        <redacted>
      KUBERNETES_PORT_443_TCP:             <redacted>
      APISERVER_CERT_LOCATION:             /etc/virtual-kubelet/cert.pem
      KUBERNETES_SERVICE_HOST:             <redacted>
      APISERVER_KEY_LOCATION:              /etc/virtual-kubelet/key.pem
      ACI_SUBNET_NAME:                     snet-<redacted>-virtual-nodes-staging
      VIRTUALNODE_USER_IDENTITY_CLIENTID:  782ea0ca-930d-44bf-9be4-338980bb377d
      LOG_ANALYTICS_KEY:                   <set to the key 'KEY' in secret 'ama-logs-secret'>  Optional: false
      KUBERNETES_PORT:                     <redacted>
      KUBELET_PORT:                        10250
    Mounts:
      /etc/acs/azure.json from aks-credential (rw)
      /etc/kubernetes/certs from certificates (ro)
      /etc/virtual-kubelet from credentials (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4vkfr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  certificates:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/certs
    HostPathType:
  credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aci-connector-linux
    Optional:    false
  aks-credential:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/azure.json
    HostPathType:  File
  kube-api-access-4vkfr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Normal   Created  15m (x62 over 8d)    kubelet  Created container aci-connector-linux
  Normal   Started  15m (x62 over 8d)    kubelet  Started container aci-connector-linux
  Normal   Pulled   15m (x61 over 8d)    kubelet  Container image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.6" already present on machine
  Warning  BackOff  86s (x1292 over 8d)  kubelet  Back-off restarting failed container

ayelencasamassa commented 1 year ago

Same issue here. Can't roll back to 1.4.5, and 1.4.7 is not working either.

joelnotified commented 1 year ago

Should I file this in https://github.com/virtual-kubelet/virtual-kubelet instead, or is this AKS related?

ayelencasamassa commented 1 year ago

I have a ticket here and one with Microsoft support, and nobody is giving me a solution. They only want me to downgrade the case severity and insist on using "Container Apps" (a solution that has been on the market for no more than 6 months).

joelnotified commented 1 year ago

Yeah, I have a support ticket open with Azure as well. So far I haven't received any sensible response to that one.

I wouldn't mind using Azure Container Apps if they supported the concept of Jobs, which they don't yet: https://github.com/microsoft/azure-container-apps/issues/24

ayelencasamassa commented 1 year ago

I won't use a technology that was in preview until October. Virtual nodes is more "mature", and look how it goes :/

Szbuli commented 1 year ago

Same issue for us

ghost commented 1 year ago

Action required from @Azure/aks-pm

joelnotified commented 1 year ago

I was in contact with Azure support and they confirmed it as a bug in Virtual Kubelet 1.4.6. They (Azure) decided to roll back to the last known working version (1.4.5), and it seems to be working fine for us now.