Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

AKS starting a stopped cluster fails #2171

Open FunkyFabe opened 3 years ago

FunkyFabe commented 3 years ago

What happened: Starting a stopped AKS cluster with az aks start -g test-bench-exam-scheduler -n aks-test-bench fails. (az aks stop -g test-bench-exam-scheduler -n aks-test-bench was used to stop the cluster before.)

What you expected to happen: The cluster should start properly.

How to reproduce it (as minimally and precisely as possible): Create new AKS cluster. Then stop it and start it again with the CLI commands.
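
For reference, a minimal repro sketch with the Azure CLI (resource group, cluster name, and VM size are placeholders mirroring the report; any small cluster should do):

```bash
# Create a small test cluster (names/sizes are placeholders mirroring the report)
az group create --name test-bench-exam-scheduler --location switzerlandnorth
az aks create \
  --resource-group test-bench-exam-scheduler \
  --name aks-test-bench \
  --node-count 1 \
  --node-vm-size Standard_F16s_v2 \
  --generate-ssh-keys

# Stop the cluster, then start it again with the CLI
az aks stop  --resource-group test-bench-exam-scheduler --name aks-test-bench
az aks start --resource-group test-bench-exam-scheduler --name aks-test-bench
# The start fails with the VMSSAgentPoolReconciler / LookupNoSuchHost error shown below
```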

Anything else we need to know?: Starting the cluster fails with the following error:

Deployment failed. Correlation ID: b0c3a88f-8afb-48f3-9fc5-2ee696afbd1f. Agents are unable to resolve Kubernetes API server name. It's likely custom DNS server is not correctly configured, please see https://aka.ms/aks/private-cluster#hub-and-spoke-with-custom-dns for more information. Details: VMSSAgentPoolReconciler retry failed: {
  "Category": "InternalError",
  "SubCode": "LookupNoSuchHost",
  "Dependency": "Microsoft.Compute/VirtualMachineScaleSet",
  "OriginError": {
   "code": "VMExtensionProvisioningError",
   "message": "VM has reported a failure when processing extension 'vmssCSE'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=52\n[stdout]\ntion-token-webhook=true--authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cgroups-per-qos=true --client-ca-file=/etc/kubernetes/certs/ca.crt --cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --dynamic-config-dir=/var/lib/kubelet --enforce-node-allocatable=pods --event-qps=0 --eviction-hard=memory.available\u003c750Mi,nodefs.available\u003c10%!,(MISSING)nodefs.inodesFree\u003c5%!f(MISSING)eature-gates=RotateKubeletServerCertificate=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --image-pull-progress-deadline=30m --keep-terminated-pod-volumes=false --kube-reserved=cpu=260m,memory=3645Mi --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=110 --network-plugin=kubenet --node-status-update-frequency=10s --non-masquerade-cidr=0.0.0.0/0 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:1.3.1 --pod-manifest-path=/etc/kubernetes/manifests --pod-max-pids=-1 --protect-kernel-defaults=true --read-only-port=0 --resolv-conf=/run/systemd/resolve/resolv.conf --rotate-certificates=false --streaming-connection-idle-timeout=4h --tls-cert-file=/etc/kubernetes/certs/kubeletserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/etc/kubernetes/certs/kubeletserver.key\\n\\nMar 10 08:16:19 aks-agentpool-26511950-vmss000001 kubelet[3714]: E0310 08:16:19.465950 3714 kubelet.go:2186] node \\\"aks-agentpool-26511950-vmss000001\\\" not found\\nMar 10 08:16:19 aks-agentpool-26511950-vmss000001 kubelet[3714]: E0310 08:16:19.566313 3714 kubelet.go:2186] node \\\"aks-agentpool-26511950-vmss000001\\\" not found\\nMar 10 08:16:19 aks-agentpool-26511950-vmss000001 kubelet[3714]: E0310 08:16:19.588877 3714 reflector.go:127] k8s.io/kubernetes/pkg/kubelet/kubelet.go:438: Failed to watch *v1.Node: failed to list *v1.Node: Get \\\"https://aks-test-bench-dns-40b2217c.hcp.switzerlandnorth.azmk8s.io:443/api/v1/nodes?fieldSelector=metadata.name%!D(MISSING)aks-agentpool-26511950-vmss000001\u0026limit=500\u0026resourceVersion=0\\\": dial tcp: lookup aks-test-bench-dns-40b2217c.hcp.switzerlandnorth.azmk8s.io: no such host\\nMar 10 08:16:19 aks-agentpool-26511950-vmss000001 kubelet[3714]: E0310 08:16:19.666514 3714 kubelet.go:2186] node \\\"aks-agentpool-26511950-vmss000001\\\" not found\\nMar 10 08:16:19 aks-agentpool-26511950-vmss000001 kubelet[3714]: E0310 08:16:19.766801 3714 kubelet.go:2186] node \\\"aks-agentpool-26511950-vmss000001\\\" not found\\nMar 10 08:16:19 aks-agentpool-26511950-vmss000001 kubelet[3714]: E0310 08:16:19.822991 3714 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.RuntimeClass: failed to list *v1beta1.RuntimeClass: Get \\\"https://aks-test-bench-dns-40b2217c.hcp.switzerlandnorth.azmk8s.io:443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500\u0026resourceVersion=0\\\": dial tcp: lookup aks-test-bench-dns-40b2217c.hcp.switzerlandnorth.azmk8s.io: no such host\\nMar 10 08:16:19 aks-agentpool-26511950-vmss000001 kubelet[3714]: E0310 08:16:19.867079 3714 
kubelet.go:2186] node \\\"aks-agentpool-26511950-vmss000001\\\" not found\\nMar 10 08:16:19 aks-agentpool-26511950-vmss000001 kubelet[3714]: E0310 08:16:19.967392 3714 kubelet.go:2186] node \\\"aks-agentpool-26511950-vmss000001\\\" not found\\nMar 10 08:16:20 aks-agentpool-26511950-vmss000001 kubelet[3714]: I0310 08:16:20.054943 3714 csi_plugin.go:994] Failed to contact API server when waiting for CSINode publishing: Get \\\"https://aks-test-bench-dns-40b2217c.hcp.switzerlandnorth.azmk8s.io:443/apis/storage.k8s.io/v1/csinodes/aks-agentpool-26511950-vmss000001\\\": dial tcp: lookup aks-test-bench-dns-40b2217c.hcp.switzerlandnorth.azmk8s.io: no such host\\nMar 10 08:16:20 aks-agentpool-26511950-vmss000001 kubelet[3714]: E0310 08:16:20.067576 3714 kubelet.go:2186] node \\\"aks-agentpool-26511950-vmss000001\\\" not found\", \"Error\": \"\" }\n\n[stderr]\n\"\r\n\r\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot ",
   "target": null,
   "details": null,
   "innererror": null,
   "additionalInfo": null
  }
 }
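
A quick way to confirm the resulting state and whether the API server name resolves at all (FQDN and resource names are taken from the environment JSON below; this is only a diagnostic sketch):

```bash
# Provisioning/power state of the cluster and the system node pool after the failed start
az aks show -g test-bench-exam-scheduler -n aks-test-bench \
  --query "{provisioningState:provisioningState, powerState:powerState.code, fqdn:fqdn}" -o table
az aks nodepool show -g test-bench-exam-scheduler --cluster-name aks-test-bench -n agentpool \
  --query provisioningState -o tsv

# Does the API server FQDN resolve publicly? (It should for a non-private cluster.)
nslookup aks-test-bench-dns-40b2217c.hcp.switzerlandnorth.azmk8s.io
```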

Environment: resource JSON of the cluster:

{
  "id": "/subscriptions/xxx/resourcegroups/test-bench-exam-scheduler/providers/Microsoft.ContainerService/managedClusters/aks-test-bench",
  "location": "switzerlandnorth",
  "name": "aks-test-bench",
  "type": "Microsoft.ContainerService/ManagedClusters",
  "properties": {
    "provisioningState": "Failed",
    "powerState": {
      "code": "Running"
    },
    "kubernetesVersion": "1.19.7",
    "dnsPrefix": "aks-test-bench-dns",
    "fqdn": "aks-test-bench-dns-40b2217c.hcp.switzerlandnorth.azmk8s.io",
    "agentPoolProfiles": [
      {
        "name": "agentpool",
        "count": 1,
        "vmSize": "Standard_F16s_v2",
        "osDiskSizeGB": 128,
        "osDiskType": "Managed",
        "maxPods": 110,
        "type": "VirtualMachineScaleSets",
        "provisioningState": "Failed",
        "powerState": {
          "code": "Running"
        },
        "orchestratorVersion": "1.19.7",
        "nodeLabels": {},
        "mode": "System",
        "osType": "Linux",
        "nodeImageVersion": "AKSUbuntu-1804gen2containerd-2021.02.17"
      }
    ],
    "servicePrincipalProfile": {
      "clientId": "msi"
    },
    "addonProfiles": {
      "azurePolicy": {
        "enabled": false,
        "config": null
      },
      "httpApplicationRouting": {
        "enabled": false,
        "config": null
      },
      "omsAgent": {
        "enabled": true,
        "config": {
          "logAnalyticsWorkspaceResourceID": "/subscriptions/xxx/resourceGroups/DefaultResourceGroup-CHN/providers/Microsoft.OperationalInsights/workspaces/DefaultWorkspace-63941cfd-7685-4122-8403-0d86fb03eca7-CHN"
        },
        "identity": {
          "resourceId": "/subscriptions/xxx/resourcegroups/MC_test-bench-exam-scheduler_aks-test-bench_switzerlandnorth/providers/Microsoft.ManagedIdentity/userAssignedIdentities/omsagent-aks-test-bench",
          "clientId": "201bbd22-d172-4de1-bf89-1598d742039e",
          "objectId": "d524fd5c-45ef-4812-90a6-43858b166c70"
        }
      }
    },
    "nodeResourceGroup": "MC_test-bench-exam-scheduler_aks-test-bench_switzerlandnorth",
    "enableRBAC": true,
    "networkProfile": {
      "networkPlugin": "kubenet",
      "loadBalancerSku": "Standard",
      "loadBalancerProfile": {
        "managedOutboundIPs": {
          "count": 1
        },
        "effectiveOutboundIPs": [
          {
            "id": "/subscriptions/xxx/resourceGroups/MC_test-bench-exam-scheduler_aks-test-bench_switzerlandnorth/providers/Microsoft.Network/publicIPAddresses/c1561c8a-ebed-40c8-b258-0c7e065b9ce4"
          }
        ]
      },
      "podCidr": "10.244.0.0/16",
      "serviceCidr": "10.0.0.0/16",
      "dnsServiceIP": "10.0.0.10",
      "dockerBridgeCidr": "172.17.0.1/16",
      "outboundType": "loadBalancer"
    },
    "maxAgentPools": 10,
    "apiServerAccessProfile": {
      "enablePrivateCluster": false
    },
    "identityProfile": {
      "kubeletidentity": {
        "resourceId": "/subscriptions/xxx/resourcegroups/MC_test-bench-exam-scheduler_aks-test-bench_switzerlandnorth/providers/Microsoft.ManagedIdentity/userAssignedIdentities/aks-test-bench-agentpool",
        "clientId": "xxx",
        "objectId": "xxx"
      }
    }
  },
  "identity": {
    "type": "SystemAssigned",
    "principalId": "xxx",
    "tenantId": "xxx"
  },
  "sku": {
    "name": "Basic",
    "tier": "Free"
  }
}

ghost commented 3 years ago

Hi FunkyFabe, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2. Please abide by the AKS repo Guidelines and Code of Conduct.
3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 3 years ago

Triage required from @Azure/aks-pm

jbouzekri commented 3 years ago

I have the exact same issue. If I stop and start the cluster without waiting too long (only a few seconds) between the two actions, it starts again successfully. I assume the cluster-level stop action eventually shuts everything down, including the managed API server on the Azure side; the DNS lookup failure on the managed API server name supports this. When the cluster is started again, the nodes come up faster than the managed control plane, so they cannot reach the API server and the cluster enters a Failed state.
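
If that timing theory holds, a workaround is to make sure the stop operation has fully completed before issuing the start. A minimal sketch (names are placeholders; the loop simply polls powerState):

```bash
RG=my-resource-group        # placeholder
CLUSTER=my-aks-cluster      # placeholder

az aks stop -g "$RG" -n "$CLUSTER"

# Wait until the control plane actually reports Stopped before starting again
until [ "$(az aks show -g "$RG" -n "$CLUSTER" --query powerState.code -o tsv)" = "Stopped" ]; do
  echo "waiting for the cluster to stop..."
  sleep 30
done

az aks start -g "$RG" -n "$CLUSTER"
```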

karyjac commented 3 years ago

Got the same issue as well: not a private cluster, no NSG, no UDR, no custom DNS.

Deployment failed. Correlation ID: d9347991-56a4-445d-9fb6-719f5382f7d1. Agents are unable to resolve Kubernetes API server name. It's likely custom DNS server is not correctly configured, please see https://aka.ms/aks/private-cluster#hub-and-spoke-with-custom-dns for more information. Details: VMSSAgentPoolReconciler retry failed: {
  "Category": "InternalError",
  "SubCode": "LookupNoSuchHost",
  "Dependency": "Microsoft.Compute/VirtualMachineScaleSet",
  "OriginError": {
   "code": "VMExtensionProvisioningError",
   "message": "VM has reported a failure when processing extension 'vmssCSE'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=52\n[stdout]\nes/pause:1.3.1 --pod-manifest-path=/etc/kubernetes/manifests --pod-max-pids=-1 --protect-kernel-defaults=true --read-only-port=0 --resolv-conf=/run/systemd/resolve/resolv.conf --rotate-certificates=false --streaming-connection-idle-timeout=4h --tls-cert-file=/etc/kubernetes/certs/kubeletserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/etc/kubernetes/certs/kubeletserver.key\n +-4080 /opt/cni/bin/azure-vnet-telemetry -d /opt/cni/bin\n\nMar 16 15:21:31 aks-agentpool-12471952-vmss000003 kubelet[3567]: I0316 15:21:31.063103 3567 kube_docker_client.go:342] Pulling image \\"mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod02232021\\": \\"Digest: sha256:2da799d9a188e6ffb278ad71aa14e5a8f3c6b0e873aad4fb5fc8147feceaa5e1\\"\nMar 16 15:21:35 aks-agentpool-12471952-vmss000003 kubelet[3567]: I0316 15:21:35.017567 3567 kube_docker_client.go:345] Stop pulling image \\"mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod02232021\\": \\"Status: Downloaded newer image for mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod02232021\\"\nMar 16 15:21:35 aks-agentpool-12471952-vmss000003 kubelet[3567]: I0316 15:21:35.031817 3567 azure_credentials.go:247] image(mcr.microsoft.com/azuremonitor/containerinsights/ciprod) is not from ACR, return empty authentication\nMar 16 15:21:35 aks-agentpool-12471952-vmss000003 kubelet[3567]: I0316 15:21:35.189293 3567 kube_docker_client.go:345] Stop pulling image \\"mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod02232021\\": \\"Status: Image is up to date for mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod02232021\\"\nMar 16 15:21:37 aks-agentpool-12471952-vmss000003 kubelet[3567]: E0316 15:21:37.254009 3567 webhook.go:111] Failed to make webhook authenticator request: Post https://aksbug-dns-cbeb3faf.hcp.francecentral.azmk8s.io:443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp: lookup aksbug-dns-cbeb3faf.hcp.francecentral.azmk8s.io: no such host\nMar 16 15:21:37 aks-agentpool-12471952-vmss000003 kubelet[3567]: E0316 15:21:37.254487 3567 server.go:253] Unable to authenticate the request due to an error: Post https://aksbug-dns-cbeb3faf.hcp.francecentral.azmk8s.io:443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp: lookup aksbug-dns-cbeb3faf.hcp.francecentral.azmk8s.io: no such host\nMar 16 15:21:37 aks-agentpool-12471952-vmss000003 kubelet[3567]: I0316 15:21:37.276141 3567 kubelet.go:1955] SyncLoop (PLEG): \\"omsagent-qcgpg_kube-system(ce42a61d-7f10-4262-8ef8-12e9f72ec7f0)\\", event: \u0026pleg.PodLifecycleEvent{ID:\\"ce42a61d-7f10-4262-8ef8-12e9f72ec7f0\\", Type:\\"ContainerStarted\\", Data:\\"3d6717c6c75e860f36c8beaffd71386060ef4fd05fa18928348b9d43fe8d2131\\"}\nMar 16 15:21:37 aks-agentpool-12471952-vmss000003 kubelet[3567]: I0316 15:21:37.286395 3567 kubelet.go:1955] SyncLoop (PLEG): \\"omsagent-rs-785b8584db-jbsps_kube-system(ab9a0a27-7c78-4c59-bdbc-07c60ad46dcf)\\", event: \u0026pleg.PodLifecycleEvent{ID:\\"ab9a0a27-7c78-4c59-bdbc-07c60ad46dcf\\", Type:\\"ContainerStarted\\", Data:\\"5b1e6c2de47a3cdb6e7bf795d9a5ea54c3e45a497571ddd4750abd5b9eef297c\\"}\nMar 16 15:21:37 aks-agentpool-12471952-vmss000003 kubelet[3567]: E0316 15:21:37.870116 3567 webhook.go:111] Failed to make webhook authenticator request: Post https://aksbug-dns-cbeb3faf.hcp.francecentral.azmk8s.io:443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp: lookup aksbug-dns-cbeb3faf.hcp.francecentral.azmk8s.io: no such host\nMar 16 15:21:37 aks-agentpool-12471952-vmss000003 kubelet[3567]: E0316 15:21:37.870181 3567 server.go:253] Unable to authenticate the request due to an error: Post https://aksbug-dns-cbeb3faf.hcp.francecentral.azmk8s.io:443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp: lookup aksbug-dns-cbeb3faf.hcp.francecentral.azmk8s.io: no such host\", \"Error\": \"\", \"ExecDuration\": \"142\" }\n\n[stderr]\n\"\r\n\r\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot ",
   "target": null,
   "details": null,
   "innererror": null,
   "additionalInfo": null
  }
 }
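
One way to confirm that the lookup really fails on the node itself is to run nslookup inside a VMSS instance via run-command (node resource group and FQDN below are placeholders; substitute your own values):

```bash
NODE_RG=MC_my-resource-group_my-aks-cluster_francecentral   # placeholder node resource group
FQDN=aksbug-dns-cbeb3faf.hcp.francecentral.azmk8s.io        # from 'az aks show --query fqdn'

# Find the agent pool scale set and test DNS resolution from instance 0
VMSS=$(az vmss list -g "$NODE_RG" --query "[0].name" -o tsv)
az vmss run-command invoke -g "$NODE_RG" -n "$VMSS" --instance-id 0 \
  --command-id RunShellScript --scripts "nslookup $FQDN"
```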

karyjac commented 3 years ago

@jbouzekri in which region is your cluster deployed?

jbouzekri commented 3 years ago

@karyjac: In francecentral.

karyjac commented 3 years ago

@jbouzekri Same region here. I guess this is a bug, maybe DNS replication, but I can't say for sure. @palma21 are you aware of any bug/known issue affecting this region?

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

smnmtzgr commented 3 years ago

I have the same issue, but can only reproduce it with k8s v1.20.5; with k8s v1.19.9 it works fine. The region is germanywestcentral.

Please investigate this issue; it's really bad behaviour and the bug should be found as fast as possible.
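
To compare the two versions in a controlled way, the version can be pinned at create time (a sketch; resource names are placeholders and the region/versions are assumed from the comment above):

```bash
# Which Kubernetes versions are currently offered in the region?
az aks get-versions --location germanywestcentral -o table

# Create one cluster per version, then run the same stop/start cycle against both
az aks create -g my-rg -n aks-11909 --kubernetes-version 1.19.9 --node-count 1 --generate-ssh-keys
az aks create -g my-rg -n aks-12005 --kubernetes-version 1.20.5 --node-count 1 --generate-ssh-keys
```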

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

miwithro commented 3 years ago

@nickkeller please investigate

ghost commented 3 years ago

Triage required from @Azure/aks-pm

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

wangyira commented 1 year ago

Are you still seeing this issue? cc @qpetraroia

drazkiewicz commented 1 year ago

I am still seeing this issue, twice in the space of a week, on brand new private clusters. The DNS setup is correct and verified. After another stop/start everything seems fine.
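
For the private-cluster case, a sketch of what is worth double-checking: the privatelink DNS zone that AKS creates in the node resource group (unless a custom private DNS zone was supplied) and its VNet links. Every VNet that must resolve the API server, including a hub VNet hosting custom DNS servers, needs a link. Resource group name below is a placeholder:

```bash
NODE_RG=MC_my-resource-group_my-private-aks_westeurope   # placeholder node resource group

# Private DNS zone created for the private cluster (privatelink.<region>.azmk8s.io)
az network private-dns zone list -g "$NODE_RG" -o table

# VNet links on that zone
ZONE=$(az network private-dns zone list -g "$NODE_RG" --query "[0].name" -o tsv)
az network private-dns link vnet list -g "$NODE_RG" -z "$ZONE" -o table
```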

alvinli222 commented 1 year ago

Hi folks, we are reviewing this again and will provide an update soon.

ashwajce commented 6 months ago

We are still facing this issue. Many of our clusters fail with a getaddrinfo for host \"aks***-7zi9a8cl.hcp.francecentral.azmk8s.io\" port 443: Name or service not known error, on version 1.28 clusters in the frc (France Central) and weu (West Europe) regions.
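
When a start attempt leaves a cluster in a Failed provisioning state, checking the state and retrying the start is a reasonable first step; a hedged sketch (cluster names are placeholders):

```bash
RG=my-resource-group      # placeholder
CLUSTER=my-aks-cluster    # placeholder

# Is the cluster stuck in a Failed state after the start attempt?
az aks show -g "$RG" -n "$CLUSTER" --query "{state:provisioningState, power:powerState.code}" -o table

# Retry the start; if it keeps failing, open a support request and include the correlation ID from the error
az aks start -g "$RG" -n "$CLUSTER"
```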