Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Containers Cannot Start After Node Pool Scales #1531

Closed: rgrwatson85 closed this issue 4 years ago

rgrwatson85 commented 4 years ago

What happened: I used the Kubernetes dashboard to scale my deployment up to 110 pods to force the node pool to scale up. After the VMSS node pool scales up, none of the containers scheduled onto a new VMSS instance are able to start correctly when Kubernetes is managing the lifecycle. We are using this cluster to host Azure Pipelines build agents, and part of the container startup script downloads the agent service from our Azure DevOps tenant.

Here is the block of code that is causing the failure.

print_header "[1/5] Determining matching Azure Pipelines agent..."

# Ask Azure DevOps for the matching linux-x64 agent package.
# $AZP_TOKEN_FILE holds the personal access token used for authentication;
# $VSTS_URL is the organization URL (e.g. https://xxxxxxx.visualstudio.com).
AZP_AGENT_RESPONSE=$(curl -LsS \
  -u user:"$(cat "$AZP_TOKEN_FILE")" \
  -H 'Accept:application/json;api-version=3.0-preview' \
  "$VSTS_URL/_apis/distributedtask/packages/agent?platform=linux-x64")

The logs for the pods all show the same error message:

curl: (6) Could not resolve host: xxxxxxx.visualstudio.com; Unknown error
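To confirm that the failure is DNS resolution itself rather than anything else in the agent script, one quick check is to schedule a throwaway pod onto the affected node and resolve the host directly. A minimal sketch, assuming a pullable busybox image and using placeholder node and host names:

# Pin a one-off pod to the suspect node and try to resolve the host.
kubectl run dns-test --rm -it --restart=Never \
  --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"aks-nodepool1-12345678-vmss000001"}}' \
  -- nslookup xxxxxxx.visualstudio.com

If this fails with the same resolution error, the problem is DNS resolution from pods on that node, not the container image.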

Here is the kicker: if I SSH into the AKS node that is trying to run this pod, I can start a new container successfully using docker run ..., and it downloads and registers the agent service with Azure DevOps successfully.
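For reference, the manual test looked roughly like the following; the image name and environment variables here are placeholders inferred from the startup script above, not the exact command:

# Hypothetical reconstruction of the manual docker test (names are placeholders).
docker run -it --rm \
  -e VSTS_URL="https://xxxxxxx.visualstudio.com" \
  -e AZP_TOKEN="<personal-access-token>" \
  myregistry.azurecr.io/azp-agent:latest

Run this way, the same image that fails under kubelet downloads and registers the agent without any DNS error.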

This only happens on the nodes that were created due to the scale operation, and the previously existing node is able to create and delete pods without issue.

What you expected to happen: I expected the new VMSS instance created by AKS to be able to start my container successfully.

How to reproduce it (as minimally and precisely as possible):

  1. Create an AKS cluster with a Linux node pool backed by a VMSS with only 1 instance.
  2. Create a deployment to AKS where the container image is created similarly to this process.
  3. Ensure that the deployment does not require an additional node to be created.
  4. Ensure that the pod starts up correctly and that the agent service(s) have registered with Azure DevOps.
  5. From the Kubernetes Dashboard, scale the deployment up to >=110 pods (an equivalent kubectl command is sketched after this list).
  6. Wait for scaling operation to complete.
  7. Review the status of the pods running on the new node. The pods should show a status of Running while container startup fails when determining the download URL for the agent service.
  8. SSH into the node created in the scale up operation.
  9. Attempt to start a container instance using the appropriate docker commands.
  10. Ensure the container starts successfully and that the agent service has registered with Azure DevOps.
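For step 5, the dashboard action is equivalent to scaling the deployment with kubectl; the deployment name here is a placeholder:

# Scale the agent deployment past the capacity of a single node.
kubectl scale deployment azp-agent --replicas=110

# Watch the new node come up and the pods schedule onto it.
kubectl get nodes -w
kubectl get pods -o wide -w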

Anything else we need to know?:

Environment:

github-actions[bot] commented 4 years ago

Action required from @Azure/aks-pm

rgrwatson85 commented 4 years ago

The reason this failed is that the cluster uses a custom route table. When a new node was created, it could not perform any DNS resolution. We worked with Microsoft's global black belts to get this issue resolved.
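For anyone who hits the same symptom: one way to verify is to inspect the route table attached to the cluster's node subnet and confirm that its routes still allow new nodes to reach DNS and the other endpoints AKS requires. A sketch with the Azure CLI, using placeholder resource names:

# Find the route table associated with the AKS node subnet.
az network vnet subnet show \
  --resource-group my-rg \
  --vnet-name my-vnet \
  --name aks-subnet \
  --query routeTable.id -o tsv

# List the routes in that table and look for anything that blackholes traffic.
az network route-table route list \
  --resource-group my-rg \
  --route-table-name my-route-table \
  -o table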