rgrwatson85 closed this issue 4 years ago
Action required from @Azure/aks-pm
This failed because the cluster uses a custom route table. When a new node was created, it could not perform any DNS resolution. We worked with Microsoft's global black belts to get the issue resolved.
What happened: I used the Kubernetes dashboard to scale my deployment up to 110 pods to force the node pool to scale up. After a VMSS node pool scales up, none of the containers on a new VMSS instance start correctly when Kubernetes is managing their lifecycle. We use this cluster to host Azure Pipelines build agents, and part of the container startup script downloads the agent service from our Azure DevOps tenant.
Here is the block of code that is causing the failure.
The logs for the pods all show the same error message.
curl: (6) Could not resolve host: xxxxxxx.visualstudio.com; Unknown error
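curl exit code 6 means name resolution failed before any network request was made. A minimal check run inside a failing pod can separate the resolver configuration from the curl call itself; this is only a sketch, assuming a POSIX shell and glibc's getent (which goes through roughly the same libc lookup path curl uses), and the hostname argument is a placeholder for your Azure DevOps host:

```shell
#!/bin/sh
# Hypothetical DNS sanity check to run inside a failing pod.
# check_dns returns getent's exit status; getent hosts resolves a name
# through the same NSS/libc path that curl's getaddrinfo call uses.
check_dns() {
    getent hosts "$1" > /dev/null 2>&1
}

echo "--- /etc/resolv.conf seen by this container ---"
cat /etc/resolv.conf 2>/dev/null

if check_dns "${1:-localhost}"; then
    echo "DNS resolution OK"
else
    echo "DNS resolution FAILED"
fi
```

If this fails inside the pod but succeeds on the node, the break is between the pod's configured nameserver and the node, not in the container image.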
Here is the kicker: if I SSH into the AKS node that is trying to run this pod, I can start a new container successfully using
docker run ...
and it downloads and registers the agent service with Azure DevOps successfully. This only happens on nodes created by the scale operation; the previously existing node can create and delete pods without issue.
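The docker run / kubelet difference points at the DNS path: by default, pods use the ClusterFirst DNS policy and query the cluster DNS service (kube-dns/CoreDNS), while docker run on the node inherits the node's own /etc/resolv.conf. One way to isolate this is a throwaway pod with dnsPolicy: Default, which forces the pod onto the node's resolver. This manifest is a sketch; the name, image, and hostname are illustrative:

```yaml
# Hypothetical debug pod: dnsPolicy Default makes the pod use the node's
# resolv.conf instead of the cluster DNS service. If resolution works here
# but fails under the default ClusterFirst policy, the break is on the path
# between the new node and kube-dns/CoreDNS (e.g. a custom route table).
apiVersion: v1
kind: Pod
metadata:
  name: dns-debug
spec:
  dnsPolicy: Default
  restartPolicy: Never
  containers:
  - name: dns-debug
    image: centos:7          # same base image as the failing agent container
    command: ["getent", "hosts", "visualstudio.com"]
```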
What you expected to happen: I expected the new VMSS instance created by AKS to be able to start my container successfully.
How to reproduce it (as minimally and precisely as possible): On a VMSS-backed cluster that uses a custom route table, scale a deployment past the existing node pool's capacity (I scaled to 110 pods via the Kubernetes dashboard), wait for a new VMSS instance to come up, and observe that pods scheduled on the new node fail DNS resolution.
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):
Size of cluster (how many worker nodes are in the cluster?): initially 1, and after the scale operation, 2
General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.) Hosts Azure Pipelines build agents
Others: The base image for the container is centos:7 and not ubuntu:16.04 as in the examples linked above