Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] FailedCreatePodSandBox errors #4342

Open kevinharing opened 4 months ago

kevinharing commented 4 months ago

Describe the bug Since around 2024-05-08T12:00:00Z we have been seeing the error below appear frequently in the kube events table. It occurs on all our clusters (4 in total) and in all namespaces. It seems related to pods being spun up while the node cannot connect to a certain service running on localhost.

FailedCreatePodSandBox

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox '3e16c561e9d521f52cd10e2a61965d2e063c27ac336f465aeee50eb77c055935': plugin type='azure-vnet' failed (add): IPAM Invoker Add failed with error: Failed to get IP address from CNS with error: %w: http request failed: Post 'http://localhost:10090/network/requestipconfig': dial tcp 127.0.0.1:10090: connect: connection refused
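
To gauge how often this event fires across a cluster, the events can be filtered by reason. The following is a minimal sketch using standard kubectl options; the function names are hypothetical and not part of the original report.

```shell
# Hypothetical helper: list FailedCreatePodSandBox events in every namespace,
# sorted by the time they were last seen.
list_sandbox_failures() {
  kubectl get events --all-namespaces \
    --field-selector reason=FailedCreatePodSandBox \
    --sort-by=.lastTimestamp
}

# Hypothetical helper: read event lines on stdin and count those that mention
# the CNS endpoint from the error above (localhost:10090).
count_cns_refused() {
  grep -c 'localhost:10090' || true
}

# Usage: list_sandbox_failures | count_cns_refused
```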

To Reproduce Run a deployment.

Expected behavior No errors should occur when running a deployment.


JoeyC-Dev commented 4 months ago

Based on your error message "Failed to get IP address from CNS with error", it looks like you are using "Azure CNI for dynamic IP allocation". So I set up a test cluster as follows:

# Basic parameter
ranNum=$(echo $RANDOM)
rG=aks-test-${ranNum}
aks=aks-test-${ranNum}
aksVer=1.29.4
logAnalyticsWorkspace=aks-law-${ranNum}
location=southeastasia

# AKS set-up (Azure CNI for dynamic IP allocation)
az group create -n ${rG} -l ${location} -o none
groupID=$(az group show -n ${rG} --query id -o tsv)

az monitor log-analytics workspace create -n ${logAnalyticsWorkspace} -g ${rG} -o none
logAnalyticsWorkspaceId=$(az resource show -n ${logAnalyticsWorkspace} -g ${rG} \
--namespace Microsoft.OperationalInsights --resource-type workspaces --query id -o tsv)

az network vnet create -g ${rG} -n aks-vnet --address-prefixes 10.0.0.0/8 -o none 
az network vnet subnet create -g ${rG} --vnet-name aks-vnet --name node1 --address-prefixes 10.240.0.0/16 -o none 
az network vnet subnet create -g ${rG} --vnet-name aks-vnet --name pod1 --address-prefixes 10.241.0.0/16 -o none 

az aks create -n ${aks} -g ${rG} --node-vm-size Standard_A4_v2 --node-count 2 \
-a monitoring --workspace-resource-id ${logAnalyticsWorkspaceId} --enable-syslog \
--vnet-subnet-id ${groupID}/providers/Microsoft.Network/virtualNetworks/aks-vnet/subnets/node1 \
--pod-subnet-id ${groupID}/providers/Microsoft.Network/virtualNetworks/aks-vnet/subnets/pod1 \
--network-plugin azure --kubernetes-version ${aksVer} --node-os-upgrade-channel None --no-ssh-key -o none

# Deploy example deployment
az aks get-credentials -n ${aks} -g ${rG}

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
EOF

# Debug info
CNSinfo=$(kubectl get ds azure-cns -n kube-system -o json  | jq -r '.spec.template.spec.containers' | jq '[.[]|select(.name=="cns-container")][0].image')
echo "Your CNS image info: ${CNSinfo}"

# My result: "mcr.microsoft.com/containernetworking/azure-cns:v1.5.26"

I finished setting up my AKS cluster around 2024-06-08T17:43:24Z (the time nginx-deployment was deployed) and found no errors at all. [image] [image] No abnormal events on the Pods: [image]

After that, I ran another script that repeatedly creates and deletes the Pods:

#!/bin/bash
i=1
while [ $i -ge 1 ]
do
echo "Nginx deployment is being created ${i} time(s)."

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
EOF

sleep 40; kubectl delete deployment nginx-deployment; ((i+=1)); sleep 20
done

Then I went to sleep and came back later to see what had happened. [image] Nothing wrong. (Remember the time I mentioned before? The errors shown are the ones that appeared during the initial set-up, and they never popped up again afterwards.) [image]

So far, as far as I can see, this looks like something related to your network architecture. At least from the initial set-up point of view, nothing is wrong for now.

It is suggested you provide more information about:

As this reply shows, the information provided so far is not enough to locate the issue. If the issue is not reproducible, you will probably have to open a support ticket. [image] But since you see it on all 4 AKS clusters, it may well be reproducible; there is just not enough information yet to reproduce it.

If you have already raised a support request, you can mention it here as well. There is still a chance that the issue cannot be reproduced outside your environment, so I suggest opening a support ticket to let a support engineer see what is happening. [image]

kevinharing commented 4 months ago

@JoeyC-Dev We do indeed use Azure CNI for dynamic IP allocation. I looked into the logs a bit more and can confirm it only happens when the node has just spun up, right when the pods are being deployed, just as your investigation shows. Our workloads are really dynamic and sometimes pretty large (80+ nodes needed for a deployment on some occasions), which explains why I see it pop up in the logs so often. If this is normal operation after a node is spun up, I suggest either not logging the warning or handling the error gracefully, because the event log is currently being flooded with these messages, as you can imagine.
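
One way to check the "only right after node start" pattern is to compare an event's timestamp with the creation time of the node it occurred on. The sketch below assumes GNU date and RFC 3339 timestamps as printed by kubectl; both function names are hypothetical.

```shell
# Hypothetical helper: seconds elapsed between a node's creationTimestamp and
# an event timestamp. Assumes GNU date, which accepts RFC 3339 input via -d.
seconds_since_node_creation() {
  local node_created="$1" event_time="$2"
  echo $(( $(date -u -d "$event_time" +%s) - $(date -u -d "$node_created" +%s) ))
}

# Hypothetical helper: print the creation time of the node named in $1.
node_created_at() {
  kubectl get node "$1" -o jsonpath='{.metadata.creationTimestamp}'
}

# Usage (node name is a placeholder):
#   seconds_since_node_creation "$(node_created_at <node-name>)" 2024-06-24T07:09:02Z
```

A small result (seconds to a few minutes) for every failing event would support the observation that the errors are confined to node start-up.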

My CNS version is v1.5.26 just like yours BTW and support request is # 2406070050001264.

JoeyC-Dev commented 4 months ago

@kevinharing I usually ignore these logs because they only appear during the set-up process. I never considered this a bug, and I am not sure it should be categorized as one. The only thing I can suggest is to change your title and see whether a PM or the PG picks it up, since it is verified as "reproducible" with the script I used above; I am just not sure whether this is bug-as-intended (if so, it ends here and will probably become won't-fix). If the support engineer assigned to your case believes this needs to be submitted to the PG team, the support ticket will get you more responsive replies, as SEs are required to provide updates when the customer (Cx) asks for them (but it must be categorized as a bug first). The GitHub issues here are all "best-effort" only. But at least we now know when it happens.

Also, I suggest you describe the business impact to the SE and tell them how this is affecting your product.

Based on your description "It happens on all namespaces", I also suggest checking whether the "node.kubernetes.io/unreachable" taint on the nodes disappears before the CNS error logs stop. If so, this might be the root cause (again, I don't know whether early removal of this taint is the root cause or another "intended" thing). But of course, checking this is also the SE's job, so it is up to you.
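
Kubernetes normally applies node.kubernetes.io/unreachable as a node taint, so one way to watch for it is to print the taint keys of every node while new nodes join. This is a sketch with hypothetical function names, not a verified diagnosis procedure.

```shell
# Hypothetical helper: print each node together with the keys of its taints.
list_node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Hypothetical helper: read the table on stdin and keep only nodes that still
# carry the unreachable taint.
only_unreachable() {
  grep 'node.kubernetes.io/unreachable' || true
}

# Usage: list_node_taints | only_unreachable
```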

You can try Deallocate mode as a workaround, but it introduces another existing bug-as-intended issue: https://github.com/Azure/AKS/issues/4313 The reason it can (potentially) work as a workaround is that the node is never deregistered, only deallocated (and the disk is preserved). So this might work.
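
For reference, the scale-down mode is set per node pool. The sketch below wraps the Azure CLI call; the function name and the resource names in the usage line are placeholders, and whether this helps with the CNS errors is untested.

```shell
# Hypothetical helper: switch an existing node pool to Deallocate scale-down
# mode, so scaled-down nodes are stopped instead of deleted.
# Arguments: resource group, cluster name, node pool name.
set_scale_down_mode_deallocate() {
  local rg="$1" cluster="$2" pool="$3"
  az aks nodepool update \
    --resource-group "$rg" --cluster-name "$cluster" --name "$pool" \
    --scale-down-mode Deallocate
}

# Usage (placeholder names):
#   set_scale_down_mode_deallocate my-rg my-aks userpool
```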

Wish you good luck.

mortenjoenby commented 4 months ago

We just started seeing this on 1.28.5. We had one occurrence on Friday and another just now this morning. We are running with CNI as well.

mortenjoenby commented 4 months ago

@JoeyC-Dev, we are seeing that azure-cns on the worker node is failing when this happens:

cns-container 2024/06/24 07:09:02 [1] [Configuration] Using config path: /etc/azure-cns/cns_config.json
cns-container 2024/06/24 07:09:02 [1] GetAzureCloud querying url: http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2018-10-01&format=text
cns-container 2024/06/24 07:09:02 [1] [Utils] Initializing HTTP client with connection timeout: 7, response header timeout: 7
cns-container 2024/06/24 07:09:02 [1] AI Telemetry Handle created
cns-container 2024/06/24 07:09:02 [1] [Azure CNS] Using config: &{AZRSettings:{PopulateHomeAzCacheRetryIntervalSecs:60} AsyncPodDeletePath:/var/run/azure-vnet/deleteIDs CNIConflistFilepath:/etc/cni/net.d/15-azure-swift-overlay.conflist CNIConflistScenario:v4overlay ChannelMode:CRD EnableAsyncPodDelete:true EnableCNIConflistGeneration:false EnableIPAMv2:false EnablePprof:false EnableStateMigration:false EnableSubnetScarcity:false EnableSwiftV2:false InitializeFromCNI:true KeyVaultSettings:{URL: CertificateName: RefreshIntervalInHrs:12} MSISettings:{ResourceID:} ManageEndpointState:false ManagedSettings:{PrivateEndpoint: InfrastructureNetworkID: NodeID: NodeSyncIntervalInSeconds:30} MellanoxMonitorIntervalSecs:0 MetricsBindAddress::10092 ProgramSNATIPTables:false SWIFTV2Mode: SyncHostNCTimeoutMs:500 SyncHostNCVersionIntervalMs:1000 TLSCertificatePath: TLSEndpoint: TLSPort: TLSSubjectName: TelemetrySettings:{DisableAll:false DisableTrace:false DisableMetric:false DisableEvent:false TelemetryBatchSizeBytes:16384 TelemetryBatchIntervalInSecs:15 HeartBeatIntervalInMins:30 DisableMetadataRefreshThread:false RefreshIntervalInSecs:15 DebugMode:false SnapshotIntervalInMins:60 AppInsightsInstrumentationKey:} UseHTTPS:false WatchPods:false WireserverIP:168.63.129.16}
cns-container 2024/06/24 07:09:02 [1] Running on Linux version 5.15.0-1058-azure (buildd@bos03-amd64-035) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #66-Ubuntu SMP Fri Feb 16 00:40:24 UTC 2024
cns-container ⇨ http server started on :10092
init-cni-dropgz ts=1719212922.7231965 level=info msg="wrote file" sources=azure-vnet,azure-vnet-telemetry,azure-swift-overlay.conflist outputs=/opt/cni/bin/azure-vnet,/opt/cni/bin/azure-vnet-telemetry,/etc/cni/net.d/15-azure-swift-overlay.conflist cmd=deploy src=azure-vnet dest=/opt/cni/bin/azure-vnet
init-cni-dropgz ts=1719212922.7917175 level=info msg="wrote file" sources=azure-vnet,azure-vnet-telemetry,azure-swift-overlay.conflist outputs=/opt/cni/bin/azure-vnet,/opt/cni/bin/azure-vnet-telemetry,/etc/cni/net.d/15-azure-swift-overlay.conflist cmd=deploy src=azure-vnet-telemetry dest=/opt/cni/bin/azure-vnet-telemetry
init-cni-dropgz ts=1719212922.791925 level=info msg="wrote file" sources=azure-vnet,azure-vnet-telemetry,azure-swift-overlay.conflist outputs=/opt/cni/bin/azure-vnet,/opt/cni/bin/azure-vnet-telemetry,/etc/cni/net.d/15-azure-swift-overlay.conflist cmd=deploy src=azure-swift-overlay.conflist dest=/etc/cni/net.d/15-azure-swift-overlay.conflist
init-cni-dropgz ts=1719212922.7919345 level=info msg="successfully wrote files" sources=azure-vnet,azure-vnet-telemetry,azure-swift-overlay.conflist outputs=/opt/cni/bin/azure-vnet,/opt/cni/bin/azure-vnet-telemetry,/etc/cni/net.d/15-azure-swift-overlay.conflist cmd=deploy
cns-container 2024/06/24 07:09:02 [1] [Azure CNS] GetPrimaryInterfaceInfoFromHost
cns-container 2024/06/24 07:09:02 [1] [Telemetry] Request metadata from wireserver
cns-container 2024/06/24 07:09:02 [1] [Azure CNS] Response received from NMAgent for get interface details: <Interfaces><Interface MacAddress="6045BDF58D22" IsPrimary="true"><IPSubnet Prefix="10.210.8.0/22"><IPAddress Address="10.210.8.11" IsPrimary="true"/></IPSubnet></Interface></Interfaces>
cns-container 2024/06/24 07:09:02 [1] [Azure CNS] Initialize HTTPRestService
cns-container 2024/06/24 07:09:02 [1] HTTP listener will be started later after CNS state has been reconciled
cns-container 2024/06/24 07:09:02 [1] [Azure CNS] restoreState
Stream closed EOF for kube-system/azure-cns-bk2ph (init-cni-dropgz)
cns-container 2024/06/24 07:09:02 [1] [Azure CNS]  Restored state, &{Location: NetworkType: OrchestratorType:KubernetesCRD NodeID: Initialized:false ContainerIDByOrchestratorContext:map] ContainerStatus:map[0010e16d-7912-45ac-89f2-117647df3841:{ID:0010e16d-7912-45ac-89f2-117647df3841 VMVersion:0 HostVersion:0 CreateNetworkContainerRequest:{HostPrimaryIP:10.210.8.11 Version:0 NetworkContainerType:Docker NetworkContainerid:0010e16d-7912-45ac-89f2-117647df3841 PrimaryInterfaceIdentifier: AuthorizationToken: LocalIPConfiguration:{IPSubnet:{IPAddress: PrefixLength:0} DNSServers:] GatewayIPAddress:} OrchestratorContext:[110 117 108 108] IPConfiguration:{IPSubnet:{IPAddress:172.16.95.0 PrefixLength:12} DNSServers:] GatewayIPAddress:} SecondaryIPConfigs:map[172.16.95.0:{IPAddress:172.16.95.0 NCVersion:0} 172.16.95.1:{IPAddress:172.16.95.1 NCVersion:0} 172.16.95.10:{IPAddress:172.16.95.10 NCVersion:0} 172.16.95.100:{IPAddress:172.16.95.100 NCVersion:0} 172.16.95.101:{IPAddress:172.16.95.101 NCVersion:0} 172.16.95.102:{IPAddress:172.16.95.102 NCVersion:0} 172.16.95.103:{IPAddress:172.16.95.103 NCVersion:0} 172.16.95.104:{IPAddress:172.16.95.104 NCVersion:0} 172.16.95.105:{IPAddress:172.16.95.105 NCVersion:0} 172.16.95.106:{IPAddress:172.16.95.106 NCVersion:0} 172.16.95.107:{IPAddress:172.16.95.107 NCVersion:0} 172.16.95.108:{IPAddress:172.16.95.108 NCVersion:0} 172.16.95.109:{IPAddress:172.16.95.109 NCVersion:0} 172.16.95.11:{IPAddress:172.16.95.11 NCVersion:0} 172.16.95.110:{IPAddress:172.16.95.110 NCVersion:0} 172.16.95.111:{IPAddress:172.16.95.111 NCVersion:0} 172.16.95.112:{IPAddress:172.16.95.112 NCVersion:0} 172.16.95.113:{IPAddress:172.16.95.113 NCVersion:0} 172.16.95.114:{IPAddress:172.16.95.114 NCVersion:0} 172.16.95.115:{IPAddress:172.16.95.115 NCVersion:0} 172.16.95.116:{IPAddress:172.16.95.116 NCVersion:0} 172.16.95.117:{IPAddress:172.16.95.117 NCVersion:0} 172.16.95.118:{IPAddress:172.16.95.118 NCVersion:0} 172.16.95.119:{IPAddress:172.16.95.119 NCVersion:0} 
172.16.95.12:{IPAddress:172.16.95.12 NCVersion:0} 172.16.95.120:{IPAddress:172.16.95.120 NCVersion:0} 172.16.95.121:{IPAddress:172.16.95.121 NCVersion:0} 172.16.95.122:{IPAddress:172.16.95.122 NCVersion:0} 172.16.95.123:{IPAddress:172.16.95.123 NCVersion:0} 172.16.95.124:{IPAddress:172.16.95.124 NCVersion:0} 172.16.95.125:{IPAddress:172.16.95.125 NCVersion:0} 172.16.95.126:{IPAddress:172.16.95.126 NCVersion:0} 172.16.95.127:{IPAddress:172.16.95.127 NCVersion:0} 172.16.95.128:{IPAddress:172.16.95.128 NCVersion:0} 172.16.95.129:{IPAddress:172.16.95.129 NCVersion:0} 172.16.95.13:{IPAddress:172.16.95.13 NCVersion:0} 172.16.95.130:{IPAddress:172.16.95.130 NCVersion:0} 172.16.95.131:{IPAddress:172.16.95.131 NCVersion:0} 172.16.95.132:{IPAddress:172.16.95.132 NCVersion:0} 172.16.95.133:{IPAddress:172.16.95.133 NCVersion:0} 172.16.95.134:{IPAddress:172.16.95.134 NCVersion:0} 172.16.95.135:{IPAddress:172.16.95.135 NCVersion:0} 172.16.95.136:{IPAddress:172.16.95.136 NCVersion:0} 172.16.95.137:{IPAddress:172.16.95.137 NCVersion:0} 172.16.95.138:{IPAddress:172.16.95.138 NCVersion:0} 172.16.95.139:{IPAddress:172.16.95.139 NCVersion:0} 172.16.95.14:{IPAddress:172.16.95.14 NCVersion:0} 172.16.95.140:{IPAddress:172.16.95.140 NCVersion:0} 172.16.95.141:{IPAddress:172.16.95.141 NCVersion:0} 172.16.95.142:{IPAddress:172.16.95.142 NCVersion:0} 172.16.95.143:{IPAddress:172.16.95.143 NCVersion:0} 172.16.95.144:{IPAddress:172.16.95.144 NCVersion:0} 172.16.95.145:{IPAddress:172.16.95.145 NCVersion:0} 172.16.95.146:{IPAddress:172.16.95.146 NCVersion:0} 172.16.95.147:{IPAddress:172.16.95.147 NCVersion:0} 172.16.95.148:{IPAddress:172.16.95.148 NCVersion:0} 172.16.95.149:{IPAddress:172.16.95.149 NCVersion:0} 172.16.95.15:{IPAddress:172.16.95.15 NCVersion:0} 172.16.95.150:{IPAddress:172.16.95.150 NCVersion:0} 172.16.95.151:{IPAddress:172.16.95.151 NCVersion:0} 172.16.95.152:{IPAddress:172.16.95.152 NCVersion:0} 172.16.95.153:{IPAddress:172.16.95.153 NCVersion:0} 
172.16.95.154:{IPAddress:172.16.95.154 NCVersion:0} 172.16.95.155:{IPAddress:172.16.95.155 NCVersion:0} 172.16.95.156:{IPAddress:172.16.95.156 NCVersion:0} 172.16.95.157:{IPAddress:172.16.95.157 NCVersion:0} 172.16.95.158:{IPAddress:172.16.95.158 NCVersion:0} 172.16.95.159:{IPAddress:172.16.95.159 NCVersion:0} 172.16.95.16:{IPAddress:172.16.95.16 NCVersion:0} 172.16.95.160:{IPAddress:172.16.95.160 NCVersion:0} 172.16.95.161:{IPAddress:172.16.95.161 NCVersion:0} 172.16.95.162:{IPAddress:172.16.95.162 NCVersion:0} 172.16.95.163:{IPAddress:172.16.95.163 NCVersion:0} 172.16.95.164:{IPAddress:172.16.95.164 NCVersion:0} 172.16.95.165:{IPAddress:172.16.95.165 NCVersion:0} 172.16.95.166:{IPAddress:172.16.95.166 NCVersion:0} 172.16.95.167:{IPAddress:172.16.95.167 NCVersion:0} 172.16.95.168:{IPAddress:172.16.95.168 NCVersion:0} 172.16.95.169:{IPAddress:172.16.95.169 NCVersion:0} 172.16.95.17:{IPAddress:172.16.95.17 NCVersion:0} 172.16.95.170:{IPAddress:172.16.95.170 NCVersion:0} 172.16.95.171:{IPAddress:172.16.95.171 NCVersion:0} 172.16.95.172:{IPAddress:172.16.95.172 NCVersion:0} 172.16.95.173:{IPAddress:172.16.95.173 NCVersion:0} 172.16.95.174:{IPAddress:172.16.95.174 NCVersion:0} 172.16.95.175:{IPAddress:172.16.95.175 NCVersion:0} 172.16.95.176:{IPAddress:172.16.95.176 NCVersion:0} 172.16.95.177:{IPAddress:172.16.95.177 NCVersion:0} 172.16.95.178:{IPAddress:172.16.95.178 NCVersion:0} 172.16.95.179:{IPAddress:172.16.95.179 NCVersion:0} 172.16.95.18:{IPAddress:172.16.95.18 NCVersion:0} 172.16.95.180:{IPAddress:172.16.95.180 NCVersion:0} 172.16.95.181:{IPAddress:172.16.95.181 NCVersion:0} 172.16.95.182:{IPAddress:172.16.95.182 NCVersion:0} 172.16.95.183:{IPAddress:172.16.95.183 NCVersion:0} 172.16.95.184:{IPAddress:172.16.95.184 NCVersion:0} 172.16.95.185:{IPAddress:172.16.95.185 NCVersion:0} 172.16.95.186:{IPAddress:172.16.95.186 NCVersion:0} 172.16.95.187:{IPAddress:172.16.95.187 NCVersion:0} 172.16.95.188:{IPAddress:172.16.95.188 NCVersion:0} 
172.16.95.189:{IPAddress:172.16.95.189 NCVersion:0} 172.16.95.19:{IPAddress:172.16.95.19 NCVersion:0} 172.16.95.190:{IPAddress:172.16.95.190 NCVersion:0} 172.16.95.191:{IPAddress:172.16.95.191 NCVersion:0} 172.16.95.192:{IPAddress:172.16.95.192 NCVersion:0} 172.16.95.193:{IPAddress:172.16.95.193 NCVersion:0} 172.16.95.194:{IPAddress:172.16.95.194 NCVersion:0} 172.16.95.195:{IPAddress:172.16.95.195 NCVersion:0} 172.16.95.196:{IPAddress:172.16.95.196 NCVersion:0} 172.16.95.197:{IPAddress:172.16.95.197 NCVersion:0} 172.16.95.198:{IPAddress:172.16.95.198 NCVersion:0} 172.16.95.199:{IPAddress:172.16.95.199 NCVersion:0} 172.16.95.2:{IPAddress:172.16.95.2 NCVersion:0} 172.16.95.20:{IPAddress:172.16.95.20 NCVersion:0} 172.16.95.200:{IPAddress:172.16.95.200 NCVersion:0} 172.16.95.201:{IPAddress:172.16.95.201 NCVersion:0} 172.16.95.202:{IPAddress:172.16.95.202 NCVersion:0} 172.16.95.203:{IPAddress:172.16.95.203 NCVersion:0} 172.16.95.204:{IPAddress:172.16.95.204 NCVersion:0} 172.16.95.205:{IPAddress:172.16.95.205 NCVersion:0} 172.16.95.206:{IPAddress:172.16.95.206 NCVersion:0} 172.16.95.207:{IPAddress:172.16.95.207 NCVersion:0} 172.16.95.208:{IPAddress:172.16.95.208 NCVersion:0} 172.16.95.209:{IPAddress:172.16.95.209 NCVersion:0} 172.16.95.21:{IPAddress:172.16.95.21 NCVersion:0} 172.16.95.210:{IPAddress:172.16.95.210 NCVersion:0} 172.16.95.211:{IPAddress:172.16.95.211 NCVersion:0} 172.16.95.212:{IPAddress:172.16.95.212 NCVersion:0} 172.16.95.213:{IPAddress:172.16.95.213 NCVersion:0} 172.16.95.214:{IPAddress:172.16.95.214 NCVersion:0} 172.16.95.215:{IPAddress:172.16.95.215 NCVersion:0} 172.16.95.216:{IPAddress:172.16.95.216 NCVersion:0} 172.16.95.217:{IPAddress:172.16.95.217 NCVersion:0} 172.16.95.218:{IPAddress:172.16.95.218 NCVersion:0} 172.16.95.219:{IPAddress:172.16.95.219 NCVersion:0} 172.16.95.22:{IPAddress:172.16.95.22 NCVersion:0} 172.16.95.220:{IPAddress:172.16.95.220 NCVersion:0} 172.16.95.221:{IPAddress:172.16.95.221 NCVersion:0} 
172.16.95.222:{IPAddress:172.16.95.222 NCVersion:0} 172.16.95.223:{IPAddress:172.16.95.223 NCVersion:0} 172.16.95.224:{IPAddress:172.16.95.224 NCVersion:0} 172.16.95.225:{IPAddress:172.16.95.225 NCVersion:0} 172.16.95.226:{IPAddress:172.16.95.226 NCVersion:0} 172.16.95.227:{IPAddress:172.16.95.227 NCVersion:0} 172.16.95.228:{IPAddress:172.16.95.228 NCVersion:0} 172.16.95.229:{IPAddress:172.16.95.229 NCVersion:0} 172.16.95.23:{IPAddress:172.16.95.23 NCVersion:0} 172.16.95.230:{IPAddress:172.16.95.230 NCVersion:0} 172.16.95.231:{IPAddress:172.16.95.231 NCVersion:0} 172.16.95.232:{IPAddress:172.16.95.232 NCVersion:0} 172.16.95.233:{IPAddress:172.16.95.233 NCVersion:0} 172.16.95.234:{IPAddress:172.16.95.234 NCVersion:0} 172.16.95.235:{IPAddress:172.16.95.235 NCVersion:0} 172.16.95.236:{IPAddress:172.16.95.236 NCVersion:0} 172.16.95.237:{IPAddress:172.16.95.237 NCVersion:0} 172.16.95.238:{IPAddress:172.16.95.238 NCVersion:0} 172.16.95.239:{IPAddress:172.16.95.239 NCVersion:0} 172.16.95.24:{IPAddress:172.16.95.24 NCVersion:0} 172.16.95.240:{IPAddress:172.16.95.240 NCVersion:0} 172.16.95.241:{IPAddress:172.16.95.241 NCVersion:0} 172.16.95.242:{IPAddress:172.16.95.242 NCVersion:0} 172.16.95.243:{IPAddress:172.16.95.243 NCVersion:0} 172.16.95.244:{IPAddress:172.16.95.244 NCVersion:0} 172.16.95.245:{IPAddress:172.16.95.245 NCVersion:0} 172.16.95.246:{IPAddress:172.16.95.246 NCVersion:0} 172.16.95.247:{IPAddress:172.16.95.247 NCVersion:0} 172.16.95.248:{IPAddress:172.16.95.248 NCVersion:0} 172.16.95.249:{IPAddress:172.16.95.249 NCVersion:0} 172.16.95.25:{IPAddress:172.16.95.25 NCVersion:0} 172.16.95.250:{IPAddress:172.16.95.250 NCVersion:0} 172.16.95.251:{IPAddress:172.16.95.251 NCVersion:0} 172.16.95.252:{IPAddress:172.16.95.252 NCVersion:0} 172.16.95.253:{IPAddress:172.16.95.253 NCVersion:0} 172.16.95.254:{IPAddress:172.16.95.254 NCVersion:0} 172.16.95.255:{IPAddress:172.16.95.255 NCVersion:0} 172.16.95.26:{IPAddress:172.16.95.26 NCVersion:0} 
172.16.95.27:{IPAddress:172.16.95.27 NCVersion:0} 172.16.95.28:{IPAddress:172.16.95.28 NCVersion:0} 172.16.95.29:{IPAddress:172.16.95.29 NCVersion:0} 172.16.95.3:{IPAddress:172.16.95.3 NCVersion:0} 172.16.95.30:{IPAddress:172.16.95.30 NCVersion:0} 172.16.95.31:{IPAddress:172.16.95.31 NCVersion:0} 172.16.95.32:{IPAddress:172.16.95.32 NCVersion:0} 172.16.95.33:{IPAddress:172.16.95.33 NCVersion:0} 172.16.95.34:{IPAddress:172.16.95.34 NCVersion:0} 172.16.95.35:{IPAddress:172.16.95.35 NCVersion:0} 172.16.95.36:{IPAddress:172.16.95.36 NCVersion:0} 172.16.95.37:{IPAddress:172.16.95.37 NCVersion:0} 172.16.95.38:{IPAddress:172.16.95.38 NCVersion:0} 172.16.95.39:{IPAddress:172.16.95.39 NCVersion:0} 172.16.95.4:{IPAddress:172.16.95.4 NCVersion:0} 172.16.95.40:{IPAddress:172.16.95.40 NCVersion:0} 172.16.95.41:{IPAddress:172.16.95.41 NCVersion:0} 172.16.95.42:{IPAddress:172.16.95.42 NCVersion:0} 172.16.95.43:{IPAddress:172.16.95.43 NCVersion:0} 172.16.95.44:{IPAddress:172.16.95.44 NCVersion:0} 172.16.95.45:{IPAddress:172.16.95.45 NCVersion:0} 172.16.95.46:{IPAddress:172.16.95.46 NCVersion:0} 172.16.95.47:{IPAddress:172.16.95.47 NCVersion:0} 172.16.95.48:{IPAddress:172.16.95.48 NCVersion:0} 172.16.95.49:{IPAddress:172.16.95.49 NCVersion:0} 172.16.95.5:{IPAddress:172.16.95.5 NCVersion:0} 172.16.95.50:{IPAddress:172.16.95.50 NCVersion:0} 172.16.95.51:{IPAddress:172.16.95.51 NCVersion:0} 172.16.95.52:{IPAddress:172.16.95.52 NCVersion:0} 172.16.95.53:{IPAddress:172.16.95.53 NCVersion:0} 172.16.95.54:{IPAddress:172.16.95.54 NCVersion:0} 172.16.95.55:{IPAddress:172.16.95.55 NCVersion:0} 172.16.95.56:{IPAddress:172.16.95.56 NCVersion:0} 172.16.95.57:{IPAddress:172.16.95.57 NCVersion:0} 172.16.95.58:{IPAddress:172.16.95.58 NCVersion:0} 172.16.95.59:{IPAddress:172.16.95.59 NCVersion:0} 172.16.95.6:{IPAddress:172.16.95.6 NCVersion:0} 172.16.95.60:{IPAddress:172.16.95.60 NCVersion:0} 172.16.95.61:{IPAddress:172.16.95.61 NCVersion:0} 172.16.95.62:{IPAddress:172.16.95.62 NCVersion:0} 
172.16.95.63:{IPAddress:172.16.95.63 NCVersion:0} 172.16.95.64:{IPAddress:172.16.95.64 NCVersion:0} 172.16.95.65:{IPAddress:172.16.95.65 NCVersion:0} 172.16.95.66:{IPAddress:172.16.95.66 NCVersion:0} 172.16.95.67:{IPAddress:172.16.95.67 NCVersion:0} 172.16.95.68:{IPAddress:172.16.95.68 NCVersion:0} 172.16.95.69:{IPAddress:172.16.95.69 NCVersion:0} 172.16.95.7:{IPAddress:172.16.95.7 NCVersion:0} 172.16.95.70:{IPAddress:172.16.95.70 NCVersion:0} 172.16.95.71:{IPAddress:172.16.95.71 NCVersion:0} 172.16.95.72:{IPAddress:172.16.95.72 NCVersion:0} 172.16.95.73:{IPAddress:172.16.95.73 NCVersion:0} 172.16.95.74:{IPAddress:172.16.95.74 NCVersion:0} 172.16.95.75:{IPAddress:172.16.95.75 NCVersion:0} 172.16.95.76:{IPAddress:172.16.95.76 NCVersion:0} 172.16.95.77:{IPAddress:172.16.95.77 NCVersion:0} 172.16.95.78:{IPAddress:172.16.95.78 NCVersion:0} 172.16.95.79:{IPAddress:172.16.95.79 NCVersion:0} 172.16.95.8:{IPAddress:172.16.95.8 NCVersion:0} 172.16.95.80:{IPAddress:172.16.95.80 NCVersion:0} 172.16.95.81:{IPAddress:172.16.95.81 NCVersion:0} 172.16.95.82:{IPAddress:172.16.95.82 NCVersion:0} 172.16.95.83:{IPAddress:172.16.95.83 NCVersion:0} 172.16.95.84:{IPAddress:172.16.95.84 NCVersion:0} 172.16.95.85:{IPAddress:172.16.95.85 NCVersion:0} 172.16.95.86:{IPAddress:172.16.95.86 NCVersion:0} 172.16.95.87:{IPAddress:172.16.95.87 NCVersion:0} 172.16.95.88:{IPAddress:172.16.95.88 NCVersion:0} 172.16.95.89:{IPAddress:172.16.95.89 NCVersion:0} 172.16.95.9:{IPAddress:172.16.95.9 NCVersion:0} 172.16.95.90:{IPAddress:172.16.95.90 NCVersion:0} 172.16.95.91:{IPAddress:172.16.95.91 NCVersion:0} 172.16.95.92:{IPAddress:172.16.95.92 NCVersion:0} 172.16.95.93:{IPAddress:172.16.95.93 NCVersion:0} 172.16.95.94:{IPAddress:172.16.95.94 NCVersion:0} 172.16.95.95:{IPAddress:172.16.95.95 NCVersion:0} 172.16.95.96:{IPAddress:172.16.95.96 NCVersion:0} 172.16.95.97:{IPAddress:172.16.95.97 NCVersion:0} 172.16.95.98:{IPAddress:172.16.95.98 NCVersion:0} 172.16.95.99:{IPAddress:172.16.95.99 NCVersion:0}] 
MultiTenancyInfo:{EncapType: ID:0} CnetAddressSpace:] Routes:] AllowHostToNCCommunication:false AllowNCToHostCommunication:false EndpointPolicies:] NCStatus: NetworkInterfaceInfo:{NICType: MACAddress:}} VfpUpdateComplete:false}] Networks:map] TimeStamp:2024-06-24 07:08:48.203505172 +0000 UTC joinedNetworks:map] primaryInterface:0xc000342050}
cns-container 2024/06/24 07:09:02 [1] [Azure CNS] Enter Restoring Network State
cns-container 2024/06/24 07:09:02 [1] [Azure CNS] Store timestamp is 2024-06-24 07:08:48.199602574 +0000 UTC.
cns-container 2024/06/24 07:09:02 [1] Failed to query uptime, err:exec: "uptime": executable file not found in $PATH
cns-container 2024/06/24 07:09:02 [1] [Utils] Initializing HTTP client with connection timeout: 5, response header timeout: 120
cns-container 2024/06/24 07:09:02 [1] SetContext details called with: KubernetesCRD orchestrator nodeID 
cns-container 2024/06/24 07:09:02 [1] [Azure CNS]  Listening.
cns-container 2024/06/24 07:09:02 [1] Acquiring process lock
cns-container 2024/06/24 07:09:02 [1] Acquired process lock with timeout value of 10s
cns-container 2024/06/24 07:09:02 [1] Released process lock
cns-container 2024/06/24 07:09:03 [1] Set GlobalPodInfoScheme 1 (InitializeFromCNI=true)
cns-container 2024/06/24 07:09:03 [1] [Azure CNS] setOrchestratorType
cns-container 2024/06/24 07:09:03 [1] SetContext details called with: KubernetesCRD orchestrator nodeID 
cns-container 2024/06/24 07:09:03 [1] [azure-cns] Sent cns.Response {ReturnCode:Success Message:}.
cns-container 2024/06/24 07:09:03 [1] Initializing from CNI
cns-container 2024/06/24 07:09:04 [1] Failed to start CRD Controller, err:failed to create CNI PodInfoProvider: failed to invoke CNI client.GetEndpointState(): failed to call Azure CNI bin with err: [exit status 1], output: [{
cns-container     "code": 11,
cns-container     "msg": "Failed to initialize key-value store of network plugin: error Acquiring store lock: processLock acquire error: lockedfile create error in lock: open /var/run/azure-vnet/azure-vnet.lock: permission denied"
cns-container }].
Stream closed EOF for kube-system/azure-cns-bk2ph (cns-container)
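
The relevant lines are easy to miss in a log of this size. A small filter like the following can pull out the start-up failure; it is a sketch, the function name is hypothetical, and the two patterns are simply taken from the log above.

```shell
# Hypothetical helper: read azure-cns logs on stdin and keep only the lines
# that show the CRD controller failing or a permission problem.
cns_startup_failures() {
  grep -E 'Failed to start CRD Controller|permission denied' || true
}

# Usage (pod name taken from the log above; --previous reads the crashed run):
#   kubectl logs -n kube-system azure-cns-bk2ph -c cns-container --previous \
#     | cns_startup_failures
```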

JoeyC-Dev commented 4 months ago

@mortenjoenby This really looks like an ongoing outage, because the error looks familiar. I am OOO right now, so I cannot compare. Check whether there is any resource health alert, or wait until the end of this week and see whether the issue resolves itself (a hotfix is being rolled out). If it does not, open a support ticket instead.

mortenjoenby commented 4 months ago

Hi @JoeyC-Dev. I actually did create a support ticket this morning, and we had a meeting with Azure support. We had quite a few health alerts on that cluster on Friday: CPU pressure and disk pressure. He gave us some recommendations: do not run user workloads in the system nodepool (we do!), add a taint to the system nodepool to keep user workloads off it, reconfigure requests and limits on pods that have a big difference between CPU requests and limits (we see this on system pods as well, though, so not sure about this ...), and consider upgrading the nodepool OS image.
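
As an illustration of the taint recommendation, AKS documentation describes the CriticalAddonsOnly=true:NoSchedule taint for dedicated system node pools. The sketch below applies it with the Azure CLI; the function name and resource names are placeholders, and the --node-taints option on update assumes a reasonably recent CLI version.

```shell
# Hypothetical helper: taint a system node pool so that only pods tolerating
# CriticalAddonsOnly are scheduled on it.
# Arguments: resource group, cluster name, system node pool name.
taint_system_nodepool() {
  local rg="$1" cluster="$2" pool="$3"
  az aks nodepool update \
    --resource-group "$rg" --cluster-name "$cluster" --name "$pool" \
    --node-taints CriticalAddonsOnly=true:NoSchedule
}

# Usage (placeholder names):
#   taint_system_nodepool my-rg my-aks systempool
```

Pods already running on the pool are not evicted by a NoSchedule taint, so existing user workloads would still need to be moved separately.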

This is all good, and we will look into that, but can you add some details on the hotfix that you mention? What will it fix?

JoeyC-Dev commented 4 months ago

@mortenjoenby I compared the error from the outage with the one you provided, and they look exactly the same. This is all the info that has been made public: [image]

But since you did not get a Service Health alert, I wonder whether your issue is really related to this outage. IMO, check whether the error pops up again by the end of the week. If it still persists, consider the error unrelated to the outage.

mortenjoenby commented 4 months ago

Hi @JoeyC-Dev. Apologies, I do see that Service Health alert! I was only looking at resource health on the AKS cluster yesterday. Let's see if the problem goes away :) Thanks.

kevinharing commented 3 months ago

Still seeing pods wait a long time in the ContainerCreating state, waiting for what I assume is an IP address. That is on newly created nodes.

kevinharing commented 2 months ago

FYI: Log messages still pop up on AKS 1.29.7 and containers are spending a few minutes in ContainerCreating state when the pod is deployed to a newly spun up node.
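
To see which pods are currently stuck and on which nodes, the pod list can be filtered by status. This is a sketch; the function name is hypothetical, and it assumes the default column layout of kubectl get pods --all-namespaces, where STATUS is the fourth column.

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces -o wide` output
# on stdin and keep the header plus rows whose STATUS is ContainerCreating.
pods_container_creating() {
  awk 'NR == 1 || $4 == "ContainerCreating"'
}

# Usage: kubectl get pods --all-namespaces -o wide | pods_container_creating
```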

microsoft-github-policy-service[bot] commented 1 month ago

Issue needing attention of @Azure/aks-leads