david-garcia-garcia opened 1 day ago
Running this script:
if (-not("dummy" -as [type])) {
add-type -TypeDefinition @"
using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;
public static class Dummy {
public static bool ReturnTrue(object sender,
X509Certificate certificate,
X509Chain chain,
SslPolicyErrors sslPolicyErrors) { return true; }
public static RemoteCertificateValidationCallback GetDelegate() {
return new RemoteCertificateValidationCallback(Dummy.ReturnTrue);
}
}
"@
}
[System.Net.ServicePointManager]::ServerCertificateValidationCallback = [dummy]::GetDelegate()
$Token = Get-Content -Path "C:\var\run\secrets\kubernetes.io\serviceaccount\token"
Invoke-RestMethod -Uri "https://$env:KUBERNETES_SERVICE_HOST/apis/acn.azure.com/v1alpha/namespaces/kube-system/nodenetworkconfigs" -Headers @{Authorization = "Bearer $Token"} -Method Get
I am getting:
apiVersion items
---------- -----
acn.azure.com/v1alpha {@{apiVersion=acn.azure.com/v1alpha; kind=NodeNetworkConfig; metadata=; spec...
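Side note: if PowerShell 7 happens to be available in the pod, the Dummy callback above can be replaced with the built-in -SkipCertificateCheck switch of Invoke-RestMethod; a minimal equivalent of the same call:

# PowerShell 7+ only: -SkipCertificateCheck replaces the Add-Type/ServicePointManager workaround.
$Token = Get-Content -Path "C:\var\run\secrets\kubernetes.io\serviceaccount\token"
Invoke-RestMethod -SkipCertificateCheck `
    -Uri "https://$env:KUBERNETES_SERVICE_HOST/apis/acn.azure.com/v1alpha/namespaces/kube-system/nodenetworkconfigs" `
    -Headers @{ Authorization = "Bearer $Token" } -Method Get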
I see the service account token is configured to rotate approximately every hour:
- name: kube-api-access-lj2hx
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 3607
        path: token
    - configMap:
        items:
        - key: ca.crt
          path: ca.crt
        name: kube-root-ca.crt
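To double-check the actual rotation window, here is a minimal sketch (assuming the standard Windows mount path used in the script above) that decodes the mounted token's JWT payload and prints its iat/exp claims:

# Decode the projected service account token's JWT payload to read iat/exp.
$tokenPath = "C:\var\run\secrets\kubernetes.io\serviceaccount\token"
$token = (Get-Content -Path $tokenPath -Raw).Trim()

# The payload is the second dot-separated segment, base64url encoded; re-pad before decoding.
$payload = $token.Split('.')[1].Replace('-', '+').Replace('_', '/')
switch ($payload.Length % 4) { 2 { $payload += '==' } 3 { $payload += '=' } }

$claims = [System.Text.Encoding]::UTF8.GetString([Convert]::FromBase64String($payload)) | ConvertFrom-Json
"Issued:  " + [DateTimeOffset]::FromUnixTimeSeconds($claims.iat).UtcDateTime
"Expires: " + [DateTimeOffset]::FromUnixTimeSeconds($claims.exp).UtcDateTime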
My impression is that the service account token is not being renewed for the pods residing on the Windows node, so the issue is probably not in the CNI itself.
I can confirm that when the pod starts failing to auth, the token inside it has in fact been properly renewed. My first thought is that, for whatever reason, the CNS pod has a quirk in its Windows implementation (or in an upstream library) and does not reload the token, continuing to use the original one injected when the pod started, which is now stale.
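For anyone who wants to reproduce that observation, a small sketch that can be run inside any pod on the Windows node that mounts a projected service account token (path assumed to be the standard one); it simply logs each time kubelet swaps the token file, which can then be correlated with the moment CNS starts failing to auth:

# Poll the mounted token file and report whenever kubelet swaps in a new one.
$tokenPath = "C:\var\run\secrets\kubernetes.io\serviceaccount\token"
$lastHash = ""
while ($true) {
    $hash = (Get-FileHash -Path $tokenPath -Algorithm SHA256).Hash
    if ($hash -ne $lastHash) {
        "{0}  token changed (sha256 {1}...)" -f (Get-Date -Format o), $hash.Substring(0, 12)
        $lastHash = $hash
    }
    Start-Sleep -Seconds 60
}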
On the Windows pod I am also getting this after recreating the pod:
{"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"} │
│ {"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"} │
│ {"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"} │
│ {"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"} │
│ {"level":"info","ts":"2024-12-04T08:40:37.519Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}
Describe the bug
Using image: mcr.microsoft.com/containernetworking/azure-cns:v1.6.13
The Azure CNS pods that run on Windows nodes work for a limited amount of time, then lose their connection to the Kubernetes API and start issuing "Failed to list" and "Failed to watch" errors.
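To watch the errors appear, the Windows CNS logs can be tailed like this (the daemonset name azure-cns-win is an assumption; adjust it to whatever the first command shows in your cluster):

# List the CNS pods, then tail the Windows daemonset (name assumed; verify with the first command).
kubectl -n kube-system get pods -o wide | Select-String cns
kubectl -n kube-system logs daemonset/azure-cns-win --tail=100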
To Reproduce
Set up a cluster with a Pod Subnet and place pods and nodes in different subnets. Make sure you add a Windows node pool.
After several minutes of working, the CNS pods on the Windows nodes start to fail. I presume the impact of this is that networking is not updated when pods are rescheduled on the Windows nodes.
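Roughly, the cluster looks like this (a sketch following the standard Azure CNI pod-subnet setup; resource names, region, CIDRs and the Windows password are placeholders):

# Cluster with Azure CNI and a dedicated pod subnet, plus a Windows node pool.
$rg = "cns-repro-rg"
az group create --name $rg --location westeurope
az network vnet create --resource-group $rg --name repro-vnet `
    --address-prefixes 10.0.0.0/8 `
    --subnet-name nodesubnet --subnet-prefixes 10.240.0.0/16
az network vnet subnet create --resource-group $rg --vnet-name repro-vnet `
    --name podsubnet --address-prefixes 10.241.0.0/16
$nodeSubnet = az network vnet subnet show --resource-group $rg --vnet-name repro-vnet --name nodesubnet --query id -o tsv
$podSubnet  = az network vnet subnet show --resource-group $rg --vnet-name repro-vnet --name podsubnet --query id -o tsv
az aks create --resource-group $rg --name cns-repro --network-plugin azure `
    --vnet-subnet-id $nodeSubnet --pod-subnet-id $podSubnet `
    --windows-admin-username azureuser --windows-admin-password '<replace-me>' `
    --generate-ssh-keys
az aks nodepool add --resource-group $rg --cluster-name cns-repro --name npwin `
    --os-type Windows --vnet-subnet-id $nodeSubnet --pod-subnet-id $podSubnet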
Expected behavior
No API authentication errors.