Closed: fozturner closed this issue 1 month ago
@wedaly
Additional info as requested in the issue:
k8s version: 1.27.9
networkPlugin: "Azure"
networkPluginMode: "null"
podSubnetId (if any): listed below
We have one subnet for the app pods and one for the system pods; these are delegated to Microsoft.ContainerService/managedClusters:
"/subscriptions/{redacted}/resourceGroups/{redacted}/providers/Microsoft.Network/virtualNetworks/{redacted}/subnets/akssyspod-uat-uks-snet"
"/subscriptions/{redacted}/resourceGroups/{redacted}/providers/Microsoft.Network/virtualNetworks/{redacted}/subnets/aksapppod-uat-uks-snet"
I see that you have 15-azure-swift.conflist, so I think everything worked as expected from our end here. What specifically is the issue with 10-azure.conflist not being present? The CRI will load the conflist correctly regardless of name, and networking should be functional here.
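To illustrate "regardless of name": containerd's CRI plugin scans its configured CNI conf directory and uses the first valid conflist it finds there in lexical order; it is not pinned to a particular filename. A quick way to see what a given node actually has (sketch only; the node name is a placeholder):
# list the CNI configs on the node; the host root is mounted at /host in the debug pod
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0 -- ls /host/etc/cni/net.d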
The issue is that CNIs like Kuma mesh need to know which CNI config file to chain to. If the file is not present they fail, and these conflist files differ between our clusters: on one cluster the file is 10-azure.conflist, while another simply has 15-azure-swift.conflist, despite using the same config and Azure CNI version. The only change was updating the node, so we are trying to understand whether the behaviour changed and which component changed to cause this.
The only thing from our pov that changed between environments was an update to the image used by the node.
If we need to configure Kuma to use the swift conflist we can change this config, but we would like to understand what caused the change and what controls it, mainly so we can pre-empt issues in the future by checking release notes etc.
The relevant docs from Kong/Kuma: https://docs.konghq.com/mesh/latest/production/dp-config/cni/
kumactl install control-plane \
--set "kuma.cni.enabled=true" \
--set "kuma.cni.chained=true" \
--set "kuma.cni.netDir=/etc/cni/net.d" \
--set "kuma.cni.binDir=/opt/cni/bin" \
--set "kuma.cni.confName=10-azure.conflist" \
| kubectl apply -f -
@james-bjss I don't think AKS makes any guarantees of support for CNI chaining, but if that's documented somewhere, point me to it. The conflist name change doesn't break AKS networking, there isn't a contract being violated here between CNI and the CRI, and it's not a bug.
You may be able to mitigate by changing the configuration for the conflist name in your other CNI. Note that the AzCNI conflist may vary between AKS versions, base images, and AKS network modes, e.g. it is 10-azure.conflist for node subnet but 15-azure-overlay.conflist for Overlay mode. Also note that if you are mutating the conflist, AKS may reconcile it back to the target state periodically, so this may not be a viable long-term solution, depending on how your chaining plugin handles that.
Thanks @rbtr. I suppose we were trying to ascertain whether this was expected behaviour or not. Now we know that there is no guarantee that this conflist will be present, I guess the next thing to do is to report it to the Kuma project and see what they advise.
FWIW their docs for GKE basically say to do this (check what the conflist is named, and use that name):
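Roughly along these lines (a sketch, not their exact GKE snippet; the node name is a placeholder and the confName value should be whatever the listing on your cluster actually shows):
# 1. check which conflist the node actually has
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0 -- ls /host/etc/cni/net.d
# 2. pass the observed name to the Kuma install, e.g. on a Dynamic Pod Subnet cluster
kumactl install control-plane \
--set "kuma.cni.enabled=true" \
--set "kuma.cni.chained=true" \
--set "kuma.cni.netDir=/etc/cni/net.d" \
--set "kuma.cni.binDir=/opt/cni/bin" \
--set "kuma.cni.confName=15-azure-swift.conflist" \
| kubectl apply -f -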
I have some additional information on this.
Firstly, if we do use the 15-azure-swift.conflist file in our service mesh, the pods fall into a CrashLoopBackOff.
But after some further testing it appears that if I create a new cluster and do NOT set the --pod-subnet-id, then the nodes do have the 10-azure.conflist file present. If I do set the --pod-subnet-id, then the nodes have the 15-azure-swift.conflist file present instead.
It appears that setting the --pod-subnet-id sets configuration to deploy the azure-cns daemonset.
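One way to sanity-check that (sketch; run it against both clusters and compare):
# list the AKS-managed daemonsets; a Dynamic Pod Subnet cluster should additionally show azure-cns here
kubectl get daemonsets -n kube-system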
The clusters we are running have had the --pod-subnet-id set since day 1 and have been running quite happily, so whatever has/is being implemented for the "swift" configuration is breaking something.
With everything else equal, this flag controls whether the cluster is provisioned as legacy Node Subnet (without) or Dynamic Pod Subnet (with). There will be other effects besides the different conflist path; for example, Node Subnet mode permanently reserves MaxPods (default: 30) IPs out of the subnet per Node.
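To put rough numbers on that: with the default MaxPods of 30, each node in Node Subnet mode takes its own NIC address plus 30 reserved pod addresses, i.e. roughly 31 IPs per node, so a /24 node subnet (251 usable addresses after Azure's 5 reserved) tops out at around 8 nodes.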
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
Issue needing attention of @Azure/aks-leads
This issue will now be closed because it hasn't had any activity for 7 days after going stale. fozturner, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.
Describe the bug
Following updating AKS node images to version AKSCBLMariner-V2gen2-202405.20.0 we have noticed that there is a CNI conflist file /etc/cni/net.d/10-azure.conflist that is now missing from the new nodes. This issue is present on the even newer AKSCBLMariner-V2gen2-202405.27.0 node image too.

To Reproduce
Steps to reproduce the behavior:
Deploy an AKS cluster with CNI enabled; for context, I used this command:
az aks create --resource-group $resourcegroup --name $aksclustername --outbound-type userAssignedNATGateway --aad-tenant-id <tenant_id> --enable-aad --enable-azure-rbac --enable-oidc-issuer --enable-workload-identity --max-pods 50 --network-plugin azure --node-count 2 --node-vm-size Standard_D2s_v3 --os-sku AzureLinux --pod-subnet-id <system_pod_subnet_id> --vnet-subnet-id <system_node_subnet_id> --api-server-authorized-ip-ranges <ips-to-whitelist> --tier free --dns-service-ip 10.0.0.10 --kubernetes-version "1.27.9"
Log into node
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
Navigate to the CNI directory and list its contents (example commands after these steps)
10-azure.conflist file not present.
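For step 3, the host filesystem is mounted under /host in the debug pod, so from inside it something like:
# run inside the kubectl debug pod created in the previous step
ls -la /host/etc/cni/net.d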
Expected behavior
I would expect there to be a 10-azure.conflist file present in the /host/etc/cni/net.d directory.

Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
Additional context
Other clusters running old node versions still have the 10-azure.conflist file present. Region: UK South
Related to Azure/azure-container-networking/issues/2779 and Azure/AgentBaker/issues/4499