Bug: Reconciling VirtualNetworksSubnet fails with "Request entity too large: limit is 3145728"

danilo404 commented 1 week ago

Describe the bug

The bug manifests on our cluster created with the following networking parameters:

az aks show --subscription ExampleSubscription -n example-cluster-name -g example-cluster-name-rg -o table --query networkProfile

NetworkPlugin    NetworkPolicy    NetworkDataplane    ServiceCidr    DnsServiceIp    OutboundType    LoadBalancerSku    PodLinkLocalAccess
---------------  ---------------  ------------------  -------------  --------------  --------------  -----------------  --------------------
azure            azure            azure               10.100.0.0/16  10.100.0.10     loadBalancer    standard           IMDS

And it has 20 Agent Pools, with the following sizes:

az aks show --subscription ExampleSubscription -n example-cluster-name -g example-cluster-name-rg -o table --query "agentPoolProfiles[].{Count: count, maxCount: maxCount, maxPods: maxPods}"
Count    MaxCount    MaxPods
-------  ----------  ---------
0        3           20
5        7           150
0        3           80
2        5           110
0        50          100
36       60          100
27       100         100
12       20          100
11       33          110
1        4           110
3        8           80
0        0           100
5        10          100
2        7           100
4        30          100
5        30          100
0        3           20
15       30          100
3        3           20
2        7           80
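
For a rough sense of why the subnet accumulates so many entries, here is a small Python sketch. It assumes (this is not stated in the thread) that the cluster uses node-subnet Azure CNI, where each node reserves one address for itself plus maxPods addresses for pods. The function name is hypothetical; the pool sizes are copied from the table above.

```python
# (count, max_count, max_pods) per agent pool, copied from the table above.
POOLS = [
    (0, 3, 20), (5, 7, 150), (0, 3, 80), (2, 5, 110), (0, 50, 100),
    (36, 60, 100), (27, 100, 100), (12, 20, 100), (11, 33, 110), (1, 4, 110),
    (3, 8, 80), (0, 0, 100), (5, 10, 100), (2, 7, 100), (4, 30, 100),
    (5, 30, 100), (0, 3, 20), (15, 30, 100), (3, 3, 20), (2, 7, 80),
]


def reserved_ips(node_count: int, max_pods: int) -> int:
    """IPs reserved by one pool, ASSUMING each node takes maxPods + 1 addresses."""
    return node_count * (max_pods + 1)


# Addresses reserved at the current node counts.
current = sum(reserved_ips(count, max_pods) for count, _, max_pods in POOLS)

# Addresses reserved if every pool scaled to its autoscaler maximum.
at_max_scale = sum(reserved_ips(max_count, max_pods) for _, max_count, max_pods in POOLS)

print(current, at_max_scale)
```

Under that assumption, the current node counts account for roughly 13.5k addresses, which is in the same range as the 14006 ipConfigurations reported below. The autoscaler maximums would allow about three times as many.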

CAPZ created a VirtualNetworksSubnet ASO CR for that cluster with the following configuration:

az network vnet subnet show --ids "example/subnet/id" -o table --query "{addressPrefix: addressPrefix, privateEndpointNetworkPolicies: privateEndpointNetworkPolicies, privateLinkServiceNetworkPolicies: privateLinkServiceNetworkPolicies}"

AddressPrefix    PrivateEndpointNetworkPolicies    PrivateLinkServiceNetworkPolicies
---------------  --------------------------------  -----------------------------------
10.0.0.0/16      Disabled                          Enabled

When the AgentPools reach somewhere close to the "counts" above, the VirtualNetworksSubnet object in Azure grows to around 5.6 MB, as it fills up with thousands of entries in the ipConfigurations field:

az network vnet subnet show --ids /subscriptions/.../subnets/example-cluster-subnet > example-cluster-subnet.json
ls -lh example-cluster-subnet.json
-rw-r--r--@ 1 danilo.uipath  staff   5.6M Nov  4 12:37 example-cluster-subnet.json
cat example-cluster-subnet.json| jq '.ipConfigurations | length'
14006
cat example-cluster-subnet.json| jq '.ipConfigurations[0].id | length'
305
cat example-cluster-subnet.json| jq '.ipConfigurations[0].resourceGroup | length'
60
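
A back-of-the-envelope estimate from these numbers, sketched in Python. It assumes that the 5.6M reported by ls -lh means 5.6 MiB and that the file is dominated by ipConfigurations entries. The pretty-printed az output is larger than the compact JSON a client would send to the API server, so the result is only a rough lower bound on how many entries fit.

```python
# ASSUMPTION: "5.6M" from ls -lh is read as 5.6 MiB.
FILE_SIZE_BYTES = int(5.6 * 1024 * 1024)
ENTRY_COUNT = 14006                # from jq '.ipConfigurations | length'
APISERVER_LIMIT_BYTES = 3145728    # limit quoted in the error message

# Average size of one ipConfigurations entry in the pretty-printed file.
bytes_per_entry = FILE_SIZE_BYTES / ENTRY_COUNT

# Rough number of such entries that would fit under the quoted limit.
entries_under_limit = int(APISERVER_LIMIT_BYTES // bytes_per_entry)

print(round(bytes_per_entry), entries_under_limit)
```

At about 400 bytes per entry, only around half of the 14006 entries would fit under the quoted limit, before counting the rest of the object.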

ASO then tries to persist the ipConfigurations into the VirtualNetworksSubnet CR's status, which causes the API server to return:

E1107 10:21:00.621890       1 generic_reconciler.go:143] "msg"="Failed to commit object to etcd" "error"="updating example-ns/example-cluster-name-vnet-example-cluster-name-subnet resource: Request entity too large: limit is 3145728" "logger"="controllers.VirtualNetworksSubnetController" "name"="example-cluster-name-example-cluster-name-subnet" "namespace"="example-ns"

Azure Service Operator Version: v2.8.0

Expected behavior

The VirtualNetworksSubnet should continue reconciling successfully at any size my Agent Pools can scale to.

To Reproduce

Create a VirtualNetworksSubnet CR for an Azure Cloud Subnet with a large number of ipConfigurations and wait for the controller to attempt to sync it.

Additional context

This issue relates to another issue in the CAPZ project https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/4649

matthchr commented 2 days ago

Can you share what the spec for the subnet looks like, as managed by CAPZ?

matthchr commented 2 days ago

I think the issue we've got here is the fact that there are 14k entries for the ipConfigurations field (which Azure allows), but at some point you cross the Kubernetes boundary for max resource size.

I believe Azure also has a max resource size boundary, but I think it's 4 MB, not the 1.5 MB which AFAIK is the default on Kubernetes.

danilo404 commented 2 days ago

Can you share what the spec for the subnet looks like, as managed by CAPZ?

AMCP resource:

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedControlPlane
spec:
  virtualNetwork:
    cidrBlock: 10.0.0.0/16
    name: example-cluster-vnet
    resourceGroup: example-cluster-rg
    subnet:
      cidrBlock: 10.0.0.0/16
      name: example-cluster-subnet
      serviceEndpoints:
        - locations:
            - '*'
          service: Microsoft.Sql
        - locations:
            - '*'
          service: Microsoft.KeyVault
        - locations:
            - '*'
          service: Microsoft.Storage
        - locations:
            - '*'
          service: Microsoft.AzureCosmosDB
        - locations:
            - '*'
          service: Microsoft.ServiceBus
        - locations:
            - '*'
          service: Microsoft.EventHub

And the Subnet it creates:

apiVersion: network.azure.com/v1api20201101
kind: VirtualNetworksSubnet
spec:
  addressPrefix: 10.0.0.0/16
  addressPrefixes:
  - 10.0.0.0/16
  azureName: example-cluster-subnet
  owner:
    name: example-cluster-vnet
  serviceEndpoints:
  - locations:
    - '*'
    service: Microsoft.Sql
  - locations:
    - '*'
    service: Microsoft.KeyVault
  - locations:
    - '*'
    service: Microsoft.Storage
  - locations:
    - '*'
    service: Microsoft.AzureCosmosDB
  - locations:
    - '*'
    service: Microsoft.ServiceBus
  - locations:
    - '*'
    service: Microsoft.EventHub

matthchr commented 1 day ago

I looked at this some more and I think this comes down to a mismatch between the allowed max size of an Azure resource (which is I think somewhere in the 4mb range) and the allowed max size of a Kubernetes resource, which is ~1.5mb.

Since we fundamentally cannot fit this much data into etcd, there's not really much we can do here other than elide the .status.ipConfigurations after some maximum length. The only thing that makes me feel any better about that is the fact that it's probably not practically possible to really use a list of 14000 ipConfiguration ARM IDs for anything anyway.
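
To illustrate the idea of eliding the list past some maximum length, here is a minimal Python sketch. It is not ASO's implementation (ASO is written in Go). The function name, the cap of 2000 entries, and the placeholder IDs are all hypothetical and chosen only for illustration.

```python
import json
from typing import Any


def elide_ip_configurations(status: dict[str, Any], max_entries: int) -> dict[str, Any]:
    """Return a shallow copy of status with ipConfigurations capped at max_entries."""
    trimmed = dict(status)
    configs = status.get("ipConfigurations") or []
    if len(configs) > max_entries:
        # Keep only the first max_entries items; the rest are dropped.
        trimmed["ipConfigurations"] = configs[:max_entries]
    return trimmed


# Synthetic status with 14006 placeholder IDs (NOT real ARM IDs).
status = {
    "ipConfigurations": [
        {"id": f"/hypothetical/ipConfigurations/{i}"} for i in range(14006)
    ]
}

# HYPOTHETICAL cap; the thread does not say what maximum would be used.
capped = elide_ip_configurations(status, max_entries=2000)
print(len(capped["ipConfigurations"]), len(json.dumps(capped)))
```

The input is left unmodified, so a caller could still compare the original length against the cap and record that the list was truncated.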

@nojnhuh - is CAPZ using .status.ipConfigurations for anything right now?

nojnhuh commented 1 day ago

@nojnhuh - is CAPZ using .status.ipConfigurations for anything right now?

It is not, so however you handle that should work for CAPZ.

danilo404 commented 1 day ago

Hey @matthchr, thanks so much for looking into this. Regarding the etcd limit, the problem seems to manifest in different ways depending on the size of the object in Azure. Note that in the original ticket I opened in CAPZ, the error was different and it came from etcd:

E0315 17:13:54.206966       1 controller.go:329] "msg"="Reconciler error" "error"="updating mynamespace/examplecluster-vnet-examplecluster-subnet resource status: etcdserver: request is too large" "logger"="controllers" "name"="examplecluster-vnet-examplecluster-subnet" "namespace"="examplenamespace" "reconcileID"="..."

Also note that in that case the Subnet was not as large: when the error was observed, the subnet size was around 2.9 MB.

Now the subnet object in Azure has reached around 5.6 MB and the error seems to come from the Kubernetes API server itself. This limit is hardcoded in more than one place, e.g. here.

I think in this case the object did not reach etcd.
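
The two failure modes can be sketched as a simple size check in Python. The 3145728-byte limit comes from the API server error above. The 1572864-byte value is etcd's default --max-request-bytes, and a given cluster may be configured differently. Using the Azure object size as a stand-in for the actual request size is an approximation, and the function name is hypothetical.

```python
ETCD_DEFAULT_LIMIT_BYTES = 1572864   # 1.5 MiB, etcd's default --max-request-bytes
APISERVER_LIMIT_BYTES = 3145728      # 3 MiB, the limit quoted in the apiserver error


def expected_rejection(request_bytes: int) -> str:
    """Guess which layer rejects a request of the given size (illustrative only)."""
    if request_bytes > APISERVER_LIMIT_BYTES:
        # Rejected by the API server before the write reaches etcd.
        return "apiserver: Request entity too large"
    if request_bytes > ETCD_DEFAULT_LIMIT_BYTES:
        # Passes the API server check but exceeds etcd's request limit.
        return "etcdserver: request is too large"
    return "accepted"


# ASSUMPTION: Azure object sizes used as a proxy for the request size.
print(expected_rejection(int(2.9 * 1024 * 1024)))
print(expected_rejection(int(5.6 * 1024 * 1024)))
```

With those inputs, the 2.9 MB case lands in the etcd branch and the 5.6 MB case lands in the apiserver branch, matching the two errors quoted in this thread.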

matthchr commented 1 day ago

Thanks @danilo404 - I suppose a more precise phrasing of the problem is not so much etcd but: Azure allows larger resources than Kubernetes. I think once the etcd limit is crossed it won't work in k8s, though I didn't know about the hardcoded apiserver limit that ends up giving a different error if the request gets large enough.

matthchr commented 1 day ago

In terms of a plan to fix this, it didn't make 2.11.0 (which has already shipped). I think we can try getting a fix merged before most of us go on holiday, which could enable consumption of the fix via the experimental release, but an official release will probably need to wait until next year. There's also the added wrinkle of CAPZ using a slightly older version of ASO, which may delay uptake in vanilla CAPZ as well.

Unfortunately I don't really see a workaround for this problem other than "keep the cluster small" in the meantime, though possibly this issue isn't actually breaking things severely if CAPZ isn't trying to update the subnet?

Can you share what the impact is to you @danilo404, and if you have any workaround to it currently?

danilo404 commented 7 hours ago

Thanks for the update @matthchr. We don't have workarounds for this case, but the impact for now is not blocking. What happens now is that the CAPZ object AzureManagedControlPlane reconcile loop tries to sync the Subnetwork's status (even without changes to the spec) and the CAPI/CAPZ Cluster stays in a Failed state in Kubernetes, but the cluster itself in Azure is healthy. In any case the experimental release would be really useful, because the AMCP in 'failed' state causes other headaches, like the Flux orchestration that is unable to progress, and related alerts' silencing etc.
