Azure / acs-engine

WE HAVE MOVED: Please join us at Azure/aks-engine!
https://github.com/Azure/aks-engine

Upgrade results in node with 111 IPs #2668

Closed: EPinci closed this issue 5 years ago

EPinci commented 6 years ago

Is this a request for help?: Yes


Is this an ISSUE or FEATURE REQUEST? ISSUE


What version of acs-engine?: v15.1


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes, from v1.9.3 to v1.9.6

What happened:

I configured the cluster with Azure CNI and ipAddressCount set to 20. During an upgrade run, nodes get torn down and rebuilt. The original node had 20 IPs as expected, but the new node has 111 IPs, resulting in quick subnet exhaustion.
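
For reference, the per-NIC IP configuration count can be checked with the Azure CLI; the NIC name below is a placeholder for whatever the rebuilt node's NIC is actually called:

az network nic ip-config list -g <<RGNAME>> --nic-name <<NODE-NIC-NAME>> --query "length(@)" -o tsv

This is where the 20 vs. 111 difference described above shows up.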

What you expected to happen:

The original node had 20 IPs; the rebuilt node should have the same number.

How to reproduce it (as minimally and precisely as possible):

Deploy a cluster with Azure CNI and ipAddressCount set to 20. Mine had 3 masters (with 20 IPs as well) and 3 nodes (with the standard 30 IPs).

Anything else we need to know:

EPinci commented 6 years ago

@jackfrancis I filed this to track the issue found while testing #2650. Any idea where I can start looking?

jackfrancis commented 6 years ago

@EPinci will try to repro

jackfrancis commented 6 years ago

Running a deployment, then a series of upgrades against this api model:

{   "apiVersion": "vlabs",   "properties": {     "orchestratorProfile": {       "orchestratorType": "Kubernetes",       "orchestratorVersion": "1.7.0"     },     "masterProfile": {       "count": 1,       "dnsPrefix": "",       "vmSize": "Standard_D2_v2"     },     "agentPoolProfiles": [       {         "name": "agentpool1",         "count": 2,         "vmSize": "Standard_D2_v2",         "availabilityProfile": "AvailabilitySet",         "storageProfile" : "ManagedDisks"       }     ],     "linuxProfile": {       "adminUsername": "azureuser",       "ssh": {         "publicKeys": [           {             "keyData": ""           }         ]       }     },     "servicePrincipalProfile": {       "clientId": "",       "secret": ""     }   } }

After initial deployment:

$ az network vnet show -n k8s-vnet-24809053 -g kubernetes-ukwest-78035 | grep 'networkInterfaces' | wc -l
      94
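
As an aside, a similar count can be taken with a JMESPath query instead of grep, assuming the subnets' ipConfigurations references are populated in the vnet show output:

az network vnet show -n k8s-vnet-24809053 -g kubernetes-ukwest-78035 --query "subnets[].ipConfigurations[] | length(@)" -o tsv
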
jackfrancis commented 6 years ago

Holding steady after the 1st upgrade (from 1.7.0 to 1.7.12):

$ az network vnet show -n k8s-vnet-24809053 -g kubernetes-ukwest-78035 | grep 'networkInterfaces' | wc -l
      94
EPinci commented 6 years ago

Weird, I tried this multiple times and always got the same result (even the same IP count!). In my case I had 3 masters and 3 nodes, and then manually deleted two node VMs from the portal.

Can you try deleting one of the two nodes to see if that has an impact? That is exactly my scenario, where the original node count gets changed outside of ACS-Engine (e.g. by the autoscaler).

Do you want me to send you my api model?

jackfrancis commented 6 years ago

Let's let my test keep running (there are 12 more upgrades to go). I'm not saying for sure that we can't repro yet. :)

jackfrancis commented 6 years ago

I take it back, I've been unable to repro. Yeah, please paste in the api model you're seeing this behavior on post-upgrade, and we'll repro using it as exactly as possible. Thanks!

EPinci commented 6 years ago

OK, since I don't know what is actually relevant, here is the entire process I'm using to replicate my production upgrade.

On an empty resource group, deploy a local VNet (nothing fancy, just three /24 subnets):

call az network vnet create -g <<RGNAME>> -n K8sVNet --address-prefix 10.24.0.0/16

call az network vnet subnet create -g <<RGNAME>> --vnet-name K8sVNet -n master --address-prefix 10.24.250.0/24
call az network vnet subnet create -g <<RGNAME>> --vnet-name K8sVNet -n frontend --address-prefix 10.24.1.0/24
call az network vnet subnet create -g <<RGNAME>> --vnet-name K8sVNet -n backend --address-prefix 10.24.2.0/24
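
As a quick sanity check before compiling the api model (not part of the original steps), the subnet layout can be listed:

call az network vnet subnet list -g <<RGNAME>> --vnet-name K8sVNet -o table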

Compile the following api model with ACS-Engine 13.1 (not sure if the binary version is relevant, but it is the same version I used for my current production cluster):

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.9",
      "kubernetesConfig": {
        "addons": [
            {
                "name": "tiller",
                "enabled" : false
            }
        ]
      }
    },
    "aadProfile": {
      "serverAppID": "<<REMOVED>>",
      "clientAppID": "<<REMOVED>>",
      "tenantID": "<<REMOVED>>",
      "adminGroupID": "<<REMOVED>>"
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "cluster-dev",
      "vmSize": "Standard_A1_v2",
      "storageProfile" : "ManagedDisks",
      "OSDiskSizeGB": 128,
      "firstConsecutiveStaticIP": "10.24.250.230",
      "ipAddressCount": 20,
      "vnetCidr": "10.24.0.0/16",
      "vnetSubnetId": "/subscriptions/<<REMOVED>>/resourceGroups/<<REMOVED>>/providers/Microsoft.Network/virtualNetworks/K8sVNet/subnets/master"
  },
    "agentPoolProfiles": [
      {
        "name": "nodepool1",
        "count": 3,
        "vmSize": "Standard_A2_v2",
        "storageProfile" : "ManagedDisks",
        "OSDiskSizeGB": 128,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/<<REMOVED>>/resourceGroups/<<REMOVED>>/providers/Microsoft.Network/virtualNetworks/K8sVNet/subnets/frontend"       
      }
    ],
    "linuxProfile": {
      "adminUsername": "clusteradm",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "<<REMOVED>>"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "<<REMOVED>>",
      "secret": "<<REMOVED>>"
    }
  }
}
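
The compile step corresponds to acs-engine generate; a minimal sketch, assuming the api model above is saved as cluster-dev.json (the generated ARM template and parameters land in _output\cluster-dev, matching the paths used below):

acs-engine generate cluster-dev.json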

Then I deploy it:

az group deployment create -g <<RGNAME>> -n "cluster-dev" --template-file ".\_output\cluster-dev\azuredeploy.json" --parameters ".\_output\cluster-dev\azuredeploy.parameters.json"

This results in a 1.9.3 cluster with three masters and three agents.

I then delete the last two agents from the Azure portal to simulate a node count change that ACS-Engine is not aware of, such as one made by the cluster autoscaler. I also manually clean up the OS disks and NICs and verify that the agents are no longer listed in kubectl get nodes.
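
For completeness, a sketch of that manual cleanup with the Azure CLI; the VM, disk, and NIC names are placeholders for whatever the portal shows for the two deleted agents:

call az vm delete -g <<RGNAME>> -n <<AGENT-VM-NAME>> --yes
call az disk delete -g <<RGNAME>> -n <<AGENT-OSDISK-NAME>> --yes
call az network nic delete -g <<RGNAME>> -n <<AGENT-NIC-NAME>>

kubectl get nodes should then show only the remaining agent and the three masters.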

Run the upgrade with the current ACS-Engine:

acs-engine upgrade --subscription-id <<REMOVED>> ^
 --resource-group <<RGNAME>> --location westeurope ^
 --auth-method client_secret --client-id <<REMOVED>> --client-secret <<REMOVED>> ^
 --deployment-dir _output\cluster-dev --upgrade-version 1.9.6

The upgrade deletes the first master VM and redeploys it. After that, the current build stops due to #2560 / #2061, but the redeployed node already has 111 IPs.

If I run a custom ACS-Engine build from HEAD with the small patch from #2061, the upgrade continues with master 2 but then fails on master 3 because the subnet is full (3 x 111 is more than a /24 can hold).
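
For reference, assuming Azure's standard reservation of 5 addresses per subnet, a /24 offers 256 - 5 = 251 usable addresses, while three masters at 111 IP configurations each would need 3 x 111 = 333, which is why the third master cannot be provisioned.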

Thank you.

EPinci commented 6 years ago

@jackfrancis Any chance you can give this a go? What do you think about it?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.