Azure / ACS

Azure Container Service - Bug Tracker + Announcements

"Updating Kubernetes version and agent node scaling are mutually exclusive operations" #82

Closed bremnes closed 7 years ago

bremnes commented 7 years ago

Is this a request for help?: Yes, solutions or workarounds would be appreciated


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes v1.7.5

Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T08:56:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

What happened: When attempting to scale out an existing agent pool through the Azure Portal the following error appeared:

Failed to save container service 'MyContainerService'. Error: An error has occurred in subscription 11111111-2222-3333-4444-555555555555, resourceGroup: MyContainerServiceRg request: Provisioning of resource(s) for container service MyContainerService in resource group MyContainerServiceRg failed. Message: { "code": "OperationNotAllowed", "message": "Updating Kubernetes version and agent node scaling are mutually exclusive operations" }

What you expected to happen: The agent pool to be scaled out using the existing version. A newer version being available shouldn't stop operations on an existing ACS cluster. In this case we didn't even know that an upgrade was available (nor how to actually upgrade).

How to reproduce it (as minimally and precisely as possible): Use an ACS Kubernetes cluster which hasn't been upgraded to the newest version. Through the Azure Portal, navigate to the agent pool, change the VM count field, and click "Save".

Anything else we need to know: Issue #13 points out the need for improved documentation. If we are required to upgrade before scaling out, it would be nice to get a recipe or more details on how to do that. Some more information, in case it helps:

JackQuincy commented 7 years ago

Thank you for reporting this. The current workaround would be to use the az CLI. It does a GET then a PUT, so it shouldn't change the version. If you get the same error, you can work around it by doing an az acs scale with the agent count set to your current agent count. That should trigger an upgrade; then you can scale when that is done. The upgrade is time consuming, taking about 10 minutes per node.
I'll report back here with info as fixes come in.
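
The workaround described above amounts to something like this (a sketch; the resource names are the placeholders used elsewhere in this thread, and the counts are illustrative):

# scale to the count you already have: this triggers the pending upgrade
az acs scale -g MyContainerServiceRg -n MyContainerService --new-agent-count 1
# wait for the upgrade to finish (roughly 10 minutes per node), then scale for real
az acs scale -g MyContainerServiceRg -n MyContainerService --new-agent-count 2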

JackQuincy commented 7 years ago

If you were wondering, the reason I gave two paths is that it depends on which version of the az CLI you have.

bremnes commented 7 years ago

Thanks for the quick reply, Jack.

It doesn't seem like az acs scale supports multiple agent pools, but I tried different inputs to the --new-agent-count parameter. All attempts ended in a similar error message:

az acs scale -g MyContainerServiceRg -n MyContainerService --new-agent-count 2

An error has occurred in subscription 11111111-2222-3333-4444-555555555555, resourceGroup: MyContainerServiceRg request: OrchestratorProfile has unknown orchestrator version: 1.7.5.

Using the --debug flag I can see that it does a GET then a PUT like you said, but it ends up with a 400 response as described above. My CLI version is 2.0.20, if it makes any difference.

From the Request body being sent: {"location": "ukwest", "properties": {"orchestratorProfile": {"orchestratorType": "Kubernetes", "orchestratorVersion": "1.7.5" ....

JackQuincy commented 7 years ago

Hmm... #81 is getting in the way. An older version of the CLI might work: we updated the API version the CLI uses when calling the service, so an older CLI would behave more like the portal, and the workaround might work.

#81 is code complete; we just need to update the service, which we should be able to do next week.


bremnes commented 7 years ago

@JackQuincy, FYI we tried with an older cli (2.0.18) and then we got another error:

An error has occurred in subscription 11111111-2222-3333-4444-555555555555, resourceGroup: MyContainerServiceRg request: Provisioning of resource(s) for container service MyContainerService in resource group MyContainerServiceRg failed. Message: { "code": "PropertyChangeNotAllowed", "message": "Changing property 'agentPoolProfile.OSType' is not allowed.", "target": "agentPoolProfile.OSType" }

I'm assuming that is related to the fact that we have a Linux agent pool and a Windows agent pool and az acs scale doesn't fully support multiple, mixed agent pools?

JackQuincy commented 7 years ago

Sorry, just saw this. Yeah, the old CLI doesn't understand osType, and our code interprets its absence as Linux. On a side note, this fix is now rolling out; we are in about 50% of our datacenters.

bremnes commented 7 years ago

Tried again through the portal to see if the change had cascaded to "our" datacenter, and got another error now:

Failed to save container service 'MyContainerService'. Error: An error has occurred in subscription 11111111-2222-3333-4444-555555555555, resourceGroup: MyContainerServiceRg request: Provisioning of resource(s) for container service MyContainerService in resource group MyContainerServiceRg failed. Message: { "code": "InvalidParameter", "message": "The value of parameter orchestratorProfile.OrchestratorVersion is invalid.", "target": "orchestratorProfile.OrchestratorVersion" }

JackQuincy commented 7 years ago

What datacenter is this? @bremnes

bremnes commented 7 years ago

It's UK West @JackQuincy. I get the same error using CLI version 2.0.20. From the PUT call with --debug:

msrest.http_logger : {"location": "ukwest", "properties": {"orchestratorProfile": {"orchestratorType": "Kubernetes", "orchestratorVersion": "1.7.5"}, "masterProfile": ...

If it helps you dig out some information from the logs on your side I just ran the command and got this in the response header.

'x-ms-correlation-request-id': '58622c86-fcca-428b-887c-981bef5b3adc' 'x-ms-request-id': '70f27aaf-2ab2-45a3-b7e2-12a827fc55ee' 'Date': 'Mon, 06 Nov 2017 17:15:54 GMT'

JackQuincy commented 7 years ago

I found the issue... I'm really sorry about how long this is taking to resolve. I have a code fix and we are testing it in validation environments currently.

bremnes commented 7 years ago

The feedback is coming quickly, so not a problem at all @JackQuincy.

To throw another wrench into this scenario: through the portal I attempted to show this bug-in-progress to my colleague, but lo and behold - it didn't immediately throw an exception. Instead it went on its merry way and actually scaled out the cluster.

The good: at first glance the scale-out seemed to be working. The container service resource reports OK on the scale-out and displays "2" instead of "1" for the agent pool I tried to scale out. Inspecting the managed ACS resource group reveals an additional VM, which has been placed into the availability set as well.

The bad: the cluster is no longer accessible. kubectl proxy starts fine, but when trying to browse the dashboard it gives the error message Unable to connect to the server: dial tcp 51.141.XXX.YYY:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. Same goes for kubectl cluster-info or any other kubectl command. (The IP address is the master static public IP address.)

The activity log for the container service resource shows "Write ContainerServices" with status "Accepted". The activity log for the managed ACS resource group shows this (screenshot: acs-broken-scale-out). I have no idea why it's using a previously deleted container service name in the right column rather than the live container service it's actually performing the actions on.

I waited for some time to see if anything more would happen, but it seems like this is the "finished" state. The "Write Routes" that fails is for the 'k8s-master-12345678-routetable' and the underlying route 'k8s-master-12345678-0'. The properties field of the activity shows:

"statusCode": "429", "serviceRequestId": "a4a57bed-38d3-4f01-a8ea-5499edd42541", "statusMessage": "{\"error\":{\"code\":\"RetryableError\",\"message\":\"A retryable error occurred.\",\"details\":[{\"code\":\"RetryableErrorDueToAnotherOperation\",\"message\":\"Operation PutRouteOperation (c80a3395-38a3-43bb-9ce5-a2d3e8a8c385) is updating resource /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/MyContainerServiceRg_MyContainerService_ukwest/providers/Microsoft.Network/routeTables/k8s-master-12345678-routetable. The call can be retried in 11 seconds.\"}]}}"

This is a test cluster, so fortunately we can create another one if it's not fixable. But we are planning to go into production at some point, where ideally stuff like this doesn't happen. Any thoughts, or should we just create a support request to Microsoft so they can dig into the details? Or alternatively just scrap the entire cluster.

JackQuincy commented 7 years ago

If you are doing this with the az CLI, it creates and caches a service principal that is reused for future clusters; Kubernetes uses it to modify some of the networking resources, which is probably why the old cluster name is there. I'd expect the master to retry configuring that route every couple of minutes, and the route not being there could definitely cause the behavior you are seeing. If it isn't coming back, I'd restart the master VM and see if the route creation succeeds this time; the kubelet on the node should try to create it as it comes online. This is the first time I've seen an actual error confirming that we are wiping out the route table. I'd seen things that suggested it, but usually Kubernetes just fixes it almost instantly afterwards so no one notices. If restarting the VM doesn't work, another scale operation should.
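
The checks described above amount to something like this (a sketch; the resource group and resource names are the placeholder names used elsewhere in this thread):

# list the routes in the cluster route table; each node should have an entry
az network route-table route list -g MyContainerServiceRg_MyContainerService_ukwest --route-table-name k8s-master-12345678-routetable -o table

# restart the master VM so the kubelet recreates its route as it comes back online
az vm restart -g MyContainerServiceRg_MyContainerService_ukwest -n k8s-master-12345678-0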

bremnes commented 7 years ago

It didn't automatically fix itself overnight, so I tried the following:

1) Restart the master (Linux) server. Waited an hour after doing so; still nothing besides the same error message on kubectl commands: Unable to connect to the server: dial tcp 51.141.XXX.YYY:443: ...

2) Scale one of the node pools from 2 to 3 nodes. The additional server came up, but still no connection. Waited a couple of hours, still the same error. No failed lines in the activity log, but some activities were never completed - RouteTables and InboundNatRules only have a BeginRequest activity with no corresponding EndRequest. (Screenshot: acs-broken-scale-out-second-attempt.)

3) Not really expecting other results, a final attempt at scaling down to the original node count, 1. Still the same error.

I also restarted the master again, but to no avail. If you have any more ideas, let me know.

JackQuincy commented 7 years ago

So I tried to look at that same call and I'm seeing a success on the route tables being written to. I'm guessing this is a red herring. We are seeing issues with etcd not recovering well from a restart that happened when Azure rolled out updates on the host. We can be sure if you look at the route table in the resource group. Does it have a route for the master node?

JackQuincy commented 7 years ago

We should have a new GitHub issue on this soon, but can you try running this on your master and see if it fixes it?

sudo /bin/sed -i s/Restart=on-abnormal/Restart=always/g /lib/systemd/system/etcd.service
sudo systemctl daemon-reload
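
Not part of Jack's instructions, but a quick sanity check that the change took effect:

# the etcd unit should now report Restart=always
systemctl show etcd.service -p Restart
# recent etcd logs, to see whether it is staying up
sudo journalctl -u etcd -n 20 --no-pager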

bremnes commented 7 years ago

@JackQuincy Regarding your first question: the route table doesn't contain any routes at all. This must have happened during the latest scaling attempt, as they were present in yesterday's comment.

For the commands you gave me: I'm not sure what was supposed to happen, but it doesn't seem like they affected anything. kubectl still gives the same error as earlier. For good measure I tried restarting the server as well. Let me know if there are any logs or anything else that can help shed some light on the situation.

JackQuincy commented 7 years ago

:( I was hoping that would fix it. Can you run "sudo journalctl -u kubelet" on the master? In there we should be able to get logs and see what the failure is. I'm not sure if it contains any info you will want to scrub, so you can either send me a private gist or email it to jaquincy@microsoft.com.
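
A simple way to capture those logs into a file for sharing (my suggestion, not something Jack asked for specifically):

# dump the full kubelet journal to a file that can be attached to a private gist or email
sudo journalctl -u kubelet --no-pager > kubelet.log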

bremnes commented 7 years ago

I'm not sure of the content, so I've sent it privately for now. If there are any clues in there @JackQuincy is free to share the relevant information here so that others might learn from our situation.

I'm assuming this is quite relevant from the file, but I'll leave it up to Jack to draw some conclusions:

Error: invalid argument "" for --max-pods=: strconv.ParseInt: parsing "": invalid syntax

bremnes commented 7 years ago

I've been in touch with Jack via email. Based on the error above regarding --max-pods he pointed me in the direction of the file /etc/default/kubelet on the master node:

KUBELET_CLUSTER_DNS=10.0.0.10
KUBELET_API_SERVERS=https://10.240.255.5:443
KUBELET_IMAGE=gcrio.azureedge.net/google_containers/hyperkube-amd64:v1.7.5
KUBELET_NETWORK_PLUGIN=kubenet
KUBELET_MAX_PODS=
...

As Jack, and the error, pointed out, KUBELET_MAX_PODS should have a value. After being advised to set it to 110 and restarting the server, we now get another error:

Nov 09 23:50:07 k8s-master-12345678-0 systemd[1]: Started Kubelet.
Nov 09 23:50:09 k8s-master-12345678-0 docker[1565]: 2017-11-09 23:50:09.125154 I | proto: duplicate proto type registered: google.protobuf.Any
Nov 09 23:50:09 k8s-master-12345678-0 docker[1565]: 2017-11-09 23:50:09.125705 I | proto: duplicate proto type registered: google.protobuf.Duration
Nov 09 23:50:09 k8s-master-12345678-0 docker[1565]: 2017-11-09 23:50:09.125730 I | proto: duplicate proto type registered: google.protobuf.Timestamp
Nov 09 23:50:09 k8s-master-12345678-0 docker[1565]: I1109 23:50:09.222704 1614 feature_gate.go:144] feature gates: map[]
Nov 09 23:50:09 k8s-master-12345678-0 docker[1565]: Error: failed to run Kubelet: could not init cloud provider "azure": No credentials provided for AAD application
Nov 09 23:50:09 k8s-master-12345678-0 docker[1565]: Error: failed to run Kubelet: could not init cloud provider "azure": No credentials provided for AAD application
Nov 09 23:50:09 k8s-master-12345678-0 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Nov 09 23:50:09 k8s-master-12345678-0 systemd[1]: kubelet.service: Unit entered failed state.
Nov 09 23:50:09 k8s-master-12345678-0 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Nov 09 23:50:09 k8s-master-12345678-0 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.

For what it's worth: we haven't touched the underlying infrastructure, as we saw that as part of the managed Azure Container Service (before touching the max-pods setting, of course). So the theory is still that Kubernetes and/or ACS is at fault, somehow dropping or misconfiguring the configuration.
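
For reference, the max-pods edit described above amounts to something like this on the master (a sketch; the sed pattern assumes the empty KUBELET_MAX_PODS= line shown earlier, and restarting just the kubelet service should be roughly equivalent to the full server restart we did):

# give KUBELET_MAX_PODS a value and restart the kubelet to pick it up
sudo sed -i 's/^KUBELET_MAX_PODS=$/KUBELET_MAX_PODS=110/' /etc/default/kubelet
sudo systemctl restart kubelet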

JackQuincy commented 7 years ago

@bremnes we had at least one bug in this area affect us a lot last week, and it seems like there is another one, possibly ours. I got logs pulled from one of your machines and the behavior is a little strange, so I wanted to get you to decrypt a settings file for me to help figure out where this bug lies.


sudo -s

cd /var/lib/waagent

# thumbprint of the certificate protecting the CustomScript extension settings
THUMBPRINT=$(xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.3.xml | \
 jq -r .runtimeSettings[].handlerSettings.protectedSettingsCertThumbprint)

# decode and decrypt the protected settings, then pretty-print the resulting JSON
xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.3.xml | \
 jq -r .runtimeSettings[].handlerSettings.protectedSettings | \
 base64 -d | \
 openssl smime -inform DER -recip $THUMBPRINT.crt -inkey $THUMBPRINT.prv -decrypt | \
 jq .

That should output something like this

{
  "commandToExecute": "{a bunch of environment variable setup} /usr/bin/nohup /bin/bash -c \"/bin/bash /opt/azure/containers/provision.sh >> /var/log/azure/cluster-provision.log 2>&1\""
}

I'm guessing your SERVICE_PRINCIPAL_CLIENT_ID and/or SERVICE_PRINCIPAL_CLIENT_SECRET variables are being set to empty. I think your MAX_PODS also gets set to nothing. I just want to confirm before I chase down why things got nulled out.
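
If it helps, a rough way to spot empty assignments once the commandToExecute has been decoded (assuming it has been saved to a file such as /tmp/commandToExecute.txt, a hypothetical path, and that variables are assigned as NAME='value'; adjust the pattern to whatever quoting you actually see):

# list variables whose value appears to be empty
grep -oE "[A-Z_]+=(''|\"\")" /tmp/commandToExecute.txt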

bremnes commented 7 years ago

@JackQuincy To be absolutely sure, do you want me to run the commandToExecute as sudo? I had to install xml-twig-tools to be able to generate the commandToExecute. I see that it contains GUIDs and resource names for some of the ACS resources, in addition to a couple of base64-encoded private keys. Just want to be certain that the commandToExecute wasn't supposed to contain those variables (because it doesn't).

JackQuincy commented 7 years ago

The sudo -s is there because you can't navigate or read /var/lib/waagent/ without it.

The commandToExecute should have had all of those variables, and if it doesn't but still ends with "/usr/bin/nohup /bin/bash -c \"/bin/bash /opt/azure/containers/provision.sh >> /var/log/azure/cluster-provision.log 2>&1\"" then that is certainly the issue. I'm going to try to chase down how/why our extension lost some of its parameters. It seems like there was a bug with keeping that script and it got messed up for some reason. The base64-encoded bits are private keys for running your Kubernetes cluster, so I wouldn't post those. But I'd be curious what the list of variable names that are still being initialized looks like.

Each version of the script is downloaded into a new file. At the same time as your issues with max pods etc. started, a new version was downloaded and run.
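
For reference, those numbered config files live under /var/lib/waagent, so you can see which versions the agent has downloaded with something like:

# requires root, like the rest of /var/lib/waagent
sudo sh -c 'ls -l /var/lib/waagent/ExtensionsConfig.*.xml'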

On a side note, we have taken steps to prevent this in the future, and new clusters shouldn't hit these issues even if similar bugs pop up in our dependencies after an incoming deployment. Stay tuned and I'll update when we have updated our service.

bremnes commented 7 years ago

Thanks for the quick reply @JackQuincy. I've added the output, except the private keys, to the gist that I've shared with you. Let me know if there is anything else I should check.

JackQuincy commented 7 years ago

Talking with @bremnes I was able to figure out what happened. ACS changed how we call the provision script in our CSE to prevent a class of issues/bugs we've had for a while. We changed from passing arguments by position (so if one was missing, all the indexes shifted by one) to setting variables by name. This was good and works. The problem is we weren't removing the old command used to run it on the masters of clusters being scaled, and we also didn't download a new provision script in those cases. This caused the script to become destructive.

To fix a broken cluster, please run these bash commands on each master:

sudo -s
apt install xml-twig-tools
cd /var/lib/waagent
# thumbprint of the certificate protecting the CustomScript extension settings
THUMBPRINT=$(xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.2.xml | \
  jq -r .runtimeSettings[].handlerSettings.protectedSettingsCertThumbprint)
# decrypt the protected settings, extract the original commandToExecute, and run it
xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.2.xml |\
  jq -r .runtimeSettings[].handlerSettings.protectedSettings |\
  base64 -d |\
  openssl smime -inform DER -recip $THUMBPRINT.crt -inkey $THUMBPRINT.prv -decrypt |\
  jq -r .commandToExecute |\
  bash

This will reset the state to what it was when the cluster was provisioned and should get it back to a good state.

Edit: updated the script to point to the first version that would have contained our command, instead of the version that is there on startup before the download.
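
Once the command finishes, a quick sanity check that the master came back (my addition, not part of the fix itself):

# kubelet should be running again and the API server reachable
sudo systemctl status kubelet --no-pager
kubectl get nodes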

bremnes commented 7 years ago

@JackQuincy Our ExtensionsConfig.1.xml file didn't contain the Plugin element, so unfortunately that script didn't work for us. (See gist for what the file actually contained.)

JackQuincy commented 7 years ago

Interesting. My bad: I ran the script against ExtensionsConfig.2.xml and had assumed 1 was just an older version, and I wanted the oldest. I'll inline-update the script to use 2.

bremnes commented 7 years ago

When running the command with ExtensionsConfig.2.xml, the output was: /usr/bin/nohup: appending output to 'nohup.out'

I'm not sure what was supposed to happen. /usr/bin/nohup contains what seems to be gibberish and nohup.out (in the /var/lib/waagent folder) is empty.

JackQuincy commented 7 years ago

Sounds like it ran this time. That master should be working again. If you run kubectl get no, it should now return.

bremnes commented 7 years ago

After running the last command with ExtensionsConfig.2.xml, the cluster started being accessible again: we could reach the publicly exposed services and run kubectl commands.

Even though the cluster seemed healthy, I wanted to get this confirmed by doing a new scale operation - from 1 to 3 nodes in the Windows node pool. Through the Azure portal the container service reported back that the scale operation went OK. The services are still available and I can still do kubectl operations.

But looking at the activity log for the managed resource group and the state of the nodes, it doesn't seem like the cluster is 100% healthy:

>kubectl get nodes
NAME                       STATUS     AGE       VERSION
12345acs9000               Ready      45d       v1.7.5-7+25600793e1bcfa
12345k8s9001               Ready      1d        v1.7.5-30+77464b0a319726
12345k8s9002               Ready      1d        v1.7.5-30+77464b0a319726
k8s-linuxpool-12345678-0   Ready      45d       v1.7.5
k8s-linuxpool-12345678-1   Ready      45d       v1.7.5
k8s-master-12345678-0      NotReady   45d       v1.7.5

(Screenshot: acs-round2-scale-out-attempt.) 19:19 is when the Linux command given earlier by @JackQuincy was run; 20:58 is approximately when the scale was done through the Azure Portal.
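
To narrow down why the master is NotReady, generic checks like these can help (not specific advice from this thread):

# node conditions and recent events for the master
kubectl describe node k8s-master-12345678-0

# on the master itself: the latest kubelet log entries
sudo journalctl -u kubelet -n 100 --no-pager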

JackQuincy commented 7 years ago

So the fix for scaling finished rolling out globally yesterday, so it is possible that when you scaled up, the bad bits were still in the datacenter you were in. I'd rerun the script and try to scale again. Everything should be fixed now.

bremnes commented 7 years ago

Yeah, I assumed that was the solution, just wanted to get it confirmed before doing any more manual operations on the cluster.

Running the command quickly flipped the status of the master agent to Ready like all the other nodes.

Once again I tried to scale through the Portal and got another error, but it still went ahead and scaled the cluster.

Failed to save container service 'MyContainerService'. Error: No registered resource provider found for location 'ukwest' and API version '2016-03-30' for type 'locations/operations'. The supported api-versions are '2015-11-01-preview, 2016-03-30, 2016-09-30, 2017-01-31, 2017-07-01, 2017-08-31'. The supported locations are 'japaneast, centralus, eastus2, japanwest, eastasia, southcentralus, australiaeast, australiasoutheast, brazilsouth, southeastasia, westus, northcentralus, westeurope, northeurope, eastus, uksouth, westcentralus, southindia, centralindia, westindia, canadaeast, canadacentral'.

The cluster works and everything seems okay. But I'm not a fan of how the activities in the activity log just state "started", as it leaves uncertainty about whether everything is 100% done. For instance, "Delete Routes" only has a BeginRequest - is it really completed, or could there be potential problems down the road?

Anyways, thanks for helping out on this issue @JackQuincy, I really appreciate it! Unless the activity thing above is related to this issue, I assume it can be closed.

JackQuincy commented 7 years ago

@bremnes the activity thing isn't related; I see you opened an issue for it. The deployment error you are seeing about the API version is more concerning. I found the error: it is in configuration and should be fixed today.

JackQuincy commented 7 years ago

@amanohar here is how I found the manifest issue