JackQuincy opened 7 years ago
To fix a broken cluster, run these bash commands on each of the master nodes:
```bash
sudo -s
# xml_grep is provided by the xml-twig-tools package; jq is also needed below
apt install xml-twig-tools jq
cd /var/lib/waagent
# Thumbprint of the cert that protects the CustomScript extension's settings
THUMBPRINT=$(xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.2.xml | \
  jq -r '.runtimeSettings[].handlerSettings.protectedSettingsCertThumbprint')
# Decrypt the protected settings and re-run the original provisioning command
xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.2.xml | \
  jq -r '.runtimeSettings[].handlerSettings.protectedSettings' | \
  base64 -d | \
  openssl smime -inform DER -recip "$THUMBPRINT.crt" -inkey "$THUMBPRINT.prv" -decrypt | \
  jq -r '.commandToExecute' | \
  bash
```
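If you want to inspect the command before executing anything, the same pipeline works as a dry run: drop the final `| bash` stage and the decrypted commandToExecute is simply printed (this assumes `THUMBPRINT` is still set from the script above and that you are still in /var/lib/waagent):

```bash
# Dry run: print the decrypted command instead of executing it
xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.2.xml | \
  jq -r '.runtimeSettings[].handlerSettings.protectedSettings' | \
  base64 -d | \
  openssl smime -inform DER -recip "$THUMBPRINT.crt" -inkey "$THUMBPRINT.prv" -decrypt | \
  jq -r '.commandToExecute'
```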
This should set the config back to its original state and fix the cluster.
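As a quick sanity check once the script has run on every master (a minimal sketch, assuming kubectl is configured on the node you check from):

```bash
# All nodes should return to Ready once the config is restored
kubectl get nodes
# Control-plane components should report Healthy
kubectl get componentstatuses
```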
Edit: changed the script after trying it, as I realized I had picked the wrong file.
For the curious, here is a link to the acs-engine change that started this: https://github.com/Azure/acs-engine/pull/1570
This is now rolled out to the service globally. I'm leaving this issue open for a bit so that people who scaled their clusters while the bug was out can find the script to fix them.
Is this a request for help?:
NO
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes
What happened: on scaling, the cluster stops working
What you expected to happen: the cluster to continue working
How to reproduce it (as minimally and precisely as possible): Have an old cluster. Scale it up. Be broken.
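For reference, the scale-up step could be done with the Azure CLI of that era; a sketch with hypothetical resource-group and cluster names:

```bash
# Hypothetical names; substitute your own resource group and ACS cluster
az acs scale --resource-group my-rg --name my-acs-cluster --new-agent-count 5
```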
Anything else we need to know: PR to fix the root cause in ACS-Engine: https://github.com/Azure/acs-engine/pull/1764
I'll try to get this merged and into the Service ASAP to fix this.