JackQuincy opened 7 years ago
To fix a broken cluster, run these bash commands on each of the master nodes:
```bash
sudo -s
# xml_grep is provided by the xml-twig-tools package; jq is also needed below
apt install xml-twig-tools jq
cd /var/lib/waagent
# Thumbprint of the cert that protects the CustomScript extension's settings
THUMBPRINT=$(xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.2.xml | \
  jq -r '.runtimeSettings[].handlerSettings.protectedSettingsCertThumbprint')
# Decrypt the protected settings and re-run the original provisioning command
xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.2.xml | \
  jq -r '.runtimeSettings[].handlerSettings.protectedSettings' | \
  base64 -d | \
  openssl smime -inform DER -recip "$THUMBPRINT.crt" -inkey "$THUMBPRINT.prv" -decrypt | \
  jq -r '.commandToExecute' | \
  bash
```
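If you want to inspect the command before executing anything, the same pipeline works as a dry run: drop the final `| bash` stage and the decrypted commandToExecute is simply printed (this assumes `THUMBPRINT` is still set from the script above and that you are still in /var/lib/waagent):

```bash
# Dry run: print the decrypted command instead of executing it
xml_grep "//Plugin[@name='Microsoft.Azure.Extensions.CustomScript']/RuntimeSettings" --text_only ExtensionsConfig.2.xml | \
  jq -r '.runtimeSettings[].handlerSettings.protectedSettings' | \
  base64 -d | \
  openssl smime -inform DER -recip "$THUMBPRINT.crt" -inkey "$THUMBPRINT.prv" -decrypt | \
  jq -r '.commandToExecute'
```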
This should set the config back to its original state and fix the cluster.
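As a quick sanity check once the script has run on every master (a minimal sketch, assuming kubectl is configured on the node you check from):

```bash
# All nodes should return to Ready once the config is restored
kubectl get nodes
# Control-plane components should report Healthy
kubectl get componentstatuses
```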
Edit: changed the script after trying it, as I realized I had picked the wrong file.
For the curious, here is a link to the acs-engine change that started this: https://github.com/Azure/acs-engine/pull/1570
This is now rolled out to the service globally. I'm leaving this issue open for a bit so that people who scaled their clusters while the bug was out can find the script to fix them.
Is this a request for help?:
NO
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes
What happened: on scaling, the cluster stops working
What you expected to happen: the cluster to continue working
How to reproduce it (as minimally and precisely as possible): Have an old cluster. Scale it up. Be broken.
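For reference, the scale-up step could be done with the Azure CLI of that era; a sketch with hypothetical resource-group and cluster names:

```bash
# Hypothetical names; substitute your own resource group and ACS cluster
az acs scale --resource-group my-rg --name my-acs-cluster --new-agent-count 5
```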
Anything else we need to know: PR to fix the root cause in ACS-Engine: https://github.com/Azure/acs-engine/pull/1764
I'll try to get this merged and into the Service ASAP to fix this.