Azure / aks-engine

AKS Engine: legacy tool for Kubernetes on Azure (see status)
https://github.com/Azure/aks-engine
MIT License
1.03k stars 522 forks source link

existing cluster using non-functional VHD prevents VMSS scale out #2074

Closed PascalVA closed 4 years ago

PascalVA commented 4 years ago

Bug Description

When trying to add a new node pool to our existing cluster that was deployed with aks-engine v0.38.8 (aks-ubuntu-1804), we were unable to succeed in the deployment. The reason for this is that the Ubuntu base image had an expired GPG key for nvidia (VMExtensionProvisioningError, status=99). This was a hard problem to solve due to a few reasons:

1. There is no way to redeploy after manually fixing the issue on the nodes When we had a failed deployment and manually added the nvidia GPG key + performed the apt updates on the affected machines, redeploying with aks-enine on the existing machines just kept returning the same error as before we fixed it (VMExtensionProvisioningError, status=99).

2. There is no way (or at least not that I know of) to inject some commands to run before (or inside of) the cloud provisioner scripts when working with an existing cluster Whenever we tried to add the command to update the key as a hotfix in the cloud provisioning scripts, we received an error that we cannot update the customData. This probably due to the old nodes that were deployed with different init scripts.

3. Azure VM Images you can use are built in to the aks-engine binary. The last thing we tried was to use a newer image of aks-ubuntu-1804 to get around the key issue. To our surprise, the images you can use are built into the aks-engine binary. Perhaps we can add a way to use custom image SKU's and versions instead of baking them into the binary?

We managed to build a custom built aks-engine binary that had an image reference we called aks-engine-1804-patched that referenced a newer image from september. (which has different issues related to systemd-resolvd), but at least the deployment worked. I have put the diff (relative to the 0.38.8 release) in the additional context.

Steps To Reproduce

I don't think there is any easy way to reproduce the issue, because you would have to have an existing cluster of aks-engine v0.38.8 with the images of Ubuntu before the key was expired.

Expected behavior

We should have a way to keep deploying with older versions of aks-engine even if the base images baked into it break. Alternatively, we should be able to deploy with newer versions of aks-engine on older clusters and handle these compatibility issues.

AKS Engine version

v0.38.8

Kubernetes version

v1.13.10

Additional context

diff --git a/pkg/api/azenvtypes.go b/pkg/api/azenvtypes.go
index 66d7678eb..05b4ee405 100644
--- a/pkg/api/azenvtypes.go
+++ b/pkg/api/azenvtypes.go
@@ -148,6 +148,14 @@ var (
                ImageVersion:   "2019.08.09387",
        }

+       // AKSUbuntu1804PatchedOSImageConfig is the AKS image based on Ubuntu 18.04-LTS.
+       AKSUbuntu1804PatchedOSImageConfig = AzureOSImageConfig{
+               ImageOffer:     "aks",
+               ImageSku:       "aks-ubuntu-1804-201909",
+               ImagePublisher: "microsoft-aks",
+               ImageVersion:   "2019.09.24",
+       }
+
        // ACC1604OSImageConfig is the ACC image based on Ubuntu 16.04.
        ACC1604OSImageConfig = AzureOSImageConfig{
                ImageOffer:     "confidential-compute-preview",
@@ -170,13 +178,14 @@ var (
                },

                OSImageConfig: map[Distro]AzureOSImageConfig{
-                       Ubuntu:        Ubuntu1604OSImageConfig,
-                       Ubuntu1804:    Ubuntu1804OSImageConfig,
-                       RHEL:          RHELOSImageConfig,
-                       CoreOS:        CoreOSImageConfig,
-                       AKSUbuntu1604: AKSUbuntu1604OSImageConfig,
-                       AKSUbuntu1804: AKSUbuntu1804OSImageConfig,
-                       ACC1604:       ACC1604OSImageConfig,
+                       Ubuntu:               Ubuntu1604OSImageConfig,
+                       Ubuntu1804:           Ubuntu1804OSImageConfig,
+                       RHEL:                 RHELOSImageConfig,
+                       CoreOS:               CoreOSImageConfig,
+                       AKSUbuntu1604:        AKSUbuntu1604OSImageConfig,
+                       AKSUbuntu1804:        AKSUbuntu1804OSImageConfig,
+                       AKSUbuntu1804Patched: AKSUbuntu1804PatchedOSImageConfig,
+                       ACC1604:              ACC1604OSImageConfig,
                },
        }

@@ -253,12 +262,13 @@ var (
                        ResourceManagerVMDNSSuffix: "cloudapp.chinacloudapi.cn",
                },
                OSImageConfig: map[Distro]AzureOSImageConfig{
-                       Ubuntu:        Ubuntu1604OSImageConfig,
-                       Ubuntu1804:    Ubuntu1804OSImageConfig,
-                       RHEL:          RHELOSImageConfig,
-                       CoreOS:        CoreOSImageConfig,
-                       AKSUbuntu1604: AKSUbuntu1604OSImageConfig,
-                       AKSUbuntu1804: AKSUbuntu1804OSImageConfig,
+                       Ubuntu:               Ubuntu1604OSImageConfig,
+                       Ubuntu1804:           Ubuntu1804OSImageConfig,
+                       RHEL:                 RHELOSImageConfig,
+                       CoreOS:               CoreOSImageConfig,
+                       AKSUbuntu1604:        AKSUbuntu1604OSImageConfig,
+                       AKSUbuntu1804:        AKSUbuntu1804OSImageConfig,
+                       AKSUbuntu1804Patched: AKSUbuntu1804PatchedOSImageConfig,
                },
        }

diff --git a/pkg/api/const.go b/pkg/api/const.go
index a8a011d42..6cfef29c7 100644
--- a/pkg/api/const.go
+++ b/pkg/api/const.go
@@ -25,16 +25,17 @@ const (

 // Distro string consts
 const (
-       Ubuntu            Distro = "ubuntu"
-       Ubuntu1804        Distro = "ubuntu-18.04"
-       RHEL              Distro = "rhel"
-       CoreOS            Distro = "coreos"
-       AKS1604Deprecated Distro = "aks"               // deprecated AKS 16.04 distro. Equivalent to aks-ubuntu-16.04.
-       AKS1804Deprecated Distro = "aks-1804"          // deprecated AKS 18.04 distro. Equivalent to aks-ubuntu-18.04.
-       AKSDockerEngine   Distro = "aks-docker-engine" // deprecated docker-engine distro.
-       AKSUbuntu1604     Distro = "aks-ubuntu-16.04"
-       AKSUbuntu1804     Distro = "aks-ubuntu-18.04"
-       ACC1604           Distro = "acc-16.04"
+       Ubuntu               Distro = "ubuntu"
+       Ubuntu1804           Distro = "ubuntu-18.04"
+       RHEL                 Distro = "rhel"
+       CoreOS               Distro = "coreos"
+       AKS1604Deprecated    Distro = "aks"               // deprecated AKS 16.04 distro. Equivalent to aks-ubuntu-16.04.
+       AKS1804Deprecated    Distro = "aks-1804"          // deprecated AKS 18.04 distro. Equivalent to aks-ubuntu-18.04.
+       AKSDockerEngine      Distro = "aks-docker-engine" // deprecated docker-engine distro.
+       AKSUbuntu1604        Distro = "aks-ubuntu-16.04"
+       AKSUbuntu1804        Distro = "aks-ubuntu-18.04"
+       AKSUbuntu1804Patched Distro = "aks-ubuntu-18.04-patched"
+       ACC1604              Distro = "acc-16.04"
 )

 const (
diff --git a/pkg/api/defaults-kubelet_test.go b/pkg/api/defaults-kubelet_test.go
index ee9ad7cee..b35508a3f 100644
--- a/pkg/api/defaults-kubelet_test.go
+++ b/pkg/api/defaults-kubelet_test.go
@@ -530,7 +530,7 @@ func TestProtectKernelDefaults(t *testing.T) {
        // Validate that --protect-kernel-defaults is "true" by default for relevant distros
        for _, distro := range DistroValues {
                switch distro {
-               case AKSUbuntu1604, AKSUbuntu1804:
+               case AKSUbuntu1604, AKSUbuntu1804, AKSUbuntu1804Patched:
                        cs = CreateMockContainerService("testcluster", "1.10.13", 3, 2, false)
                        cs.Properties.MasterProfile.Distro = distro
                        cs.Properties.AgentPoolProfiles[0].Distro = distro
@@ -584,7 +584,7 @@ func TestProtectKernelDefaults(t *testing.T) {
        // Validate that --protect-kernel-defaults is overridable
        for _, distro := range DistroValues {
                switch distro {
-               case Ubuntu, Ubuntu1804, AKSUbuntu1604, AKSUbuntu1804:
+               case Ubuntu, Ubuntu1804, AKSUbuntu1604, AKSUbuntu1804, AKSUbuntu1804Patched:
                        cs = CreateMockContainerService("testcluster", "1.10.13", 3, 2, false)
                        cs.Properties.MasterProfile.Distro = "ubuntu"
                        cs.Properties.AgentPoolProfiles[0].Distro = "ubuntu"
diff --git a/pkg/api/defaults.go b/pkg/api/defaults.go
index 341fe0973..017c676a0 100644
--- a/pkg/api/defaults.go
+++ b/pkg/api/defaults.go
@@ -24,7 +24,7 @@ import (
 )

 // DistroValues is a list of currently supported distros
-var DistroValues = []Distro{"", Ubuntu, Ubuntu1804, RHEL, CoreOS, AKSUbuntu1604, AKSUbuntu1804, ACC1604}
+var DistroValues = []Distro{"", Ubuntu, Ubuntu1804, RHEL, CoreOS, AKSUbuntu1604, AKSUbuntu1804, AKSUbuntu1804Patched, ACC1604}

 // SetPropertiesDefaults for the container Properties, returns true if certs are generated
 func (cs *ContainerService) SetPropertiesDefaults(isUpgrade, isScale bool) (bool, error) {
diff --git a/pkg/api/types.go b/pkg/api/types.go
index 4cf648aac..bee8bc449 100644
--- a/pkg/api/types.go
+++ b/pkg/api/types.go
@@ -1236,7 +1236,7 @@ func (m *MasterProfile) IsCoreOS() bool {

 // IsVHDDistro returns true if the distro uses VHD SKUs
 func (m *MasterProfile) IsVHDDistro() bool {
-       return m.Distro == AKSUbuntu1604 || m.Distro == AKSUbuntu1804
+       return m.Distro == AKSUbuntu1604 || m.Distro == AKSUbuntu1804 || m.Distro == AKSUbuntu1804Patched
 }

 // IsVirtualMachineScaleSets returns true if the master availability profile is VMSS
@@ -1291,7 +1291,7 @@ func (m *MasterProfile) IsUbuntu1604() bool {
 // IsUbuntu1804 returns true if the master profile distro is based on Ubuntu 18.04
 func (m *MasterProfile) IsUbuntu1804() bool {
        switch m.Distro {
-       case AKSUbuntu1804, Ubuntu1804:
+       case AKSUbuntu1804, AKSUbuntu1804Patched, Ubuntu1804:
                return true
        default:
                return false
@@ -1365,7 +1365,7 @@ func (a *AgentPoolProfile) IsCoreOS() bool {

 // IsVHDDistro returns true if the distro uses VHD SKUs
 func (a *AgentPoolProfile) IsVHDDistro() bool {
-       return a.Distro == AKSUbuntu1604 || a.Distro == AKSUbuntu1804
+       return a.Distro == AKSUbuntu1604 || a.Distro == AKSUbuntu1804 || a.Distro == AKSUbuntu1804Patched
 }

 // IsAvailabilitySets returns true if the customer specified disks
@@ -1420,7 +1420,7 @@ func (a *AgentPoolProfile) IsUbuntu1604() bool {
 func (a *AgentPoolProfile) IsUbuntu1804() bool {
        if a.OSType != Windows {
                switch a.Distro {
-               case AKSUbuntu1804, Ubuntu1804:
+               case AKSUbuntu1804, AKSUbuntu1804Patched, Ubuntu1804:
                        return true
                default:
                        return false
diff --git a/pkg/api/vlabs/const.go b/pkg/api/vlabs/const.go
index 1d4efa9ab..a813f6061 100644
--- a/pkg/api/vlabs/const.go
+++ b/pkg/api/vlabs/const.go
@@ -28,16 +28,17 @@ const (

 // the LinuxDistros supported by vlabs
 const (
-       Ubuntu            Distro = "ubuntu"
-       Ubuntu1804        Distro = "ubuntu-18.04"
-       RHEL              Distro = "rhel"
-       CoreOS            Distro = "coreos"
-       AKS1604Deprecated Distro = "aks"               // deprecated AKS 16.04 distro. Equivalent to aks-ubuntu-16.04.
-       AKS1804Deprecated Distro = "aks-1804"          // deprecated AKS 18.04 distro. Equivalent to aks-ubuntu-18.04.
-       AKSDockerEngine   Distro = "aks-docker-engine" // deprecated docker-engine distro.
-       AKSUbuntu1604     Distro = "aks-ubuntu-16.04"
-       AKSUbuntu1804     Distro = "aks-ubuntu-18.04"
-       ACC1604           Distro = "acc-16.04"
+       Ubuntu               Distro = "ubuntu"
+       Ubuntu1804           Distro = "ubuntu-18.04"
+       RHEL                 Distro = "rhel"
+       CoreOS               Distro = "coreos"
+       AKS1604Deprecated    Distro = "aks"               // deprecated AKS 16.04 distro. Equivalent to aks-ubuntu-16.04.
+       AKS1804Deprecated    Distro = "aks-1804"          // deprecated AKS 18.04 distro. Equivalent to aks-ubuntu-18.04.
+       AKSDockerEngine      Distro = "aks-docker-engine" // deprecated docker-engine distro.
+       AKSUbuntu1604        Distro = "aks-ubuntu-16.04"
+       AKSUbuntu1804        Distro = "aks-ubuntu-18.04"
+       AKSUbuntu1804Patched Distro = "aks-ubuntu-18.04-patched"
+       ACC1604              Distro = "acc-16.04"
 )

 // validation values
@@ -98,7 +99,7 @@ var (
        ContainerRuntimeValues = [...]string{"", Docker, ClearContainers, KataContainers, Containerd}

        // DistroValues holds the valid values for OS distros
-       DistroValues = []Distro{"", Ubuntu, Ubuntu1804, RHEL, CoreOS, AKSUbuntu1604, AKSUbuntu1804, ACC1604}
+       DistroValues = []Distro{"", Ubuntu, Ubuntu1804, RHEL, CoreOS, AKSUbuntu1604, AKSUbuntu1804, AKSUbuntu1804Patched, ACC1604}

        // DependenciesLocationValues holds the valid values for dependencies location
        DependenciesLocationValues = []DependenciesLocation{"", AzureStackDependenciesLocationPublic, AzureStackDependenciesLocationChina, AzureStackDependenciesLocationGerman, AzureStackDependenciesLocationUSGovernment}
diff --git a/pkg/api/vlabs/types.go b/pkg/api/vlabs/types.go
index f98cc691a..15dd30bb9 100644
--- a/pkg/api/vlabs/types.go
+++ b/pkg/api/vlabs/types.go
@@ -627,7 +627,7 @@ func (m *MasterProfile) IsUbuntu1604() bool {
 // IsUbuntu1804 returns true if the master profile distro is based on Ubuntu 18.04
 func (m *MasterProfile) IsUbuntu1804() bool {
        switch m.Distro {
-       case AKSUbuntu1804, Ubuntu1804:
+       case AKSUbuntu1804, AKSUbuntu1804Patched, Ubuntu1804:
                return true
        default:
                return false
@@ -765,7 +765,7 @@ func (a *AgentPoolProfile) IsUbuntu1604() bool {
 func (a *AgentPoolProfile) IsUbuntu1804() bool {
        if a.OSType != Windows {
                switch a.Distro {
-               case AKSUbuntu1804, Ubuntu1804:
+               case AKSUbuntu1804, AKSUbuntu1804Patched, Ubuntu1804:
                        return true
                default:
                        return false
diff --git a/pkg/api/vlabs/validate_test.go b/pkg/api/vlabs/validate_test.go
index c0e5784cd..2748c5244 100644
--- a/pkg/api/vlabs/validate_test.go
+++ b/pkg/api/vlabs/validate_test.go
@@ -2277,10 +2277,10 @@ func TestProperties_ValidateVNET(t *testing.T) {
                {
                        name: "Invalid MasterProfile FirstConsecutiveStaticIP when master is VMAS",
                        masterProfile: &MasterProfile{
-                               VnetSubnetID:             validVNetSubnetID,
-                               Count:                    1,
-                               DNSPrefix:                "foo",
-                               VMSize:                   "Standard_DS2_v2",
+                               VnetSubnetID: validVNetSubnetID,
+                               Count:        1,
+                               DNSPrefix:    "foo",
+                               VMSize:       "Standard_DS2_v2",
                                FirstConsecutiveStaticIP: "10.0.0.invalid",
                        },
                        agentPoolProfiles: []*AgentPoolProfile{
@@ -2338,10 +2338,10 @@ func TestProperties_ValidateVNET(t *testing.T) {
                {
                        name: "Invalid vnetcidr",
                        masterProfile: &MasterProfile{
-                               VnetSubnetID:             validVNetSubnetID,
-                               Count:                    1,
-                               DNSPrefix:                "foo",
-                               VMSize:                   "Standard_DS2_v2",
+                               VnetSubnetID: validVNetSubnetID,
+                               Count:        1,
+                               DNSPrefix:    "foo",
+                               VMSize:       "Standard_DS2_v2",
                                FirstConsecutiveStaticIP: "10.0.0.1",
                                VnetCidr:                 "10.1.0.0/invalid",
                        },
@@ -2762,7 +2762,7 @@ func TestAgentPoolProfile_ValidateAuditDEnabled(t *testing.T) {
                                if err := cs.Properties.validateAgentPoolProfiles(false); err.Error() != expectedMsg {
                                        t.Errorf("expected error with message : %s, but got %s", expectedMsg, err.Error())
                                }
-                       case Ubuntu, Ubuntu1804, AKSUbuntu1604, AKSUbuntu1804, ACC1604:
+                       case Ubuntu, Ubuntu1804, AKSUbuntu1604, AKSUbuntu1804, AKSUbuntu1804Patched, ACC1604:
                                if err := cs.Properties.validateAgentPoolProfiles(false); err != nil {
                                        t.Errorf("AuditDEnabled should work with distro %s, got error %s", distro, err.Error())
                                }
@@ -2785,7 +2785,7 @@ func TestMasterProfile_ValidateAuditDEnabled(t *testing.T) {
                                if err := cs.Properties.validateMasterProfile(false); err.Error() != expectedMsg {
                                        t.Errorf("expected error with message : %s, but got %s", expectedMsg, err.Error())
                                }
-                       case Ubuntu, Ubuntu1804, AKSUbuntu1604, AKSUbuntu1804, ACC1604:
+                       case Ubuntu, Ubuntu1804, AKSUbuntu1604, AKSUbuntu1804, AKSUbuntu1804Patched, ACC1604:
                                if err := cs.Properties.validateMasterProfile(false); err != nil {
                                        t.Errorf("AuditDEnabled should work with distro %s, got error %s", distro, err.Error())
                                }
@@ -3085,9 +3085,9 @@ func TestValidateLocation(t *testing.T) {
                                        },
                                        AgentPoolProfiles: []*AgentPoolProfile{
                                                {
-                                                       Name:                         "testpool",
-                                                       Count:                        1,
-                                                       VMSize:                       "Standard_D2_v2",
+                                                       Name:   "testpool",
+                                                       Count:  1,
+                                                       VMSize: "Standard_D2_v2",
                                                        AcceleratedNetworkingEnabled: to.BoolPtr(trueVal),
                                                },
                                        },
@@ -3111,9 +3111,9 @@ func TestValidateLocation(t *testing.T) {
                                        },
                                        AgentPoolProfiles: []*AgentPoolProfile{
                                                {
-                                                       Name:                                "testpool",
-                                                       Count:                               1,
-                                                       VMSize:                              "Standard_D2_v2",
+                                                       Name:   "testpool",
+                                                       Count:  1,
+                                                       VMSize: "Standard_D2_v2",
                                                        AcceleratedNetworkingEnabledWindows: to.BoolPtr(trueVal),
                                                },
                                        },
welcome[bot] commented 4 years ago

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

devigned commented 4 years ago

I believe this issue should be resolved via: https://github.com/Azure/aks-engine/pull/2022 and #2006.

If you upgrade to https://github.com/Azure/aks-engine/releases/tag/v0.38.9, you should be fixed.

/cc @jackfrancis please correct me if I've misstated something.

jackfrancis commented 4 years ago

@PascalVA Thank you for that detailed write-up, your summary of the situation is spot on.

There are two ways to "re-deploy bug fixes" via aks-engine:

  1. Using a newer version of aks-engine that has the bug fix in it, scale out your node pool, validate that the new nodes fix the issue, and then manually delete the old instances (remember to kubectl cordon first!).
  2. Use aks-engine upgrade against an old cluster using a newer version of aks-engine. This is a monolithic operational procedure that will ARM re-deploy all nodes, including control plane nodes.

Step #1 is safer, and is more tolerant to environmental failure, but more manual. Although I would argue if you are validating whether or not a newer version of aks-engine actually has the fix you want, using that newer version and executing a scale out against the cluster you want to fix is always the first thing you should try, as the downside is minimal (a period of time when you have a broken extra node, and of course your own troubleshooting time/effort) vs potentially breaking your cluster even more. (It should be noted that step #1 doesn't address bugs you want to apply onto your control plane vms.)

You are also correct that the VHD (OS image) reference is not exposed via the api model. We could reconsider that, but so far it's made sense to essentially always pin obscure the specific VHD image behind the "distro" property, which essentially is a reference to a mutable "latest"-style image.

Definitely interested in your feedback, thanks!

jackfrancis commented 4 years ago

@PascalVA I took the liberty of renaming this issue to capture the general problem, so that other folks experiencing it might land here

I think we should aim to produce some troubleshooting documentation as an outcome of this thread to better help folks build workflows in the future.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.