GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

Cannot upgrade (manual or release channel) ContainerCluster or ContainerNodePool versions #142

Closed. tonybenchsci closed this issue 4 years ago.

tonybenchsci commented 4 years ago

Describe the bug
We create GKE clusters using KCC without specifying ContainerNodePool spec.version or ContainerCluster spec.nodeVersion. After "The resource is up to date", the spec has acquired the current version/nodeVersion of the master/node(s) created. If we have a release channel set, there is an infinite loop of version flip-flopping during the scheduled upgrade (GKE upgrades, KCC downgrades); our workaround is to always delete the default node pool with the cnrm.cloud.google.com/remove-default-node-pool annotation, and then create ContainerNodePools with spec.management.autoUpgrade set to false. In addition, if we manually upgrade a nodepool version via the GCP Console, the version is downgraded by KCC, causing temporary downtime but no loop, as GKE gives up upgrading.

ConfigConnector Version
We consistently see this on all versions, and have explicitly installed and used:

1.7.0
1.6.1
1.4.0
1.0.0

To Reproduce

Run kubectl apply -f - on the following YAML, and then manually upgrade the node/master version in the GCP Console UI.

Alternatively, wait for a scheduled GKE upgrade after doing any of:

YAML snippets:

---
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  annotations:
    cnrm.cloud.google.com/remove-default-node-pool: "true"
  name: koma
  namespace: <REDACTED>
spec:
  authenticatorGroupsConfig:
    securityGroup: gke-security-groups@<REDACTED>.com
  initialNodeCount: 1
  ipAllocationPolicy:
    clusterIpv4CidrBlock: /14
    servicesIpv4CidrBlock: /20
  location: us-central1-a
  loggingService: logging.googleapis.com/kubernetes
  masterAuthorizedNetworksConfig:
    cidrBlocks:
    - cidrBlock: 104.132.0.0/14
      displayName: google-office
    - cidrBlock: <REDACTED>
      displayName: office
  monitoringService: monitoring.googleapis.com/kubernetes
  workloadIdentityConfig:
    identityNamespace: <REDACTED>.svc.id.goog
---
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: koma1
  namespace: <REDACTED>
spec:
  autoscaling:
    maxNodeCount: 5
    minNodeCount: 1
  clusterRef:
    name: koma
  initialNodeCount: 1
  location: us-central1-a
  management:
    autoRepair: true
    autoUpgrade: false
  nodeConfig:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-standard-2
    metadata:
      disable-legacy-endpoints: "true"
    oauthScopes:
    - https://www.googleapis.com/auth/cloud-platform
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/trace.append

jcanseco commented 4 years ago

Hi @tonybenchsci. Thank you for the detailed bug report and workaround. This is a known issue, and we're actively working on a fix. We'll let you know when we have more information.

tonybenchsci commented 4 years ago

@jcanseco Does v1.7.2 fix this?

kibbles-n-bytes commented 4 years ago

Hey @tonybenchsci , 1.7.2 adds a patch for the following:

If spec.releaseChannel.channel is set, then spec.nodeVersion will be synced to whatever GKE auto-updates it to, unless you have explicitly kubectl apply-d a particular spec.nodeVersion.
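
For illustration, a minimal ContainerCluster sketch of that setup, reusing the cluster from the reproduction YAML above (the STABLE channel value is just an example):

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  annotations:
    cnrm.cloud.google.com/remove-default-node-pool: "true"
  name: koma
  namespace: <REDACTED>
spec:
  location: us-central1-a
  initialNodeCount: 1
  releaseChannel:
    channel: STABLE  # GKE schedules upgrades for this cluster
  # spec.nodeVersion intentionally left unset: with 1.7.2, KCC syncs it to
  # whatever GKE auto-upgrades the master to, instead of reverting it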

We're planning a more generic feature across our resources to allow for fields to be designated as cloud-managed (with documentation at that time to match), but we hope this patch helps fix this particular issue. Please let us know if the patch fixes the immediate issue or if there's anything else regarding ContainerCluster that is fighting updates in unexpected ways.

tonybenchsci commented 4 years ago

Thanks @kibbles-n-bytes we've installed v1.7.2 and now things get interesting:

Since we cannot update the clusters (KCC admission control returns "cannot make changes on immutable fields: [releaseChannel.channel]"), there seem to be two options:

A: Re-create all our clusters with a releaseChannel (stable or regular). This will require migrating all current application workloads!

B: Use KCC to abandon (annotate + delete) all our clusters (assuming nodepools don't kick off deletion-defender; otherwise also abandon the nodepools), use gcloud beta to update the releaseChannel setting, and then have KCC acquire all the resources again. This won't disrupt workloads, but it needs to be done before the STABLE channel moves up in version, per https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels#enrolling_an_existing_cluster_in_a_release_channel
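
Roughly, option B is something like this sketch (the deletion-policy annotation is KCC's standard abandon mechanism; the manifest filename and the exact gcloud invocation are assumptions):

# Tell KCC to abandon (not delete) the underlying GKE cluster
kubectl annotate containercluster koma -n <REDACTED> \
  cnrm.cloud.google.com/deletion-policy=abandon
kubectl delete containercluster koma -n <REDACTED>

# Enroll the existing cluster in a release channel outside of KCC
gcloud beta container clusters update koma --zone us-central1-a \
  --release-channel stable

# Re-acquire: re-apply the ContainerCluster manifest (now including
# spec.releaseChannel) so KCC adopts the existing cluster again
kubectl apply -f containercluster-koma.yaml  # placeholder filename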

Would it be possible to make that field mutable in admission control, so that KCC can leverage the API for enrolling an existing cluster in a release channel?

caieo commented 4 years ago

Hi @tonybenchsci, unfortunately those two options seem like the best way to work around this bug. Sorry that this is not ideal. The releaseChannel field should be mutable, so we're working on fixing this. We'll post updates in this issue thread.

tonybenchsci commented 4 years ago

@kibbles-n-bytes @caieo @jcanseco I suspect you unpinned spec.nodeVersion on ContainerCluster but still pin spec.version on ContainerNodePool; we define both kinds with KCC (edit: we define the clusters/pools but let the versions default).

We're using KCC v1.7.2 (also just tried v1.8.0) and the GKE version loop still seems to be a problem. I just implemented option B above of subscribing to the STABLE releaseChannel by abandoning and re-acquiring the clusters. GCP starts the upgrade, and while spec.nodeVersion on the ContainerCluster is now dynamic (fixed in v1.7.2) and follows GCP (1.14.10-gke.27), spec.version on the ContainerNodePool still tries to stabilize at the previously defaulted value of 1.14.10-gke.24.

The workaround was to abandon all nodepools during the upgrade and re-acquire them afterwards (with an additional spec for workloadMetadataConfig, which was defaulted before); see the sketch below.
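
A rough sketch of that nodepool abandon/re-acquire flow, assuming the koma1 nodepool from the YAML above (the manifest filename is a placeholder):

# Abandon the nodepool so KCC stops reverting its version mid-upgrade
kubectl annotate containernodepool koma1 -n <REDACTED> \
  cnrm.cloud.google.com/deletion-policy=abandon
kubectl delete containernodepool koma1 -n <REDACTED>

# ...wait for the GKE upgrade to finish...

# Re-acquire: re-apply the ContainerNodePool manifest, with spec.version left
# unset and workloadMetadataConfig spelled out explicitly
kubectl apply -f containernodepool-koma1.yaml  # placeholder filename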

kibbles-n-bytes commented 4 years ago

Hey @tonybenchsci , I'm really sorry this is still causing issues. You're right; we addressed spec.nodeVersion in ContainerCluster but we have not yet addressed spec.version in ContainerNodePool, so non-default nodepools would still exhibit this behavior. We'll get a patch into next week's release for this as well.

kibbles-n-bytes commented 4 years ago

Hey @tonybenchsci , just a note: 1.9.0 is considered the delayed release for last week and does not include the change. We are targeting the 1.9.1 release later this week as the one with the patch in place.

tonybenchsci commented 4 years ago

Awesome, thanks @kibbles-n-bytes. Will 1.9.1 address both ContainerNodePool spec.version and the other slightly related issue with spec.nodeCount?

xiaobaitusi commented 4 years ago

1.9.1 will have the fix for ContainerNodePool spec.version; the fix for spec.nodeCount should come afterward.

tonybenchsci commented 4 years ago

Thanks guys! (And nice to meet some of you on the Hangout today.) Installed 1.9.1 and closing this issue. Looking forward to the fix for ContainerNodePool spec.nodeCount and spec.initialNodeCount. Basically, as the other issue points out, when a nodepool auto-scales we get "Update call failed: the desired mutation for the following field(s) is invalid: [initialNodeCount]".