eddycharly / terraform-provider-kops

Brings kOps into terraform in a fully managed way
Apache License 2.0

GPU instance groups apply loop #1064

Open ddelange opened 1 year ago

ddelange commented 1 year ago

Hi 👋

We've upgraded from kops 1.23 to 1.26 (provider 1.26.0-rc1). The upgrade was successful after some trial and error. Now, when we run apply again, the updater is always triggered:

  # kops_instance_group.workers["ondemand-amd-32GiB-8vCPU-1GPU-eu-central-1c"] will be updated in-place
  ~ resource "kops_instance_group" "workers" {
        id                           = "domain.com/ondemand-amd-32GiB-8vCPU-1GPU-eu-central-1c"
        name                         = "ondemand-amd-32GiB-8vCPU-1GPU-eu-central-1c"
      ~ node_labels                  = {
          - "kops.k8s.io/gpu" = "1" -> null
        }
      ~ revision                     = 7 -> 8
      ~ taints                       = [
            "arch=amd64:PreferNoSchedule",
          - "nvidia.com/gpu:NoSchedule",
        ]
        # (29 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

The corresponding RuntimeClass looks OK: https://github.com/kubernetes/kops/blob/v1.26.4/upup/models/cloudup/resources/addons/nvidia.addons.k8s.io/k8s-1.16.yaml.template#L44-L59

But somehow the node labels and taints are not in sync between kops and the cluster anymore 🤔
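A possible stopgap, sketched here as an untested assumption rather than a confirmed fix: if the label and taint in the diff above are being injected on the kOps side for GPU instance groups, declaring them explicitly in the Terraform config should make the desired state match what is read back, so the plan has nothing to remove. The attribute names and values are taken from the plan output. The elided arguments stand for whatever the resource already sets, and a `for_each` setup would need this applied only to the GPU groups.

```hcl
# Sketch only: mirrors the values shown in the plan diff above.
resource "kops_instance_group" "workers" {
  # ... existing arguments unchanged ...

  # Label that the plan wants to remove ("1" -> null).
  node_labels = {
    "kops.k8s.io/gpu" = "1"
  }

  # Keep the existing taint and add the one the plan wants to remove.
  taints = [
    "arch=amd64:PreferNoSchedule",
    "nvidia.com/gpu:NoSchedule",
  ]
}
```

If the next plan comes back empty for this resource, that would suggest the loop comes from values being added outside the Terraform config rather than from real drift in the cluster.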

ddelange commented 1 year ago

Same thing with public_name, which we don't specify in our api block but which probably gets auto-added by the kops provider:

  # kops_cluster.cluster will be updated in-place
  ~ resource "kops_cluster" "cluster" {
        id                    = "domain.com"
        name                  = "domain.com"
      ~ revision              = 35 -> 36
        # (13 unchanged attributes hidden)

      ~ api {
          - public_name     = "api.kops.domain.com" -> null
            # (2 unchanged attributes hidden)

            # (1 unchanged block hidden)
        }

        # (12 unchanged blocks hidden)
    }
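
The same kind of stopgap might apply here, again as an untested assumption: setting public_name explicitly to the value shown in the diff so that the config and the stored state agree. The elided parts stand for the existing cluster arguments and the rest of the api block.

```hcl
# Sketch only: pins the value the plan currently wants to null out.
resource "kops_cluster" "cluster" {
  # ... existing arguments unchanged ...

  api {
    # ... existing attributes and nested block unchanged ...
    public_name = "api.kops.domain.com"
  }
}
```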