eddycharly / terraform-provider-kops

Brings kOps into terraform in a fully managed way
Apache License 2.0

Provider error when creating cluster updater #863

Closed. wrossmann closed this issue 1 year ago.

wrossmann commented 1 year ago

I am trying to integrate Kops with some existing infrastructure, but the provider keeps giving me the following error when I try to apply:

kops_cluster.dp: Modifying... [id=k8s.test.company.aws]
kops_cluster.dp: Modifications complete after 3s [id=k8s.test.company.aws]
kops_cluster_updater.updater: Creating...
╷
│ Error: Request cancelled
│ 
│   with kops_cluster_updater.updater,
│   on kops.tf line 127, in resource "kops_cluster_updater" "updater":
│  127: resource "kops_cluster_updater" "updater" {
│ 
│ The plugin.(*GRPCProvider).ApplyResourceChange request was cancelled.

And here is what seems to be the relevant output with TF_LOG=debug:

kops_cluster.dp: Modifying... [id=k8s.test.company.aws]
2022-12-19T12:51:43.162-0800 [INFO]  Starting apply for kops_cluster.dp
2022-12-19T12:51:43.165-0800 [DEBUG] kops_cluster.dp: applying the planned Update change
2022-12-19T12:51:45.965-0800 [WARN]  unexpected data: registry.terraform.io/eddycharly/kops:stderr="W1219 12:51:45.965104 1165312 vfs_castore.go:382] CA private key was not found"
2022-12-19T12:51:46.027-0800 [WARN]  Provider "provider[\"registry.terraform.io/eddycharly/kops\"]" produced an unexpected new value for kops_cluster.dp, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .admin_ssh_key: inconsistent values for sensitive attribute
      - .api: block count changed from 0 to 1
      - .authorization: block count changed from 0 to 1
kops_cluster.dp: Modifications complete after 3s [id=k8s.test.company.aws]
kops_cluster_updater.updater: Creating...
2022-12-19T12:51:46.056-0800 [INFO]  Starting apply for kops_cluster_updater.updater
2022-12-19T12:51:46.067-0800 [DEBUG] kops_cluster_updater.updater: applying the planned Create change
2022-12-19T12:51:46.492-0800 [WARN]  unexpected data: registry.terraform.io/eddycharly/kops:stderr="W1219 12:51:46.492060 1165312 populate_cluster_spec.go:147] EtcdMembers are in the same InstanceGroup "master" in etcd-cluster "main" (fault-tolerance may be reduced)"
2022-12-19T12:51:46.492-0800 [WARN]  unexpected data: registry.terraform.io/eddycharly/kops:stderr="W1219 12:51:46.492116 1165312 populate_cluster_spec.go:147] EtcdMembers are in the same InstanceGroup "master" in etcd-cluster "main" (fault-tolerance may be reduced)"
2022-12-19T12:51:46.492-0800 [WARN]  unexpected data: registry.terraform.io/eddycharly/kops:stderr="W1219 12:51:46.492148 1165312 populate_cluster_spec.go:147] EtcdMembers are in the same InstanceGroup "master" in etcd-cluster "events" (fault-tolerance may be reduced)
W1219 12:51:46.492175 1165312 populate_cluster_spec.go:147] EtcdMembers are in the same InstanceGroup "master" in etcd-cluster "events" (fault-tolerance may be reduced)"
2022-12-19T12:51:48.002-0800 [WARN]  unexpected data: registry.terraform.io/eddycharly/kops:stdout=*********************************************************************************
2022-12-19T12:51:48.002-0800 [WARN]  unexpected data: registry.terraform.io/eddycharly/kops:stdout="A new kubernetes version is available: 1.25.5
Upgrading is recommended (try kops upgrade cluster)"
2022-12-19T12:51:48.002-0800 [WARN]  unexpected data: registry.terraform.io/eddycharly/kops:stdout="More information: https://github.com/kubernetes/kops/blob/master/permalinks/upgrade_k8s.md#1.25.5

*********************************************************************************"
2022-12-19T12:51:53.850-0800 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2022-12-19T12:51:53.850-0800 [ERROR] plugin.(*GRPCProvider).ApplyResourceChange: error="rpc error: code = Unavailable desc = transport is closing"
2022-12-19T12:51:53.850-0800 [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.terraform.io/eddycharly/kops/1.25.3/linux_amd64/terraform-provider-kops_v1.25.3 pid=1165312 error="exit status 255"
2022-12-19T12:51:53.850-0800 [ERROR] vertex "kops_cluster_updater.updater" error: Plugin did not respond

Below is my kops config:

provider "kops" {
  state_store = "s3://${aws_s3_bucket.kops-state.bucket}"
  aws {
    region  = var.env[local.env_name].region
    profile = var.env[local.env_name].config_profile
    assume_role {
      role_arn = var.env[local.env_name].role_arn
    }
  }
}

locals {
  masterType  = "t3a.medium"
  masterCount = 3
  nodeType    = "t3a.medium"
  nodeCount   = 4
  clusterName = aws_route53_zone.kops-cluster-zone.name
  dnsZone     = aws_route53_zone.kops-cluster-zone.name
  vpcId       = aws_vpc.default.id
  privateSubnets = [ for i in aws_subnet.private : { subnetId = i.id, zone = i.availability_zone } ]
  utilitySubnets = [ for i in aws_subnet.public  : { subnetId = i.id, zone = i.availability_zone } ]
}

resource "kops_cluster" "dp" {
  name               = local.clusterName
  admin_ssh_key      = file("~/.ssh/company.pub")
  kubernetes_version = "1.25.0"
  dns_zone           = local.dnsZone
  network_id         = local.vpcId

  cloud_provider {
    aws {}
  }

  iam {
    allow_container_registry = true
  }
  kubelet {
    anonymous_auth {
      value = false
    }
  }

  kube_api_server {
    anonymous_auth {
      value = false
    }
  }
  networking {
    calico {}
  }

  topology {
    masters = "private"
    nodes   = "private"
    dns {
      type = "Private"
    }
  }

  dynamic "subnet" {
    for_each    = local.privateSubnets
    content {
      name = "private-${subnet.key}"
      type = "Private"
      provider_id = subnet.value.subnetId
      zone = subnet.value.zone
    }
  }

  dynamic "subnet" {
    for_each = local.utilitySubnets
    content {
      name = "utility-${subnet.key}"
      type = "Utility"
      provider_id = subnet.value.subnetId
      zone = subnet.value.zone
    }
  }

  etcd_cluster {
    name = "main"
    dynamic "member" {
      for_each = range(0, local.masterCount)
      content {
        name = "master-${member.value}"
        instance_group = "master"
      }
    }
  }

  etcd_cluster {
    name = "events"
    dynamic "member" {
      for_each = range(0, local.masterCount)
      content {
        name = "master-${member.value}"
        instance_group = "master"
      }
    }
  }
}

resource "kops_instance_group" "master" {
  cluster_name = kops_cluster.dp.id
  name         = "master"
  role         = "Master"
  min_size     = local.masterCount
  max_size     = local.masterCount
  machine_type = local.masterType
  subnets      = [for i in range(0, length(local.privateSubnets)): "private-${i}" ]
  depends_on   = [kops_cluster.dp]
}

resource "kops_instance_group" "node" {
  cluster_name = kops_cluster.dp.id
  name         = "node"
  role         = "Node"
  min_size     = local.nodeCount
  max_size     = local.nodeCount
  machine_type = local.nodeType
  subnets      = [for i in range(0, length(local.privateSubnets)): "private-${i}" ]
  depends_on   = [kops_cluster.dp]
}

resource "kops_cluster_updater" "updater" {
  cluster_name = kops_cluster.dp.name

  keepers = {
    cluster  = kops_cluster.dp.revision
    master-0 = kops_instance_group.master.revision
    node-0   = kops_instance_group.node.revision
  }

  rolling_update {
    skip                = false
    fail_on_drain_error = true
    fail_on_validate    = true
    validate_count      = 1
  }

  validate {
    skip = false
  }

  depends_on   = [
    kops_cluster.dp,
    kops_instance_group.master,
    kops_instance_group.node
  ]
}
eddycharly commented 1 year ago

Hard to say what's going wrong. Does it fail instantly?

wrossmann commented 1 year ago

No, TF goes through its usual flow and errors out while applying.

eddycharly commented 1 year ago

Maybe you can try to set validate.skip to true.
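
For reference, against the config posted above that just means flipping the flag already present in kops_cluster_updater.updater (a minimal sketch):

  validate {
    skip = true
  }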

Can tf access the API server?

wrossmann commented 1 year ago

Setting validate.skip to true has no apparent effect.

API server? I was not aware that there was a server involved.

If you'd like to question me a bit more directly/actively, I also posted this on Slack, but I thought it would be rude to flag you there directly: https://kubernetes.slack.com/archives/C3QUFP0QM/p1671487804695779

eddycharly commented 1 year ago

The API server is at the heart of a Kubernetes cluster.

The provider definitely needs to access it. Depending on your network topology, that will involve a VPN, Direct Connect, or public access.

This is not a limitation of this provider; it applies to the kOps CLI too. Please check your network setup.

wrossmann commented 1 year ago

Ok but there's no cluster. Am I mistaken in thinking that this provider's resources stand up and bootstrap a k8s cluster from scratch? Should these resources not first spin up the instances/instance groups via AWS APIs before trying to connect?

I've double-checked with terraform show: both the 'master' and 'node' groups are configured to spin up 3 and 4 instances respectively, but I see no instances created. The config is written to the S3 bucket, but seeing as the updater never starts up, it's never applied.

eddycharly commented 1 year ago

This provider won't create the network (VPC, subnets, gateways, etc.).

It will create auto scaling groups and eventually a load balancer in front of your masters. You can check whether some of those resources have been created in AWS.

Depending on your topology, the LB could have a private IP; if that is the case, you will need some kind of VPN to communicate with it.

If no cloud resources have been created, it could be because something is wrong with the subnets, or you didn't provide an IAM role with enough permissions.

I guess this should show up in the logs.

eddycharly commented 1 year ago

If the cluster spec is created in S3, you can try to apply it with the kOps CLI to see if it works.

wrossmann commented 1 year ago

Once I finagled the credentials for the CLI [which doesn't seem to support role assumption at all?], the result of

kops --name k8s.test.company.aws --state s3://company-kops-state/ update cluster

was:

F1221 13:16:33.819378 1295836 task.go:73] found duplicate tasks with name "ManagedFile/manifests-etcdmanager-main-master": 
*fitasks.ManagedFile {"Name":"manifests-etcdmanager-main-master","Lifecycle":"Sync","Base":null,"Location":"manifests/etcd/main-master.yaml","Contents":"...","Public":null}
and
*fitasks.ManagedFile {"Name":"manifests-etcdmanager-main-master","Lifecycle":"Sync","Base":null,"Location":"manifests/etcd/main-master.yaml","Contents":"...","Public":null}

I've reformatted the above for clarity. The Contents I've excerpted are:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    k8s-app: etcd-manager-main
  name: etcd-manager-main
  namespace: kube-system
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - mkfifo /tmp/pipe; (tee -a /var/log/etcd.log < /tmp/pipe & ) ; exec /etcd-manager
      --backup-store=s3://company-kops-state/k8s.test.company.aws/backups/etcd/main
      --client-urls=https://__name__:4001 --cluster-name=etcd --containerized=true
      --dns-suffix=.internal.k8s.test.company.aws --grpc-port=3996 --peer-urls=https://__name__:2380
      --quarantine-client-urls=https://__name__:3994 --v=6 --volume-name-tag=k8s.io/etcd/main
      --volume-provider=aws --volume-tag=k8s.io/etcd/main --volume-tag=k8s.io/role/master=1
      --volume-tag=kubernetes.io/cluster/k8s.test.company.aws=owned > /tmp/pipe 2>&1
    image: registry.k8s.io/etcdadm/etcd-manager:v3.0.20220831@sha256:a91fdaf9b988943a9c73d422348c2383c08dfc2566d4124a39a1b3d785018720
    name: etcd-manager
    resources:
      requests:
        cpu: 200m
        memory: 100Mi
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /rootfs
      name: rootfs
    - mountPath: /run
      name: run
    - mountPath: /etc/kubernetes/pki/etcd-manager
      name: pki
    - mountPath: /var/log/etcd.log
      name: varlogetcd
  hostNetwork: true
  hostPID: true
  priorityClassName: system-cluster-critical
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  volumes:
  - hostPath:
      path: /
      type: Directory
    name: rootfs
  - hostPath:
      path: /run
      type: DirectoryOrCreate
    name: run
  - hostPath:
      path: /etc/kubernetes/pki/etcd-manager-main
      type: DirectoryOrCreate
    name: pki
  - hostPath:
      path: /var/log/etcd.log
      type: FileOrCreate
    name: varlogetcd
status: {}

This seems to have been the result of my misunderstanding of kops "instance groups": defining 3 different 1-member groups [which had seemed redundant to me] is the valid config, rather than the single 3-member group in the config I posted, which is not valid. A rough sketch of that layout is below.
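
Roughly, something like this; the master-0..2 names, the per-group subnet mapping, and the use of count are just my illustration, not a config taken from the provider's docs:

resource "kops_instance_group" "master" {
  count        = local.masterCount
  cluster_name = kops_cluster.dp.id
  name         = "master-${count.index}"
  role         = "Master"
  min_size     = 1
  max_size     = 1
  machine_type = local.masterType
  subnets      = ["private-${count.index}"]
  depends_on   = [kops_cluster.dp]
}

# ...and inside kops_cluster.dp, each etcd member points at its own group:
etcd_cluster {
  name = "main"
  dynamic "member" {
    for_each = range(0, local.masterCount)
    content {
      name           = "master-${member.value}"
      instance_group = "master-${member.value}"
    }
  }
}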

Ultimately I think that the only real issue here is that the provider did not [or could not?] emit the fatal error that the kops CLI does.

Thank you so much for your help and time.

eddycharly commented 1 year ago

Yes, unfortunately logs are not well supported in tf providers. Glad that you sorted it out in the end.