digitalocean / terraform-provider-digitalocean

Terraform DigitalOcean provider
https://registry.terraform.io/providers/digitalocean/digitalocean/latest/docs

Do not delete / recreate cluster when changing node pool #424

Open gibiansky opened 4 years ago

gibiansky commented 4 years ago

When you create a Kubernetes cluster with one node pool and afterward change the node pool's properties (e.g. its size), Terraform deletes and recreates the entire cluster.
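
For reference, a minimal configuration that reproduces this (names, region, and size are placeholders):

```hcl
data "digitalocean_kubernetes_versions" "current" {}

resource "digitalocean_kubernetes_cluster" "foo" {
  name    = "example"
  region  = "nyc1"
  version = data.digitalocean_kubernetes_versions.current.latest_version

  node_pool {
    name       = "default"
    size       = "s-2vcpu-2gb" # changing this forces the whole cluster to be replaced
    node_count = 2
  }
}
```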

This is similar to this issue. If you believe it's a duplicate, feel free to close this one.

This behaviour is problematic because it messes up any authentication you might have, among other things.

It would be great if changing these properties instead added a new node pool and then deleted the previous one, without deleting the cluster. This is possible via the DigitalOcean UI, though I don't know whether there's a technical reason this behaviour isn't implemented or is hard to implement.

Thanks!

andrewsomething commented 4 years ago

Just dropping some notes here in case anyone else takes a look at this one...

```diff
--- a/digitalocean/resource_digitalocean_kubernetes_node_pool.go
+++ b/digitalocean/resource_digitalocean_kubernetes_node_pool.go
@@ -43,6 +43,9 @@ func nodePoolResourceSchema() map[string]*schema.Schema {
                ForceNew:     true,
        }

+       // Size should force a new node pool but not a new cluster
+       s["size"].ForceNew = true
+
        // remove the id when this is used in a specific resource
        // not as a child
        delete(s, "id")
@@ -65,7 +68,6 @@ func nodePoolSchema() map[string]*schema.Schema {
                "size": {
                        Type:         schema.TypeString,
                        Required:     true,
-                       ForceNew:     true,
                        ValidateFunc: validation.NoZeroValues,
                },
```
"Do you want to perform these actions?"
  # digitalocean_kubernetes_cluster.foo will be updated in-place
  ~ resource "digitalocean_kubernetes_cluster" "test" {

   # snip...  

      ~ node_pool {
            actual_node_count = 2
            auto_scale        = false
            id                = "348c6394-44d5-41e2-914a-721dfe759ed6"
            labels            = {}
            max_nodes         = 0
            min_nodes         = 0
            name              = "default"
            node_count        = 2
            nodes             = [
                {
                    created_at = "2020-06-25 18:19:23 +0000 UTC"
                    droplet_id = "197500730"
                    id         = "73a3aa23-d48f-4833-8e00-68e6ed9a908d"
                    name       = "default-3ypbn"
                    status     = "running"
                    updated_at = "2020-06-25 18:24:31 +0000 UTC"
                },
                {
                    created_at = "2020-06-25 18:19:23 +0000 UTC"
                    droplet_id = "197500732"
                    id         = "66f40ab6-ca25-4eb6-80be-145a91bb88cd"
                    name       = "default-3ypb3"
                    status     = "running"
                    updated_at = "2020-06-25 18:24:31 +0000 UTC"
                },
            ]
          ~ size              = "s-2vcpu-2gb" -> "s-1vcpu-2gb"
            tags              = []
        }

Nodes are first drained before being deleted. So in theory, depending on the workload, this might not cause downtime if the new pool is up first. Though we shouldn't make assumptions about what is running in the cluster.

Using `SetNewComputed` or `SetNew` on the entire `node_pool` does not work, as not all of the attributes are computed.

This also complicates https://github.com/terraform-providers/terraform-provider-digitalocean/issues/303

CapCap commented 4 years ago

This is critical for our use case as well :-)

Would it simplify anything to make the node_pool optional, and require there to be an associated digitalocean_kubernetes_node_pool resource?
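
For illustration, that split might look something like this (a hypothetical sketch: today the provider still requires the inline node_pool block on the cluster):

```hcl
data "digitalocean_kubernetes_versions" "current" {}

resource "digitalocean_kubernetes_cluster" "example" {
  name    = "example"
  region  = "nyc1"
  version = data.digitalocean_kubernetes_versions.current.latest_version
  # hypothetically, no inline node_pool block here
}

resource "digitalocean_kubernetes_node_pool" "main" {
  cluster_id = digitalocean_kubernetes_cluster.example.id
  name       = "main"
  size       = "s-2vcpu-2gb" # resizing would replace only this pool, not the cluster
  node_count = 2
}
```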

morgangrubb commented 3 years ago

I have just tested and the digitalocean_kubernetes_node_pool resource behaves exactly the same way when changing the node size - that is, it destroys the pool and then creates a new one.
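
Until the provider handles this, one thing that might soften the destroy-then-create order is create_before_destroy (an untested sketch: it assumes DOKS accepts the replacement pool while the old one drains, the pool name must change because both pools briefly coexist, and the cluster resource address is a placeholder):

```hcl
variable "worker_size" {
  type    = string
  default = "s-2vcpu-2gb"
}

resource "digitalocean_kubernetes_node_pool" "workers" {
  cluster_id = digitalocean_kubernetes_cluster.example.id # cluster declared elsewhere
  # Embed the size in the name so the new pool doesn't collide with the old one.
  name       = "workers-${var.worker_size}"
  size       = var.worker_size
  node_count = 2

  lifecycle {
    # Create the replacement pool before destroying the old one.
    create_before_destroy = true
  }
}
```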

tomjohnburton commented 3 years ago

Would love this feature as well

loarca commented 2 years ago

I'd LOVE this change, for real

dvcrn commented 2 years ago

Scaling the default node pool in general is a bit funky. If I change the size of the default node pool, I don't get a full recreation, but I do get a lot of these, since the provider can no longer connect to the cluster:

```
dial tcp 127.0.0.1:80: connect: connection refused
```

I tried creating a separate node pool with digitalocean_kubernetes_node_pool, then changing the default node pool name to be identical to the created one in the hopes that it wouldn't break, but I got the same connection refused error.

I don't mind manual steps like adding the new node pool and then removing the old one, but I couldn't find a proper way to tell the provider that the new node pool should be the default one.
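
For what it's worth, that 127.0.0.1:80 error usually means a kubernetes or helm provider fell back to its defaults because its credentials come from the cluster resource that is mid-replacement. A sketch of the usual wiring (the resource address `digitalocean_kubernetes_cluster.cluster` is a placeholder; attribute names are from the provider docs):

```hcl
provider "kubernetes" {
  host  = digitalocean_kubernetes_cluster.cluster.endpoint
  token = digitalocean_kubernetes_cluster.cluster.kube_config[0].token
  cluster_ca_certificate = base64decode(
    digitalocean_kubernetes_cluster.cluster.kube_config[0].cluster_ca_certificate
  )
}
```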

mkjmdski commented 2 years ago

This is a real buzz killer: you need to recreate the control plane just to resize the default node pool. I'd like to be able to create a Kubernetes cluster without a default node pool; there's no way to perform stable updates with the current setup... This issue has been open for a very long time. Can someone from @digitalocean take a look?

lemeurherve commented 2 years ago

As a workaround, for the Jenkins infrastructure we went with a cluster that has a minimal node pool just to keep the cluster safe, plus another autoscaled node pool which can be scaled and modified at ease. It burns a little more resources, but it's the only way we found to tune our cluster nodes. Allowing a node pool to scale to zero would be greatly appreciated for this second autoscaled pool too, but that's another topic ^^
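
Concretely, the shape looks roughly like this (a sketch; names, region, and sizes are placeholders):

```hcl
data "digitalocean_kubernetes_versions" "current" {}

resource "digitalocean_kubernetes_cluster" "main" {
  name    = "main"
  region  = "nyc1"
  version = data.digitalocean_kubernetes_versions.current.latest_version

  # Minimal default pool that is never changed, so the cluster is never replaced.
  node_pool {
    name       = "default"
    size       = "s-1vcpu-2gb"
    node_count = 1
  }
}

# The pool we actually tune; resizing it replaces only this resource.
resource "digitalocean_kubernetes_node_pool" "autoscaled" {
  cluster_id = digitalocean_kubernetes_cluster.main.id
  name       = "autoscaled"
  size       = "s-4vcpu-8gb"
  auto_scale = true
  min_nodes  = 1
  max_nodes  = 5
}
```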

mkjmdski commented 2 years ago

If somebody doesn't want to burn resources, I've figured out how to do it :). According to the docs (https://registry.terraform.io/providers/digitalocean/digitalocean/latest/docs/resources/kubernetes_node_pool):

> Note: If the node pool has the terraform:default-node-pool tag, then it is a default node pool for an existing cluster. The provider will refuse to import the node pool in that case because the node pool is managed by the digitalocean_kubernetes_cluster resource and not by this digitalocean_kubernetes_node_pool resource.

So I ended up writing code like this:

```hcl
resource "digitalocean_kubernetes_cluster" "cluster" {
  name          = var.name
  region        = var.region
  auto_upgrade  = true
  version       = data.digitalocean_kubernetes_versions.version.latest_version
  vpc_uuid      = var.vpc_uuid
  surge_upgrade = false

  # Throwaway default pool; the script below removes it after creation.
  node_pool {
    name       = format("%s-hacky", local.node_pool_name)
    size       = "s-1vcpu-1gb"
    node_count = 1
  }

  maintenance_policy {
    start_time = "04:00"
    day        = "monday"
  }

  # Ignore the default pool so Terraform doesn't try to recreate it
  # after the script removes it.
  lifecycle {
    ignore_changes = [
      node_pool.0
    ]
  }
}

resource "null_resource" "remove_cluster_node_pool" {
  triggers = {
    cluster_id = digitalocean_kubernetes_cluster.cluster.id
  }
  provisioner "local-exec" {
    command = format("%s/remove-default-node-pool.sh %s", path.module, digitalocean_kubernetes_cluster.cluster.id)
  }
}

resource "digitalocean_kubernetes_node_pool" "default_node_pool" {
  cluster_id = digitalocean_kubernetes_cluster.cluster.id
  name       = local.node_pool_name
  size       = var.size
  auto_scale = true
  min_nodes  = 1
  max_nodes  = 1
  tags       = local.tags # must include terraform:default-node-pool
}
```

This creates the real node pool alongside the hacky default pool; once the cluster exists, the null resource runs this script:

```bash
#!/bin/bash
DOCTL_VERSION="1.70.0"
cluster_id="$1"
wget "https://github.com/digitalocean/doctl/releases/download/v${DOCTL_VERSION}/doctl-${DOCTL_VERSION}-linux-amd64.tar.gz"
tar xf "doctl-${DOCTL_VERSION}-linux-amd64.tar.gz"
# Only delete the hacky pool once another pool exists, so the cluster
# is never left without nodes.
if ./doctl --access-token "$DIGITALOCEAN_TOKEN" kubernetes cluster node-pool list "${cluster_id}" --no-header | grep -qv 'hacky'; then
    node_pool_id=$(./doctl --access-token "$DIGITALOCEAN_TOKEN" kubernetes cluster node-pool list "${cluster_id}" --no-header | grep 'hacky' | awk '{print $1}')
    ./doctl --access-token "$DIGITALOCEAN_TOKEN" kubernetes cluster node-pool delete "${cluster_id}" "${node_pool_id}" --force
fi
```

It's just important that your `local.tags` contains `terraform:default-node-pool`. Now you need to manually remove the cluster from state and re-import it. You should then have the node pool separated from the main cluster. This is super hacky, but it achieves the goal :)

lemeurherve commented 2 years ago

Interesting, thanks very much for the detailed information @mkjmdski!

darzanebor commented 1 year ago

up

flyte commented 1 year ago

> Would it simplify anything to make the node_pool optional, and require there to be an associated digitalocean_kubernetes_node_pool resource?

This would be preferable, as we may create a K8s cluster with some small shared-CPU instances and later need to scale up to general-purpose nodes. If we could simply create a new node pool, apply, then remove or scale to zero the original one, this wouldn't really be a problem. It's only such a big issue because we can't get rid of the default node pool completely, so we're stuck with whatever node size we chose for the lifetime of the K8s cluster.

flyte commented 1 year ago

I was able to work around this manually by:

  1. Add new worker pool in DO UI
  2. Wait for nodes to become Ready in kubectl get nodes
  3. Remove old worker pool in DO UI
  4. Add the terraform:default-node-pool tag to the new worker pool in DO UI
  5. Update the worker size and pool name in the digitalocean_kubernetes_cluster resource's node_pool block
  6. Remove the k8s cluster from the Terraform state with terraform state rm digitalocean_kubernetes_cluster.k8s
  7. Import the k8s cluster with `terraform import digitalocean_kubernetes_cluster.k8s <my k8s cluster id>`, using the cluster's ID from the DO UI (a declarative alternative to steps 6 and 7 is sketched after this list)
  8. Run terraform apply to make sure there are no further changes

I had some issues with the node_pool size being 0 but it may be because I renamed my node pool after creating it.
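
On newer Terraform versions, steps 6 and 7 can also be expressed declaratively rather than via the CLI (a sketch: `removed` blocks need Terraform 1.7+, `import` blocks 1.5+, and they go in two separate applies):

```hcl
# First apply: forget the cluster from state without destroying it.
removed {
  from = digitalocean_kubernetes_cluster.k8s

  lifecycle {
    destroy = false
  }
}

# Second apply (after deleting the removed block above): re-import the cluster.
import {
  to = digitalocean_kubernetes_cluster.k8s
  id = "<my k8s cluster id>"
}
```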

MatsKeWorks commented 10 months ago

@ChiefMateStarbuck what is the status of this at the moment? There are currently a lot of issues regarding this problem.

ChiefMateStarbuck commented 10 months ago

CC @andrewsomething @danaelhe