elastic / terraform-provider-ec

https://registry.terraform.io/providers/elastic/ec/latest/docs
Apache License 2.0

Resizing a deployment with terraform resets the default cloud snapshot policy to its defaults #854

Closed frconil closed 1 month ago

frconil commented 2 months ago

Readiness Checklist

Expected Behavior

Given a manifest that both defines the size of the deployment (e.g. the number of nodes for a given tier) and updates the default SLM policy, I would expect the SLM policy to stay the same across terraform runs as long as I do not change it in the manifest.

Current Behavior

If I modify the number of nodes, the first terraform apply also (silently) resets the SLM policy to the cloud defaults. A second terraform apply then updates the SLM policy back to what the terraform definition specifies.

Terraform definition

terraform {
  required_version = ">= 1.0.0"
  required_providers {
    ec = {
      source = "elastic/ec"
    }
    elasticstack = {
      source = "elastic/elasticstack"
    }
  }
}

provider "ec" {
apikey = "REDACTED"
}

resource "ec_deployment" "custom-deployment" {
  name                   = "My deployment identifier"
  region                 = "gcp-europe-west3"
  version                = "8.15.0"
  deployment_template_id = "gcp-memory-optimized-v2"

 elasticsearch = {
    hot = {
      size = "4g"
      zone_count="3"
      autoscaling = {}
    }
  }
  kibana = {}
}

provider "elasticstack" {
  elasticsearch {
    username = ec_deployment.custom-deployment-fc.elasticsearch_username
    password = ec_deployment.custom-deployment-fc.elasticsearch_password
  endpoints = ["${ec_deployment.custom-deployment-fc.elasticsearch.https_endpoint}"]
  }
}

resource "elasticstack_elasticsearch_snapshot_lifecycle" "cloud-snapshot-policy" {
  name = "cloud-snapshot-policy"
  schedule      = "0 0 1 * * ?"
  snapshot_name = "<cloud-snapshot-{now/d}>"
  repository    = "found-snapshots"
  include_global_state = true
  expire_after = "30d"
  min_count    = 5
  max_count    = 50
}

Steps to Reproduce

  1. Using the manifest above, update the zone_count parameter up or down.
  2. Run terraform apply twice.
  3. The first run will return:
Terraform will perform the following actions:

  # ec_deployment.custom-deployment will be updated in-place
  ~ resource "ec_deployment" "custom-deployment" {
      ~ elasticsearch          = {
          ~ hot            = {
              ~ node_roles                       = [
                  - "data_content",
                  - "data_hot",
                  - "ingest",
                  - "master",
                  - "remote_cluster_client",
                  - "transform",
                ] -> (known after apply)
              ~ zone_count                       = 3 -> 2
                # (5 unchanged attributes hidden)
            }
            # (10 unchanged attributes hidden)
        }
        id                     = "REDACTED"
        name                   = "My deployment identifier"
        # (9 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.
  4. The second run will return:
Terraform will perform the following actions:

  # elasticstack_elasticsearch_snapshot_lifecycle.cloud-snapshot-policy will be updated in-place
  ~ resource "elasticstack_elasticsearch_snapshot_lifecycle" "cloud-snapshot-policy" {
      ~ expire_after         = "259200s" -> "30d"
        id                   = "7DvRmusHQTSmaV8pvdQcGw/cloud-snapshot-policy"
      ~ max_count            = 100 -> 50
      ~ min_count            = 10 -> 5
        name                 = "cloud-snapshot-policy"
      ~ partial              = true -> false
      ~ schedule             = "0 */30 * * * ?" -> "0 0 1 * * ?"
        # (7 unchanged attributes hidden)
    }

Running the same manifest without specifying the deployment size, for instance:

  elasticsearch = {
    hot = {
      autoscaling = {}
    }
  }

does not modify the policy regardless of any resizing operations in the cloud console UI, which points the issue towards cluster sizing.

Context

This can be a problem when configuring retention periods longer than the default: as shown in the second plan above, the reset drops expire_after from 30d back to the cloud default of 259200s (3 days), so the SLM policy could delete older snapshots before the change is picked up again.

As the first apply also resets the SLM policy silently, this could introduce changes that go unnoticed until the next update of the terraform manifests, if they are noticed at all.
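
Since the reset is silent, one way to spot it between the two applies is to read the policy back directly. A minimal check, assuming curl and placeholder ES_ENDPOINT / ES_USER / ES_PASS values for the deployment (Kibana Dev Tools works just as well):

# Right after the first apply, the live policy shows the cloud defaults
# (e.g. max_count 100, min_count 10, expire_after 259200s) instead of the
# values from the manifest.
curl -s -u "$ES_USER:$ES_PASS" "$ES_ENDPOINT/_slm/policy/cloud-snapshot-policy"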

Possible Solution

A workaround is to either perform resize operations via the web interface, or to manually edit the SLM retention schedule so that snapshot cleanup cannot happen in the middle of terraform changes.
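
For the second workaround, the relevant knob is the slm.retention_schedule cluster setting, which controls when SLM retention deletes expired snapshots (the Elasticsearch default is "0 30 1 * * ?", i.e. 01:30 daily). A minimal sketch, again assuming curl and the same placeholder endpoint/credentials as above:

# Temporarily move SLM retention away from the change window:
curl -s -u "$ES_USER:$ES_PASS" -X PUT "$ES_ENDPOINT/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{"persistent": {"slm.retention_schedule": "0 30 23 * * ?"}}'

# Once the second terraform apply has restored the policy, drop the override:
curl -s -u "$ES_USER:$ES_PASS" -X PUT "$ES_ENDPOINT/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{"persistent": {"slm.retention_schedule": null}}'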

Your Environment

gigerdo commented 1 month ago

Will be fixed with the 0.12.0 release this week.
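
Once that release is available, constraining the provider makes sure the fix is actually picked up, e.g. version = ">= 0.12.0" under the ec entry in required_providers, followed by:

terraform init -upgrade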