garden-io / garden

Automation for Kubernetes development and testing. Spin up production-like environments for development, testing, and CI on demand. Use the same configuration and workflows at every step of the process. Speed up your builds and test runs via shared result caching.
https://garden.io
Mozilla Public License 2.0

Garden doesn't detect when underlying IaC resources that are expected to change, such as OIDC tokens, are re-issued #3475

Open worldofgeese opened 1 year ago

worldofgeese commented 1 year ago

Bug

Using the gke_auth module to issue auth tokens for a GKE cluster fails at the Garden level.

Google is now pushing for short-lived credentials everywhere, but static credentials are the only way to deploy a cluster using Terraform with Garden. A cluster deployed with short-lived credentials using the Terraform plugin will run successfully and be accessible to the user for one hour (the TTL of an OIDC token). After that, all operations on the cluster error with insufficient privileges.

Expected behavior

Garden should issue a new OIDC token once the token's TTL has expired, using the gke_auth module from Google.
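For context, a kubeconfig that uses an exec credential plugin fetches a fresh token on every call instead of embedding a static one. A minimal sketch of such a user section, assuming Google's currently recommended gke-gcloud-auth-plugin binary is on the PATH (names here are illustrative, not taken from this repo):

```yaml
users:
- name: gke-cluster
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: gke-gcloud-auth-plugin
      # The plugin mints a fresh OIDC token on each invocation,
      # so the kubeconfig never goes stale after the 1h TTL.
      provideClusterInfo: true
```

Something along these lines would sidestep the expiry problem entirely, at the cost of requiring the plugin wherever Garden runs.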

Reproducible example

Use my Dev Container example (the Terraform configuration is under terraform), but replace the following code:

data "google_container_cluster" "gke_cluster" {
  name     = module.gke.name
  location = module.gke.location

  project = var.project_id
}

data "template_file" "kubeconfig" {
  template = file("${path.module}/kubeconfig-template.yaml")

  vars = {
    cluster_name    = module.gke.name
    endpoint        = module.gke.endpoint
    cluster_ca      = module.gke.ca_certificate
    client_cert     = data.google_container_cluster.gke_cluster.master_auth.0.client_certificate
    client_cert_key = data.google_container_cluster.gke_cluster.master_auth.0.client_key
  }
}

resource "local_file" "kubeconfig" {
  filename = "${path.module}/kubeconfig.yaml"
  content  = data.template_file.kubeconfig.rendered
}

resource "kubernetes_cluster_role_binding" "client_admin" {
  metadata {
    name = "client-admin"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "cluster-admin"
  }
  subject {
    kind      = "User"
    name      = "client"
    api_group = "rbac.authorization.k8s.io"
  }
  subject {
    kind      = "ServiceAccount"
    name      = "default"
    namespace = "kube-system"
  }
  subject {
    kind      = "Group"
    name      = "system:masters"
    api_group = "rbac.authorization.k8s.io"
  }
}

with Google's blessed approach using temporary credentials:

module "gke_auth" {
  source = "github.com/terraform-google-modules/terraform-google-kubernetes-engine//modules/auth"

  project_id   = var.project_id
  location     = module.gke.location
  cluster_name = module.gke.name
}

resource "local_file" "kubeconfig" {
  filename = "${path.module}/kubeconfig.yaml"
  content  = module.gke_auth.kubeconfig_raw
}
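For reference (an assumption about the module's output, worth verifying against the terraform-google-modules source): the kubeconfig_raw rendered by gke_auth embeds a bearer token obtained from the google_client_config data source, roughly:

```yaml
users:
- name: gke-cluster
  user:
    # Static OIDC access token, valid for ~1 hour; nothing inside the
    # kubeconfig can refresh it once it expires.
    token: <short-lived-access-token>
```

So unless something re-runs the Terraform data source, the written kubeconfig.yaml goes stale after the TTL.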

Workaround

Using a statically credentialed cluster.

Additional context

This does appear to be solved at the Terraform provider level. My theory is that Garden's caching is working against it here.

Your environment

garden version 0.12.48

twelvemo commented 1 year ago

@worldofgeese do you mean the Kubernetes cluster deployed with Terraform becomes inaccessible to the garden kubernetes provider after the OIDC token expires? If so, it would be useful to get a snippet of the user section in the kubeconfig.

worldofgeese commented 1 year ago

@twelvemo that's my meaning :-)

I believe this issue is related to #3732 and #3708, where Garden does not appear to be reading from the same state that direct Terraform calls use.

twelvemo commented 1 year ago

> @twelvemo that's my meaning :-)
>
> I believe this issue is related to #3732 and #3708 where Garden does not appear to be reading from the same state direct Terraform calls are.

How does it relate to the Terraform state? Does terraform refresh run and overwrite the access token in the kubeconfig?

worldofgeese commented 1 year ago

This issue is still present in Bonsai. To test, add https:// in front of the gcp_repository_url value in my registries.tf file at the following gist:

output "gcp_repository_url" {
  value = "${google_artifact_registry_repository.gcp_repo.location}-docker.pkg.dev/${var.project_id}/${google_artifact_registry_repository.gcp_repo.repository_id}"
}

like so

output "gcp_repository_url" {
  value = "https://${google_artifact_registry_repository.gcp_repo.location}-docker.pkg.dev/${var.project_id}/${google_artifact_registry_repository.gcp_repo.repository_id}"
}

then run garden deploy, then garden publish. Garden will snip the URL from the : onward and try to push to https, which fails. Remove the https:// prefix and re-deploy; it will continue to fail because Garden doesn't pick up the updated output value.

The workaround is to run terraform plan outside Garden, then re-run garden deploy.
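Spelled out, the workaround looks something like this (a usage sketch; the terraform directory name is an assumption based on the repro above):

```
# Refresh Terraform state outside of Garden so the updated output
# value actually lands in the state file
terraform -chdir=terraform plan

# Then let Garden pick up the refreshed outputs
garden deploy
```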

@edvald is this another caching issue perhaps?

edvald commented 1 year ago

This is specifically a concern when you have a Terraform stack at the provider level (as opposed to in a module/action). Provider outputs are cached (for many good reasons), so you need to run Garden with --force-refresh in this scenario.
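Concretely, that means invoking Garden like so (a usage sketch; --force-refresh is the flag named above for bypassing cached provider outputs):

```
# Force Garden to re-resolve provider-level Terraform outputs
# instead of serving them from cache
garden deploy --force-refresh
```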

worldofgeese commented 1 year ago

@edvald in that case it looks like I may have run into a bug converting the provider to an action:

kind: Deploy
type: terraform
name: my-terraform
spec: 
  autoApply: true
  variables:
    project_id: devrel-348008
✖ deploy.my-terraform  → 
Failed processing resolve Deploy type=terraform name=my-terraform. Here is the output:
────────────────────────────────────────────────────────────────────────────────
Could not find type definition for Deploy type=terraform name=my-terraform.
This is a bug. Please report it!
──────────────────────────────────────────────────────────────────────────────── (in 0 sec)
ℹ deploy.my-mongodb    → Already deployed
✖ 1 deploy action(s) failed!
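For what it's worth, "Could not find type definition" can surface when the action type isn't registered at all. A hedged guess at a fix, assuming a standard Bonsai project config: the terraform provider may need to be listed under providers at the project level for the terraform Deploy type to resolve (the project name below is hypothetical):

```yaml
apiVersion: garden.io/v1
kind: Project
name: my-project  # hypothetical project name
providers:
  - name: terraform   # registers the `terraform` action types
  - name: kubernetes
```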

worldofgeese commented 1 year ago

I can open this as a new Bonsai bug if you confirm it is in fact a bug and not an error on my part.