aws-ia / terraform-aws-eks-blueprints-addons

Terraform module which provisions addons on Amazon EKS clusters
https://aws-ia.github.io/terraform-aws-eks-blueprints-addons/main/
Apache License 2.0

Ordering issue with AWS Load Balancer Controller 2.5.1+ #233

Closed · mleklund closed this issue 1 year ago

mleklund commented 1 year ago

Description

There is an ordering issue with AWS Load Balancer Controller 2.5.1+ when enableServiceMutatorWebhook is not set to false: any resource that creates a Kubernetes Service fails until the webhook is running. There also appears to be a circular dependency involving cert-manager if the load balancer controller is installed first, since the controller itself creates a Service. I mitigated this in my install by using the addons module to install only cert-manager, then using it again to install only the ALBC, and then proceeding with the full blueprints addons, as sketched below.
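For reference, a minimal sketch of that staged install, using three instances of the addons module chained with depends_on (the module labels are illustrative, not from my actual config):

module "cert_manager_addon" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn

  # Stage 1: cert-manager only, before the LBC webhook exists
  enable_cert_manager = true
}

module "lbc_addon" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn

  # Stage 2: the load balancer controller, after cert-manager
  enable_aws_load_balancer_controller = true

  depends_on = [module.cert_manager_addon]
}

module "blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn

  # Stage 3: everything else, once the webhook is serving
  helm_releases = {
    # ... remaining releases ...
  }

  depends_on = [module.lbc_addon]
}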

Versions

Reproduction Code [Required]

module "blueprints_addons" {
  count   = var.create && var.create_addons ? 1 : 0
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn

  enable_aws_load_balancer_controller = true

  enable_cert_manager = true

  # this does not really matter, it just needs to create a service
  helm_releases = {
    victoria-metrics-k8s-stack = {
      description      = "Victoriametrics K8s Stack Helm Chart"
      name             = "k8s"
      chart            = "victoria-metrics-k8s-stack"
      repository       = "https://victoriametrics.github.io/helm-charts/"
      version          = var.victoriameterics_chart_version
      namespace        = "monitoring"
      create_namespace = true
    }
  }
}

Steps to reproduce the behavior:

terraform apply

Expected behaviour

I expected a clean run based on the plan.

Actual behaviour

Terraform errors out, but the errors clear on a re-run.

Terminal Output Screenshot(s)

Example of what happened with VictoriaMetrics:

│ Error: 9 errors occurred:
│   * Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│   * Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│   * Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│   * Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│   * Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│   * Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1-ingress?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│   * Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1-ingress?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│   * Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1-ingress?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│   * Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1-ingress?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│
│

Example of it happening with just cert-manager and the ALBC:

╷
│ Error: 2 errors occurred:
│   * Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│   * Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│
│
│
│   with module.blueprints_addons[0].module.cert_manager.helm_release.this[0],
│   on .terraform/modules/blueprints_addons.cert_manager/main.tf line 9, in resource "helm_release" "this":
│    9: resource "helm_release" "this" {
askulkarni2 commented 1 year ago

I just ran into this as well. This is from our docs:

In versions 2.5 and newer, the AWS Load Balancer Controller becomes the default controller for Kubernetes service resources with the type: LoadBalancer and makes an AWS Network Load Balancer (NLB) for each service. It does this by making a mutating webhook for services, which sets the spec.loadBalancerClass field to service.k8s.aws/nlb for new services of type: LoadBalancer. You can turn off this feature and revert to using the legacy Cloud Provider as the default controller, by setting the helm chart value enableServiceMutatorWebhook to false. The cluster won't provision new Classic Load Balancers for your services unless you turn off this feature. Existing Classic Load Balancers will continue to work.
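Applying that guidance through this module, here is a minimal sketch that passes the chart value to the controller's Helm release. This assumes the module's v1.x aws_load_balancer_controller input forwards set entries to the underlying helm_release, per the documented addon-config pattern:

module "blueprints_addons" {
  source  = "aws-ia/eks-blueprints-addons/aws"
  version = "~> 1.0"

  cluster_name      = module.eks.cluster_name
  cluster_endpoint  = module.eks.cluster_endpoint
  cluster_version   = module.eks.cluster_version
  oidc_provider_arn = module.eks.oidc_provider_arn

  enable_aws_load_balancer_controller = true
  aws_load_balancer_controller = {
    set = [
      {
        # Stop the chart's mutating webhook from intercepting Service creation
        name  = "enableServiceMutatorWebhook"
        value = "false"
      }
    ]
  }
}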

askulkarni2 commented 1 year ago

We do not really have a way to establish an install order for addons. As a result, addons that create Services may time out waiting for the webhook to become available. Users can safely turn off the webhook if none of their software uses Services of type: LoadBalancer. If they do use it, they should deploy the LBC add-on first, for example with a targeted apply as shown below.
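On a fresh cluster, one way to get that ordering without restructuring modules is a two-step apply that targets the LBC release first. A sketch, where the resource address assumes the module layout from the reproduction above and the addons module's internal aws_load_balancer_controller submodule name:

# Step 1: install only the load balancer controller so its webhook comes up
terraform apply -target='module.blueprints_addons[0].module.aws_load_balancer_controller'
# Step 2: apply everything else once the webhook endpoints exist
terraform apply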