fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

CrashLoopBackOff on AWS EKS Auto Mode #5185

Open rlanore opened 1 month ago

rlanore commented 1 month ago

Describe the bug

Hi, I'm having difficulty running Flux on an EKS cluster with Auto Mode. All the Flux pods crash, but I can't debug them because Auto Mode gives no access to the worker nodes (no SSH or SSM).

Is it supported?

Steps to reproduce

I tested with these Terraform resources:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.31.6"

  authentication_mode = "API_AND_CONFIG_MAP"
  enable_cluster_creator_admin_permissions = true

  tags = { eks = var.cluster_name }

  cluster_name    = var.cluster_name
  cluster_version = var.cluster_version

  cluster_addons = {
    coredns = {
      preserve    = true
      most_recent = true
      timeouts = {
        create = "25m"
        delete = "10m"
      }
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent    = true
      before_compute = true
      configuration_values = jsonencode({
        env = {
          "ENABLE_PREFIX_DELEGATION" : "true",
          "WARM_PREFIX_TARGET" : "1"
        }
      })

    }
  }

  create_kms_key            = false
  cluster_encryption_config = {}
  cluster_enabled_log_types = var.cluster_enabled_log_types

  cluster_security_group_name = "${var.cluster_name}-cluster-sg"
  cluster_security_group_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = 1
  }
  node_security_group_name = "${var.cluster_name}-nodes-sg"

  cluster_compute_config = {
    enabled    = true
    node_pools = ["general-purpose"]
  }

  access_entries = {
    xxxxx
  }

  vpc_id                   = data.aws_vpc.eks_vpc.id
  subnet_ids               = data.aws_subnets.dedicated_subnets.ids
  control_plane_subnet_ids = data.aws_subnets.private_subnets.ids
  node_security_group_tags = var.node_security_group_tags

}

And deployed Flux with Helm via Terraform:

resource "helm_release" "fluxcd" {
  depends_on       = [module.eks]
  chart            = "flux2"
  create_namespace = true
  namespace        = "flux-system"
  name             = "flux"
  version          = "2.14.1"
  repository       = "https://fluxcd-community.github.io/helm-charts"
}
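
For reference, the equivalent Helm CLI install of the same chart, with the same repository, version, and namespace as the Terraform resource above, should be roughly:

# Add the community Helm repository and install the flux2 chart,
# mirroring the helm_release resource above
helm repo add fluxcd-community https://fluxcd-community.github.io/helm-charts
helm install flux fluxcd-community/flux2 \
  --namespace flux-system \
  --create-namespace \
  --version 2.14.1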

Expected behavior

The Flux pods start and keep running.

Screenshots and recordings

No response

OS / Distro

AWS

Flux version

2.14.1

Flux check

Unable to run flux check

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

stefanprodan commented 1 month ago

You don't need SSH to debug Kubernetes pods; you can use kubectl describe and kubectl logs.
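
For example, something along these lines should show why the controllers are crashing (pod names will differ):

# List the Flux pods and their status
kubectl -n flux-system get pods

# Inspect events and container state for a crashing pod
kubectl -n flux-system describe pod <pod-name>

# Logs of the current and the previously crashed container
kubectl -n flux-system logs <pod-name>
kubectl -n flux-system logs <pod-name> --previous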

matheuscscp commented 1 month ago

I created a simple EKS Auto Mode cluster, installed the flux-operator through the helm CLI and created this FluxInstance:

apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  annotations:
    fluxcd.controlplane.io/reconcileArtifactEvery: 1m
  name: flux
  namespace: flux-system
spec:
  cluster:
    domain: cluster.local
    multitenant: false
    networkPolicy: true
    type: kubernetes
  components:
  - source-controller
  - kustomize-controller
  - helm-controller
  - notification-controller
  - image-reflector-controller
  - image-automation-controller
  distribution:
    registry: ghcr.io/fluxcd
    version: 2.5.x
  kustomize:
    patches:
    - patch: |-
        - op: add
          path: /spec/template/spec/containers/0/args/-
          value: --requeue-dependency=5s
      target:
        kind: Deployment
        name: (kustomize-controller|helm-controller)
  migrateResources: true
  wait: true
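
The Helm CLI install itself is not shown; a minimal sketch of it, assuming the OCI chart published by the flux-operator project and that the FluxInstance manifest above is saved locally as flux-instance.yaml (both the chart location and the file name are assumptions):

# Install flux-operator from its OCI Helm chart (chart location assumed from the flux-operator docs)
helm install flux-operator oci://ghcr.io/controlplaneio-fluxcd/charts/flux-operator \
  --namespace flux-system \
  --create-namespace

# Apply the FluxInstance manifest shown above (assumed local file name)
kubectl apply -f flux-instance.yaml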

The pods are not crashlooping for me:

k get po -o wide
NAME                                           READY   STATUS    RESTARTS   AGE   IP             NODE                  NOMINATED NODE   READINESS GATES
flux-operator-75f4f4966c-7j58t                 1/1     Running   0          19m   172.31.2.192   i-0ecd4c3ac4cec8e29   <none>           <none>
helm-controller-58b788ff5f-g57hf               1/1     Running   0          12m   172.31.2.195   i-0ecd4c3ac4cec8e29   <none>           <none>
image-automation-controller-64d4b74986-7xmkr   1/1     Running   0          12m   172.31.2.196   i-0ecd4c3ac4cec8e29   <none>           <none>
image-reflector-controller-84b688f8b5-wxxh7    1/1     Running   0          12m   172.31.2.198   i-0ecd4c3ac4cec8e29   <none>           <none>
kustomize-controller-6f44f95669-bfmbc          1/1     Running   0          12m   172.31.2.193   i-0ecd4c3ac4cec8e29   <none>           <none>
notification-controller-5947ccbf68-fc7lb       1/1     Running   0          12m   172.31.2.194   i-0ecd4c3ac4cec8e29   <none>           <none>
source-controller-ccd874bb-xhpqf               1/1     Running   0          12m   172.31.2.197   i-0ecd4c3ac4cec8e29   <none>           <none>

stefanprodan commented 1 month ago

My guess is that the issue here is with the VPC/CNI setup, which prevents the Flux controllers from reaching the Kubernetes API.
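
A quick way to check that theory is to test API connectivity from a throwaway pod on the same nodes; a rough sketch (any HTTP status, even 401/403, means the network path works, while a timeout points at the VPC/CNI setup):

# Run a one-off curl pod and hit the in-cluster API endpoint
kubectl run api-check --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sk -o /dev/null -w '%{http_code}\n' https://kubernetes.default.svc/version

# The controller logs should also show the connection errors
kubectl -n flux-system logs deploy/source-controller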