aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws_eks: Cluster creation with AlbControllerOptions is running into error #22005

Closed mrlikl closed 8 months ago

mrlikl commented 2 years ago

Describe the bug

While creating an EKS cluster with eks.AlbControllerOptions, the deployment fails while creating the custom resource Custom::AWSCDK-EKS-HelmChart:

"Received response status [FAILED] from custom resource. Message returned: Error: b'Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress' "

Expected Behavior

Creation of the custom resource Custom::AWSCDK-EKS-HelmChart to be successful.

Current Behavior

Custom::AWSCDK-EKS-HelmChart fails with the error "Received response status [FAILED] from custom resource. Message returned: Error: b'Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress' "

Reproduction Steps

cluster = eks.Cluster(
    scope=self,
    id=construct_id,
    tags={"env": "production"},
    alb_controller=eks.AlbControllerOptions(version=eks.AlbControllerVersion.V2_4_1),
    version=eks.KubernetesVersion.V1_21,
    cluster_logging=[
        eks.ClusterLoggingTypes.API,
        eks.ClusterLoggingTypes.AUTHENTICATOR,
        eks.ClusterLoggingTypes.SCHEDULER,
    ],
    endpoint_access=eks.EndpointAccess.PUBLIC,
    place_cluster_handler_in_vpc=True,
    cluster_name="basking-k8s",
    output_masters_role_arn=True,
    output_cluster_name=True,
    default_capacity=0,
    kubectl_environment={"MINIMUM_IP_TARGET": "100", "WARM_IP_TARGET": "100"},
)

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.40.0

Framework Version

No response

Node.js Version

16.17.0

OS

macos 12.5.1

Language

Python

Language Version

3.10.6

Other information

No response

pahud commented 1 year ago

related to https://github.com/aws/aws-cdk/discussions/19705

pahud commented 1 year ago

@mrlikl I was able to deploy it with cdk 2.46.0, kubernetes 1.21 and alb controller 2.4.1. Are you still having the issue?

mrlikl commented 1 year ago

I am getting the same error when default_capacity=0; the code in the description will reproduce the error now.

pahud commented 1 year ago

@mrlikl I am running the following code to reproduce this error. Will let you know when the deployment completes.

import { KubectlV23Layer } from '@aws-cdk/lambda-layer-kubectl-v23';
import {
  App, Stack,
  aws_eks as eks,
  aws_ec2 as ec2,
} from 'aws-cdk-lib';

const devEnv = {
  account: process.env.CDK_DEFAULT_ACCOUNT,
  region: process.env.CDK_DEFAULT_REGION,
};

const app = new App();

const stack = new Stack(app, 'triage-dev5', { env: devEnv });

new eks.Cluster(stack, 'Cluster', {
  vpc: ec2.Vpc.fromLookup(stack, 'Vpc', { isDefault: true }),
  albController: {
    version: eks.AlbControllerVersion.V2_4_1,
  },
  version: eks.KubernetesVersion.V1_23,
  kubectlLayer: new KubectlV23Layer(stack, 'LayerVersion'),
  clusterLogging: [
    eks.ClusterLoggingTypes.API,
    eks.ClusterLoggingTypes.AUTHENTICATOR,
    eks.ClusterLoggingTypes.SCHEDULER,
  ],
  endpointAccess: eks.EndpointAccess.PUBLIC,
  placeClusterHandlerInVpc: true,
  clusterName: 'baking-k8s',
  outputClusterName: true,
  outputMastersRoleArn: true,
  defaultCapacity: 0,
  kubectlEnvironment: { MINIMUM_IP_TARGET: '100', WARM_IP_TARGET: '100' },
});
pahud commented 1 year ago

I am getting an error with the CDK code provided above:


Lambda Log:

[ERROR] Exception: b'Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress\n'
Traceback (most recent call last):
  File "/var/task/index.py", line 17, in handler
    return helm_handler(event, context)
  File "/var/task/helm/__init__.py", line 88, in helm_handler
    helm('upgrade', release, chart, repository, values_file, namespace, version, wait, timeout, create_namespace)
  File "/var/task/helm/__init__.py", line 186, in helm
    raise Exception(output)

I am making this a P2 now and will investigate a bit more next week. If you have any possible solution, please let me know. Any pull request would be highly appreciated as well.

dimmyshu commented 1 year ago

I think this issue should be prioritized; a lot of other folks are running into trouble when developing in a sandbox.

I have seen a lot of issues in this repo where default capacity is set to 0 without it being recognized as a bug. It really impacts development productivity, since the CloudFormation template takes hours to roll back and clean up the resources.

m17kea commented 1 year ago

I have the same issue:

The error from CloudFormation is:

Received response status [FAILED] from custom resource. Message returned: Error: b'Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress\n' Logs: /aws/lambda/TestingStage-Release-awscdkawseksK-Handler886CB40B-KG9T55a3ZdwW
    at invokeUserFunction (/var/task/framework.js:2:6)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async onEvent (/var/task/framework.js:1:365)
    at async Runtime.handler (/var/task/cfn-response.js:1:1543)
(RequestId: 16bb84de-c183-4e1c-9e4e-cc7ec0efc5b8)
smislam commented 1 year ago

Hey @pahud. Thank you so much for looking into this.
Were you able to make any progress? I've been struggling with this for a while. Here is my latest stack info:

    "aws-cdk-lib": "2.63.0",
    KubernetesVersion.V1_26
    AlbControllerVersion.V2_5_1
YikaiHu commented 1 year ago

Hi @pahud, I am still facing the same issue.

I deployed the CDK stack in the cn-north-1 region.

YikaiHu commented 1 year ago

Hi @pahud, I think I found the root cause in my scenario. It may be caused by the controller image not being pullable from the cn-north-1 region.

Please check:

Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": failed to resolve reference "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": pulling from host 602401143452.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests v2.4.1]: 401 Unauthorized


k logs aws-load-balancer-controller-75c785bc8c-72zpg -n kube-system

Error from server (BadRequest): container "aws-load-balancer-controller" in pod "aws-load-balancer-controller-75c785bc8c-72zpg" is waiting to start: trying and failing to pull image

kubectl describe pod aws-load-balancer-controller-75c785bc8c-72zpg -n kube-system

Name:                 aws-load-balancer-controller-75c785bc8c-72zpg
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      aws-load-balancer-controller
Node:                 ip-10-0-3-136.cn-north-1.compute.internal/10.0.3.136
Start Time:           Mon, 17 Jul 2023 16:30:59 +0800
Labels:               app.kubernetes.io/instance=aws-load-balancer-controller
                      app.kubernetes.io/name=aws-load-balancer-controller
                      pod-template-hash=75c785bc8c
Annotations:          kubernetes.io/psp: eks.privileged
                      prometheus.io/port: 8080
                      prometheus.io/scrape: true
Status:               Pending
IP:                   10.0.3.160
IPs:
  IP:           10.0.3.160
Controlled By:  ReplicaSet/aws-load-balancer-controller-75c785bc8c
Containers:
  aws-load-balancer-controller:
    Container ID:  
    Image:         602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
    Image ID:      
    Ports:         9443/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /controller
    Args:
      --cluster-name=Workshop-Cluster
      --ingress-class=alb
      --aws-region=cn-north-1
      --aws-vpc-id=vpc-0e4a9201452c76b0e
    State:          Waiting
      Reason:       ErrImagePull
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Environment:
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_DEFAULT_REGION:           cn-north-1
      AWS_REGION:                   cn-north-1
      AWS_ROLE_ARN:                 arn:aws-cn:iam::743271379588:role/clo-workshop-07-CLWorkshopEC2AndEKSeksClusterStack-1XO6CGEC91JGY
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jct6t (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aws-load-balancer-tls
    Optional:    false
  kube-api-access-jct6t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  16m                 default-scheduler  Successfully assigned kube-system/aws-load-balancer-controller-75c785bc8c-72zpg to ip-10-0-3-136.cn-north-1.compute.internal
  Normal   Pulling    14m (x4 over 16m)   kubelet            Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1"
  Warning  Failed     14m (x4 over 16m)   kubelet            Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": failed to resolve reference "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": pulling from host 602401143452.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests v2.4.1]: 401 Unauthorized
  Warning  Failed     14m (x4 over 16m)   kubelet            Error: ErrImagePull
  Warning  Failed     14m (x6 over 16m)   kubelet            Error: ImagePullBackOff
  Normal   BackOff    87s (x62 over 16m)  kubelet            Back-off pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1"
YikaiHu commented 1 year ago

Seems to be related to https://github.com/aws/aws-cdk/issues/22520

YikaiHu commented 1 year ago

013241004608.dkr.ecr.us-gov-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
151742754352.dkr.ecr.us-gov-east-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
558608220178.dkr.ecr.me-south-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
590381155156.dkr.ecr.eu-south-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-northeast-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-northeast-3.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-south-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-southeast-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-west-3.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.sa-east-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.us-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
800184023465.dkr.ecr.ap-east-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
877085696533.dkr.ecr.af-south-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
918309763551.dkr.ecr.cn-north-1.amazonaws.com.cn/amazon/aws-load-balancer-controller:v2.4.1
961992271922.dkr.ecr.cn-northwest-1.amazonaws.com.cn/amazon/aws-load-balancer-controller:v2.4.1

I found a solution in https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1694: you can manually replace the ECR image URL in the CloudFormation template.

https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases?page=2
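If the repository parameter of eks.AlbControllerOptions is available in your CDK version, a sketch of making the same override in code (rather than editing the template by hand) could look like this; the cn-north-1 repository address is taken from the list above:

# Sketch: point the controller at the cn-north-1 mirror instead of the default
# us-west-2 repository, which cannot be pulled from the China regions.
alb_controller=eks.AlbControllerOptions(
    version=eks.AlbControllerVersion.V2_4_1,
    repository="918309763551.dkr.ecr.cn-north-1.amazonaws.com.cn/amazon/aws-load-balancer-controller",
),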

mrlikl commented 11 months ago

The issue is that when the cluster is deployed with default_capacity set to 0, there are no nodes attached to it. While installing the aws-load-balancer-controller via Helm, the release status goes into pending-install and the pods stay pending because there are no nodes to schedule them on. The handler Lambda eventually times out after 15 minutes, and the event handler Lambda retries the installation. That retry executes helm upgrade, which errors with Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress.

While this is expected since there are no nodes, I tested adding a check to the kubectl-handler that verifies whether the node count is 0 when the error is thrown, and was able to handle the error that way. However, I am not sure if this is the right approach to solve this issue.

# Proposed check in the kubectl-handler: only swallow the Helm error when the
# cluster has no nodes at all (e.g. default_capacity=0).
if b'another operation (install/upgrade/rollback) is in progress' in output:
    cmd_to_run = ["kubectl", "get", "nodes"]
    cmd_to_run.extend(["--kubeconfig", kubeconfig])
    get_nodes_output = subprocess.check_output(cmd_to_run, stderr=subprocess.STDOUT, cwd=outdir)
    if b'No resources found' in get_nodes_output:
        return
Karatakos commented 11 months ago

@pahud out of interest, is this still on the backlog or has it been deprioritized? Calling addNodegroupCapacity on the cluster doesn't work with defaultCapacity: 0, so it's not possible to use launch templates to control capacity via CDK -- as far as I've tested.
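For context, a rough sketch of the pattern being described (in the Python API from the original report; the launch template ID is a placeholder and this is not a verified workaround):

# Sketch only: cluster with no default capacity plus an explicitly added
# nodegroup driven by a launch template. "lt-0123456789abcdef0" is hypothetical.
cluster = eks.Cluster(
    self, "eks-cluster",
    version=eks.KubernetesVersion.V1_27,
    default_capacity=0,
    alb_controller=eks.AlbControllerOptions(version=eks.AlbControllerVersion.V2_6_2),
)
nodegroup = cluster.add_nodegroup_capacity(
    "custom-ng",
    launch_template_spec=eks.LaunchTemplateSpec(id="lt-0123456789abcdef0", version="1"),
    min_size=1,
    max_size=3,
)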

smislam commented 11 months ago

I have been stuck creating a FargateCluster with this issue since 06/22: https://github.com/aws/aws-cdk/issues/22005#issuecomment-1603053510. Did the defaultCapacity approach work for you? It is not an option for Fargate.

I just tried with the latest version of CDK today and am still having this issue. Is it possible to escalate this issue, please?

PavanMudigondaTR commented 9 months ago

Could someone help me? I have the same issue. Here is my repo: https://github.com/PavanMudigondaTR/install-karpenter-with-cdk

pahud commented 9 months ago

It's been a while; I am now testing the following code with the latest CDK:

export class EksStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props)

    // use my default VPC
    const vpc = getDefaultVpc(this);
    new eks.Cluster(this, 'Cluster', {
      vpc,
      albController: {
        version: eks.AlbControllerVersion.V2_6_2,
      },
      version: eks.KubernetesVersion.V1_27,
      kubectlLayer: new KubectlLayer(this, 'LayerVersion'),
      clusterLogging: [
        eks.ClusterLoggingTypes.API,
        eks.ClusterLoggingTypes.AUTHENTICATOR,
        eks.ClusterLoggingTypes.SCHEDULER,
      ],
      endpointAccess: eks.EndpointAccess.PUBLIC,
      placeClusterHandlerInVpc: true,
      clusterName: 'baking-k8s',
      outputClusterName: true,
      outputMastersRoleArn: true,
      defaultCapacity: 0,
      kubectlEnvironment: { MINIMUM_IP_TARGET: '100', WARM_IP_TARGET: '100' },
    });
  }
}

For the issues from @mrlikl, @Karatakos, @smislam and @PavanMudigondaTR: I am not sure whether your issues are related to this one, which seems to be specific to AlbController. If yours does not involve AlbController, please open a new issue and link to this one.

@YikaiHu EKS in China is a little bit more complicated, please open a separate issue for your case in China and link to this one. Thanks.

pahud commented 9 months ago

Unfortunately, I couldn't deploy it with the following code on my first attempt.

I am making it a p1 for now and will simplify the code to hopefully figure out the root cause.

export class EksStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props)

    // use my default VPC
    const vpc = getDefaultVpc(this);
    new eks.Cluster(this, 'Cluster', {
      vpc,
      albController: {
        version: eks.AlbControllerVersion.V2_6_2,
      },
      version: eks.KubernetesVersion.V1_27,
      kubectlLayer: new KubectlLayer(this, 'LayerVersion'),
      clusterLogging: [
        eks.ClusterLoggingTypes.API,
        eks.ClusterLoggingTypes.AUTHENTICATOR,
        eks.ClusterLoggingTypes.SCHEDULER,
      ],
      endpointAccess: eks.EndpointAccess.PUBLIC,
      placeClusterHandlerInVpc: true,
      clusterName: 'baking-k8s',
      outputClusterName: true,
      outputMastersRoleArn: true,
      defaultCapacity: 0,
      kubectlEnvironment: { MINIMUM_IP_TARGET: '100', WARM_IP_TARGET: '100' },
    });
  }
}
mrlikl commented 9 months ago

Hello @pahud, as mentioned in my previous comment, the issue occurs when the default capacity is set to 0. Please check this comment: https://github.com/aws/aws-cdk/issues/22005#issuecomment-1742171115

pahud commented 9 months ago

Thanks @mrlikl

OK, it looks like the deployment of albController depends on the availability of the nodegroup. This means:

  1. albController with defaultCapacity: 0 would fail.
  2. albController with defaultCapacity or nodegroup with at least 1 available node would succeed.

In this case, we should avoid using albController with no capacity or nodegroup in the initial deployment. I doubt we can check node availability from CDK, but at the very least we should note this in the albController doc string.

Also, there is a chance the handler Lambda could time out before the nodes are ready, so the addDependency below might be required:

cluster.albController?.node.addDependency(cluster.defaultNodegroup!);
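For the Python API used in the original report, a minimal sketch of the same dependency (assuming default_capacity >= 1 so that default_nodegroup exists; untested here) would be:

# Sketch: Python equivalent of the TypeScript line above; only meaningful when
# the cluster actually has a default nodegroup (default_capacity >= 1).
if cluster.alb_controller and cluster.default_nodegroup:
    cluster.alb_controller.node.add_dependency(cluster.default_nodegroup)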

As for the EKS Fargate cluster, I am not sure whether the ALB controller is compatible with it; we definitely need more tests and feedback on that. Please open a separate issue for an EKS Fargate cluster with the ALB controller if it does show this issue, because it might need a different workaround.

pahud commented 9 months ago

OK I can confirm this deploys and works for me.


export class EksStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props)

    // use my default VPC
    const vpc = getDefaultVpc(this);
    const cluster = new eks.Cluster(this, 'Cluster', {
      vpc,
      albController: {
        version: eks.AlbControllerVersion.V2_6_2,
      },
      mastersRole: new iam.Role(this, 'MasterRole', {
          assumedBy: new iam.AccountRootPrincipal(),
      }),
      version: eks.KubernetesVersion.V1_27, 
      kubectlLayer: new KubectlLayer(this, 'LayerVersion'),
      defaultCapacity: 2,
    });

    cluster.albController?.node.addDependency(cluster.defaultNodegroup!);
  }
}

And this works as well for FargateCluster

   const cluster = new eks.FargateCluster(this, 'Cluster', {
      vpc,
      albController: {
        version: eks.AlbControllerVersion.V2_6_2,
      },
      mastersRole: new iam.Role(this, 'MasterRole', {
          assumedBy: new iam.AccountRootPrincipal(),
      }),
      version: eks.KubernetesVersion.V1_27, 
      kubectlLayer: new KubectlLayer(this, 'LayerVersion'),
    });

I am making this a p2 as this error can be avoided.

github-actions[bot] commented 9 months ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

PavanMudigondaTR commented 9 months ago

The issue still persists. Please don't close the ticket, bot.

smislam commented 9 months ago

Hey @pahud, thank you so much for looking into this. I concur that the issue still persists. Here is the error:

Node: v20.10.0
Npm: 10.2.5
"aws-cdk-lib": "^2.115.0"
KubernetesVersion.V1_28
AlbControllerVersion.V2_6_2

EksClusterStack | 26/28 | 9:06:12 AM | CREATE_FAILED | Custom::AWSCDK-EKS-HelmChart | EksClusterStackEksCluster922FB9AE-AlbController/Resource/Resource/Default (EksClusterStackEksCluster922FB9AEAlbController1636C356) Received response status [FAILED] from custom resource. Message returned: Error: b'Release "aws-load-balancer-controller" does not exist. Installing it now.\nError: looks like "https://aws.github.io/eks-charts" is not a valid chart repository or cannot be reached: Get "https://aws.github.io/eks-charts/index.yaml": dial tcp 185.199.110.153:443: connect: connection timed out\n'

When I add your suggestion cluster.albController?.node.addDependency(cluster.defaultNodegroup!);, I get the following error:

$eks-cluster\node_modules\constructs\src\dependency.ts:91 const ret = (instance as any)[DEPENDABLE_SYMBOL]; ^ TypeError: Cannot read properties of undefined (reading 'Symbol(@aws-cdk/core.DependableTrait)')

smislam commented 9 months ago

@pahud, @mrlikl et al.,

I was able to resolve the issue. What I found is that, to install the controller, the code fetches the Helm chart from an external repository (https://aws.github.io/eks-charts). To access those files, you must have egress enabled. In my case, I was creating my cluster in a private subnet without egress. You need to create your cluster in a subnet with egress: SubnetType.PRIVATE_WITH_EGRESS.

Please update your cluster and VPC configuration to see if this resolves it for you. My stack completed successfully.
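A minimal sketch of that setup in the Python API (assuming a new VPC is acceptable; the names and versions are illustrative, not taken from my stack):

# Sketch: a VPC whose private subnets have egress via a NAT gateway, and a
# cluster placed in those subnets so the external Helm chart repository is reachable.
vpc = ec2.Vpc(
    self, "EksVpc",
    nat_gateways=1,
    subnet_configuration=[
        ec2.SubnetConfiguration(name="public", subnet_type=ec2.SubnetType.PUBLIC),
        ec2.SubnetConfiguration(name="private", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS),
    ],
)
cluster = eks.Cluster(
    self, "eks-cluster",
    version=eks.KubernetesVersion.V1_28,
    vpc=vpc,
    vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS)],
    alb_controller=eks.AlbControllerOptions(version=eks.AlbControllerVersion.V2_6_2),
)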

pahud commented 8 months ago

Thank you @smislam for the insights.

github-actions[bot] commented 8 months ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

andreprawira commented 8 months ago

@smislam SubnetType.PRIVATE_WITH_EGRESS causes RuntimeError: There are no 'Private' subnet groups in this VPC. Available types: Isolated,Deprecated_Isolated,Public

@pahud I'm still getting the same error with my Python code even with default_capacity set. Do you know what I am missing?

        vpc = ec2.Vpc.from_lookup(self, "VPCLookup", vpc_id=props.vpc_id)

        # provisioning a cluster
        cluster = eks.Cluster(
            self,
            "eks-cluster",
            version=eks.KubernetesVersion.V1_28,
            kubectl_layer=lambda_layer_kubectl_v28.KubectlV28Layer(self, "kubectl-layer"),
            cluster_name=f"{props.customer}-eks-cluster",
            default_capacity_instance=ec2.InstanceType("t3.medium"),
            default_capacity=2,
            alb_controller=eks.AlbControllerOptions(version=eks.AlbControllerVersion.V2_6_2),
            vpc=vpc,
            vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_ISOLATED)],
            masters_role=iam.Role(self, "masters-role", assumed_by=iam.AccountRootPrincipal()),
        )
pahud commented 8 months ago

@andreprawira

For some reason, it will fail if the vpc_subnets selection is ec2.SubnetType.PRIVATE_ISOLATED, as described in https://github.com/aws/aws-cdk/issues/22005#issuecomment-1866886455.

RuntimeError: There are no 'Private' subnet groups in this VPC. Available types: Isolated,Deprecated_Isolated,Public

This means CDK can't find any "private with egress" subnets in your VPC. Can you make sure you do have private subnets with egress (typically via a NAT gateway)?

smislam commented 8 months ago

@andreprawira, it looks like you are using a VPC (already created in another stack) that doesn't have a private subnet with egress, and that is why you are getting that error.

vpc = ec2.Vpc.from_lookup(self, "VPCLookup", vpc_id=props.vpc_id)

You will not be able to use CDK to create your stack with such a configuration, for the reason I mentioned earlier in my comment. So either update your VPC to add a new private subnet with egress, or create an entirely new VPC with SubnetType.PRIVATE_WITH_EGRESS. This will require a NAT (either a gateway or an instance), as @pahud mentioned.

andreprawira commented 8 months ago

@pahud @smislam we have a product in our service catalog that deploys a VPC and IGW to all of our accounts, and within that product we don't use a NAT GW; instead we use a TGW in our network account (meaning all traffic goes in and out through the network account, even for the VPCs in the other accounts). That is why I did a VPC from_lookup, since the VPC has already been created.

That being said, is there another way for me to use the alb_controller with the VPC, TGW, and IGW already set up as they are? By the way, I hope I am not misunderstanding you when you say I can't use ec2.SubnetType.PRIVATE_ISOLATED, because if I look at my cluster, the subnets it uses are all private subnets (the route tables for those subnets route traffic to the TGW in the network account, and they do not route traffic to the IGW).

Furthermore, using vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS)] causes RuntimeError: There are no 'Private' subnet groups in this VPC. Available types: Isolated,Deprecated_Isolated,Public. To answer your question @pahud: I could be wrong, but I don't think I have private subnets with egress, since there is no NAT GW; I do have a TGW though, so shouldn't that work as well?

How do I use ec2.SubnetType.PRIVATE_WITH_EGRESS with a TGW instead of a NAT GW?

smislam commented 8 months ago

@andreprawira, your setup should work. There is a bug in older versions of CDK that has an issue with Transit Gateway; I ran into it a while back. Any chance you are using an older version of CDK?
Can you please try with the latest version?

andreprawira commented 8 months ago

@smislam I just updated my CDK from version 2.115.0 to 2.117.0, and below is my code:

vpc = ec2.Vpc.from_lookup(self, "VPCLookup", vpc_id=props.vpc_id)

# provisioning a cluster
cluster = eks.Cluster(
    self,
    "eks-cluster",
    version=eks.KubernetesVersion.V1_28,
    kubectl_layer=lambda_layer_kubectl_v28.KubectlV28Layer(self, "kubectl-layer"),
    # place_cluster_handler_in_vpc=True,
    cluster_name=f"{props.customer}-eks-cluster",
    default_capacity_instance=ec2.InstanceType("t3.medium"),
    default_capacity=2,
    alb_controller=eks.AlbControllerOptions(version=eks.AlbControllerVersion.V2_6_2),
    vpc=vpc,
    vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS)],
    # masters_role=iam.Role(self, "masters-role", assumed_by=iam.AccountRootPrincipal()),
)

but I am still getting the same RuntimeError: There are no 'Private' subnet groups in this VPC. Available types: Isolated,Deprecated_Isolated,Public

smislam commented 8 months ago

That is strange; I am not sure what is happening, @andreprawira. We will need @pahud and the AWS CDK team to look deeper into this. Happy coding and a happy New Year!

pahud commented 8 months ago

@andreprawira

I think you can still use private isolated subnets for the vpc_subnets selection, as below:

vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_ISOLATED)],

But if you look at the synthesized template, there is a chance that:

  1. Your Lambda function for the kubectl handler is associated with the isolated subnets, which means:
     a. your kubectl Lambda handler may not be able to reach the AWS EKS API endpoint over the public internet unless the isolated subnets have the relevant VPC endpoints enabled;
     b. your kubectl Lambda handler may not be able to reach the cluster endpoint if it is public-only.
  2. Your nodegroup may be deployed in the isolated subnets and may not be able to pull images from ECR Public unless the relevant VPC endpoints or a proxy configuration are in place.

Technically, it is possible to deploy an EKS cluster into isolated subnets, but there are a lot of requirements to consider. We don't have a working sample for now, and we will need more feedback from the community before we know how to do that and can add it to the documentation.

We have a p1 tracking issue for EKS cluster support in isolated subnets at https://github.com/aws/aws-cdk/issues/12171 - we will need to close that first, but it should not be relevant to the ALB controller.
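As a rough illustration only (not a complete recipe, and the set of endpoints depends on the workload), the kind of VPC endpoints meant above would be added like this in Python:

# Sketch: gateway/interface endpoints commonly needed so isolated subnets can
# reach S3, ECR and STS without internet egress. Illustrative, not exhaustive.
vpc.add_gateway_endpoint("S3", service=ec2.GatewayVpcEndpointAwsService.S3)
vpc.add_interface_endpoint("EcrApi", service=ec2.InterfaceVpcEndpointAwsService.ECR)
vpc.add_interface_endpoint("EcrDocker", service=ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER)
vpc.add_interface_endpoint("Sts", service=ec2.InterfaceVpcEndpointAwsService.STS)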

github-actions[bot] commented 8 months ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.