awslabs / idf-modules

Industry Data Framework (IDF) IAC modules repository
Apache License 2.0

[BUG] eks with ip whitelists, using the lustre integration module fails #290

Open swirkert1 opened 1 week ago

swirkert1 commented 1 week ago

Describe the bug

When EKS is deployed with IP whitelists, deploying the lustre integration module fails.

To Reproduce

  1. Create the eks module with IP whitelists:

path: git::https://github.com/awslabs/idf-modules.git//modules/compute/eks?ref=release/1.11.0&depth=1

  2. Create the fsx module:

path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=release/1.11.0&depth=1

  3. Create the lustre-on-eks module:

path: git::https://github.com/awslabs/idf-modules.git//modules/integration/fsx-lustre-on-eks?ref=release/2.11.0&depth=1

Expected behavior

EKS and FSx created, connected via lustre-on-eks.

Error

I guess it cannot download the manifest because it no longer has a connection to the outside?

addf-llpdrsw-integration-lustre-on-eks | 7/13 | 11:09:16 AM | CREATE_IN_PROGRESS | Custom::AWSCDK-EKS-KubernetesResource | namespace/Resource/Default (namespace177341A3) Resource creation Initiated

580 | addf-llpdrsw-integration-lustre-on-eks | 7/13 | 11:09:17 AM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | namespace/Resource/Default (namespace177341A3) Received response status [FAILED] from custom resource. Message returned: Error: Operation failed after 3 attempts: b'error: error validating "/tmp/manifest.yaml": error validating data: failed to download openapi: Get "https://2F84111B4C79BC02AE237DC7CE5AC928.yl4.eu-central-1.eks.amazonaws.com/openapi/v2?timeout=32s": dial tcp 3.64.123.127:443: i/o timeout; if you choose to ignore these errors, turn validation off with --validate=false\n'

dgraeber commented 1 week ago

Thanks for this issue. We will investigate. Preliminary inspection indicates that the lustre-on-eks module doesn't support IP whitelisting, but we will look into it.

swirkert1 commented 1 week ago

Thanks Derek,

If that is the case, what would be your recommendation? Replicate the behaviour of lustre-on-eks by issuing the necessary cluster manipulations "manually" using kubectl?

dgraeber commented 1 week ago

I think we need to investigate first before we propose a solution.

kukushking commented 6 days ago

RCA: When ips_to_whitelist_adhoc is used, the cluster API endpoint is created as PUBLIC_AND_PRIVATE: public access is limited to the whitelisted IPs, and private access is limited to traffic from within the VPC. The FSx for Lustre on EKS integration module deploys manifests via custom resources, so the custom resource Lambdas must be provisioned in private VPC subnets to be able to reach the cluster API endpoint.

@swirkert1, a fix is merged and planned for the upcoming release. Once the release is cut, please make sure to update your integration/fsx-for-lustre-on-eks manifests to the latest version and pass the VpcId and PrivateSubnetIds of your EKS cluster:

  - name: VpcId
    valueFrom:
      moduleMetadata:
        group: base
        name: networking
        key: VpcId
  - name: PrivateSubnetIds
    valueFrom:
      moduleMetadata:
        group: base
        name: networking
        key: PrivateSubnetIds
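
For orientation, here is a rough sketch of where these two parameters might sit in a full module entry. It assumes the usual name/path/parameters manifest layout; the module name `lustre-on-eks` and the `<fixed-release>` tag are placeholders rather than confirmed values, and the module's existing EKS and FSx parameters are left out.

```yaml
# Hypothetical module entry: the name and the release tag below are placeholders.
name: lustre-on-eks
path: git::https://github.com/awslabs/idf-modules.git//modules/integration/fsx-lustre-on-eks?ref=release/<fixed-release>&depth=1
parameters:
  # Existing EKS and FSx parameters stay as they are (omitted in this sketch).
  # New: VPC and private subnets of the EKS cluster, read from the
  # networking module's metadata, so the custom resource Lambdas can be
  # placed where the private cluster API endpoint is reachable.
  - name: VpcId
    valueFrom:
      moduleMetadata:
        group: base
        name: networking
        key: VpcId
  - name: PrivateSubnetIds
    valueFrom:
      moduleMetadata:
        group: base
        name: networking
        key: PrivateSubnetIds
```

The `group: base` and `name: networking` values are taken from the snippet above; if your networking module is deployed under a different group or name, those two fields would need to match it.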