awslabs / idf-modules

Industry Data Framework (IDF) IAC modules repository
Apache License 2.0

[BUG] two fsx instances with two lustre-integration modules fails #292

Open swirkert1 opened 1 month ago

swirkert1 commented 1 month ago

Describe the bug

Create an EKS cluster and two FSx volumes.

Now use the path: git::https://github.com/awslabs/idf-modules.git//modules/integration/fsx-lustre-on-eks?ref=release/1.11.0&depth=1

twice to connect the FSx volumes to the cluster. This fails: one time because the EksHandlerRoleArn (which is not documented but required) already exists, and the second time because the set_permissions_job already exists.

To Reproduce

  1. Create an EKS cluster
  2. Create an FSx volume
  3. Create another FSx volume
  4. Add two fsx-lustre-on-eks modules to connect the FSx volumes to the EKS cluster

Expected behavior

The resources are created.

Screenshots

addf-llpdrsw-integration-lustre-on-eks-1a | 10/13 | 2:06:49 PM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | addf-llpdrsw-integration-lustr-eks-cluster/manifest-SetPermissionsJob/Resource/Default (addfllpdrswintegrationlustreksclustermanifestSetPermissionsJob14F57FB1) Received response status [FAILED] from custom resource. Message returned: Error: b'Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": jobs.batch "set-permissions-job" already exists\n'

swirkert1 commented 1 month ago

I think a workaround is to let them run sequentially in different groups as the bug seems to be connected to running them in parallel.

swirkert1 commented 1 month ago

Unfortunately not. Although the state says "SUCCEEDED", the PVC of the previous integration module was deleted. Also, when I tried this with a third FSx volume and integration module, it failed again. It all seems kind of random.

swirkert1 commented 1 month ago

For the set-permissions-job bug, I think we need to give the job a unique name here: "metadata": {"name": "set-permissions-job", "namespace": eks_namespace}. As for the rest, I don't know.
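A minimal sketch of what a unique name could look like, assuming the module name is available where the manifest is built. The helper names unique_job_name and job_manifest_metadata and the module name "lustre-on-eks-1b" are hypothetical and not taken from the repository; only "set-permissions-job" and eks_namespace come from the snippet above. The name is kept to 63 characters as a conservative limit, since Kubernetes also uses the job name in a label value.

```python
import hashlib
import re


def unique_job_name(module_name: str, base: str = "set-permissions-job") -> str:
    """Return a lowercase, hyphenated job name that differs per module instance.

    Hypothetical helper: not part of idf-modules.
    """
    # Keep only lowercase letters, digits and hyphens so the name is a valid DNS label.
    slug = re.sub(r"[^a-z0-9-]+", "-", module_name.lower()).strip("-")
    # A short hash keeps names distinct even when the slug has to be truncated.
    digest = hashlib.sha256(module_name.encode("utf-8")).hexdigest()[:8]
    # Leave room for the base name, the hash and the two joining hyphens.
    room = 63 - len(base) - len(digest) - 2
    slug = slug[:room].strip("-")
    parts = [base, slug, digest] if slug else [base, digest]
    return "-".join(parts)


def job_manifest_metadata(module_name: str, eks_namespace: str) -> dict:
    """Build the metadata block with a per-module job name (assumed shape)."""
    return {"name": unique_job_name(module_name), "namespace": eks_namespace}


if __name__ == "__main__":
    # Two integration modules in the same namespace no longer share a job name.
    print(job_manifest_metadata("lustre-on-eks-1a", "my-namespace"))
    print(job_manifest_metadata("lustre-on-eks-1b", "my-namespace"))
```

With names derived this way, two integration modules targeting the same namespace would each create their own job instead of colliding on "set-permissions-job".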

swirkert1 commented 1 month ago

I gave each permission job a unique name and removed the PV's and PVC's dependency on the namespace. Now it works after the second make deploy.