aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.57k stars 3.87k forks

efs - ecs : Cannot re-mount an existing efs #26537

Open ETisREAL opened 1 year ago

ETisREAL commented 1 year ago

Describe the bug

Hi, hope to find you well. I am trying to mount an existing EFS file system to a Redis ECS task. Everything works smoothly on the first creation, but remounting the same file system fails with a puzzling error.

Expected Behavior

I should be able to remount the EFS file system. After all, what is the point of RemovalPolicy.RETAIN otherwise?

Current Behavior

This is my code:

const qmmTasksEfsSecurityGroup = new ec2.SecurityGroup(this, `${props.STAGE}qmmTasksEfsSecurityGroup`, {
            vpc: props.vpc,
            securityGroupName: `${props.STAGE}qmmTasksEfsSecurityGroup`
        })

        let qmmTasksEfs: efs.IFileSystem
        let qmmRedisEfsAccessPoint: efs.IAccessPoint
        let qmmMongoEfsAccessPoint: efs.IAccessPoint

        // if (true) {
        if (props.recreated) {

            qmmTasksEfs = new efs.FileSystem(this, `${props.STAGE}qmmTasksEfs`, {
                fileSystemName: `${props.STAGE}qmmTasksEfs`,
                vpc: props.vpc,
                removalPolicy: cdk.RemovalPolicy.RETAIN,
                securityGroup: qmmTasksEfsSecurityGroup,
                encrypted: true,
                lifecyclePolicy: efs.LifecyclePolicy.AFTER_30_DAYS,
                enableAutomaticBackups: true
            })

            new cdk.CfnOutput(this, 'QlashMainClusterEFSID', {
                exportName: 'QlashMainClusterEFSID',
                value: qmmTasksEfs.fileSystemId
            })

            qmmRedisEfsAccessPoint = new efs.AccessPoint(this, `${props.STAGE}qmmRedisAccessPoint`, {
                fileSystem: qmmTasksEfs,
                path: '/redis',
                createAcl: {
                    ownerGid: '1001',
                    ownerUid: '1001',
                    permissions: '750'
                },
                posixUser: {
                    uid: '1001',
                    gid: '1001'
                }
            })

            qmmRedisEfsAccessPoint.applyRemovalPolicy(cdk.RemovalPolicy.RETAIN)

            new cdk.CfnOutput(this, 'QlashMainClusterRedisAccessPointID', {
                exportName: 'QlashMainClusterRedisAccessPointID',
                value: qmmRedisEfsAccessPoint.accessPointId
            })

            qmmMongoEfsAccessPoint = new efs.AccessPoint(this, `${props.STAGE}qmmMongoAccessPoint`, {
                fileSystem: qmmTasksEfs,
                path: '/mongodb',
                createAcl: {
                    ownerGid: '1002',
                    ownerUid: '1002',
                    permissions: '750'
                },
                posixUser: {
                    uid: '1002',
                    gid: '1002'
                }
            })

            qmmMongoEfsAccessPoint.applyRemovalPolicy(cdk.RemovalPolicy.RETAIN)

            new cdk.CfnOutput(this, 'QlashMainClusterMongoAccessPointID', {
                exportName: 'QlashMainClusterMongoAccessPointID',
                value: qmmMongoEfsAccessPoint.accessPointId
            })

        } else {

            qmmTasksEfs = efs.FileSystem.fromFileSystemAttributes(this, `${props.STAGE}qmmTasksEfs`, {
                securityGroup: qmmTasksEfsSecurityGroup,
                fileSystemId: config.QlashMainClusterEFSID
            })

            qmmRedisEfsAccessPoint = efs.AccessPoint.fromAccessPointId(this, `${props.STAGE}qmmRedisAccessPoint`, config.QlashMainClusterRedisAccessPointID)

            qmmMongoEfsAccessPoint = efs.AccessPoint.fromAccessPointId(this, `${props.STAGE}qmmMongoAccessPoint`, config.QlashMainClusterMongoAccessPointID)
        }

        // Redis

        const qmmRedisServiceSecurityGroup = new ec2.SecurityGroup(this, `${props.STAGE}qmmRedisSecurityGroup`, {
            vpc: props.vpc,
            securityGroupName: `${props.STAGE}qmmRedisSecurityGroup`
        })

        qmmTasksEfsSecurityGroup.addIngressRule(
            ec2.Peer.securityGroupId(qmmRedisServiceSecurityGroup.securityGroupId),
            ec2.Port.tcp(2049),
            'Allow inbound traffic from qmm_redis to qmmTasksEfs'
        )

        if (props.qlashMainInstanceSecurityGroup) {
            qmmRedisServiceSecurityGroup.addIngressRule(
                ec2.Peer.securityGroupId(props.qlashMainInstanceSecurityGroup.securityGroupId),
                ec2.Port.tcp(6379),
                'Allow inbound traffic to qmm_redis from qmmMain instance'
            )
        }

        qmmRedisServiceSecurityGroup.addIngressRule(
            ec2.Peer.ipv4(props.vpc.vpcCidrBlock),
            ec2.Port.tcp(6379),
            'Allow inbound traffic to qmm_redis from resources in qlashMainClusterVpc'
        )

        const qmmRedisTaskDefinition = new ecs.FargateTaskDefinition(this, `${props.STAGE.toLowerCase()}qmmRedisTask`, {
            cpu: 2048,
            memoryLimitMiB: 8192,
            volumes: [
                {
                    name: `${props.STAGE.toLowerCase()}_qmm_redis_volume`,
                    efsVolumeConfiguration: {
                        fileSystemId: qmmTasksEfs.fileSystemId,
                        transitEncryption: 'ENABLED',
                        authorizationConfig: {
                            accessPointId: qmmRedisEfsAccessPoint.accessPointId,
                            iam: 'ENABLED'
                        }
                    }
                }
            ]
        })

        qmmRedisTaskDefinition.addToTaskRolePolicy(
            new iam.PolicyStatement({
                actions: [
                    'elasticfilesystem:ClientWrite',
                    'elasticfilesystem:ClientMount',
                    'elasticfilesystem:ClientRootAccess',
                    'elasticfilesystem:DescribeMountTargets',
                    'elasticfilesystem:CreateAccessPoint',
                    'elasticfilesystem:DeleteAccessPoint'
                ],
                resources: [qmmTasksEfs.fileSystemArn],
            })
        )

        qmmRedisTaskDefinition.addToTaskRolePolicy(
            new iam.PolicyStatement({
                actions: [
                    'elasticfilesystem:DescribeAccessPoints',
                    'elasticfilesystem:DescribeFileSystems'
                ],
                resources: ["*"],
            })
        )

        qmmRedisTaskDefinition.addToTaskRolePolicy(
            new iam.PolicyStatement({
                actions: ['ec2:DescribeAvailabilityZones'],
                resources: ['*']
            })
        )

Reproduction Steps

Running the code above to remount the EFS file system produces this error:

ResourceInitializationError: failed to invoke EFS utils commands to set up EFS volumes: stderr: Failed to resolve "fs-006afd6cee7891114.efs.eu-central-1.amazonaws.com" - check that your file system ID is correct, and ensure that the VPC has an EFS mount target for this file system ID. See https://docs.aws.amazon.com/console/efs/mount-dns-name for more detail. Attempting to lookup mount target ip address using botocore. Failed to import necessary dependency botocore, please install botocore first. : unsuccessful EFS utils command execution; code: 1

What really sounds strange is this:

Attempting to lookup mount target ip address using botocore. Failed to import necessary dependency botocore, please install botocore first

Possible Solution

I don't even know if this is something that is up to you guys or if it is an internal error from EFS itself

Additional Information/Context

I've tried giving the task permissions on everything, just to check whether it was a permission issue, but to no avail.

CDK CLI Version

2.88

Framework Version

No response

Node.js Version

v18.15.0

OS

Linux - Ubuntu

Language

TypeScript

Language Version

No response

Other information

No response

peterwoodworth commented 1 year ago

Have you tried researching the service errors? e.g. https://repost.aws/knowledge-center/fargate-unable-to-mount-efs

This doesn't look like a CDK issue at first glance, but rather either a configuration issue or possibly a service bug. We can't rule anything out yet, though. I'm just curious how much you've looked into and double-checked the configuration.

ETisREAL commented 1 year ago

@peterwoodworth I will try out the troubleshooting procedures indicated in the link. So far, I've tried granting all IAM permissions to the task (just to see if the issue was there). I've also retained the security group, which (as expected, honestly) doesn't make a difference. The task runs the same container I used in the first creation, a redis:alpine instance.

I doubt myself that this issue is CDK related. Where should I bring this up though? Is there a specific AWS Forum for each service?

peterwoodworth commented 1 year ago

I doubt myself that this issue is CDK related. Where should I bring this up though? Is there a specific AWS Forum for each service?

Well, it might be CDK related. I didn't look at this in-depth enough to rule out CDK. Though, I'm not super familiar with these services so I'm not sure without a deep dive.

Other places to get help are AWS Premium Support and re:Post.

Let me know if you are able to figure out where the error is coming from or if you've been able to unblock

ETisREAL commented 1 year ago

@peterwoodworth as of now, still stuck. Will let you know if I figure this out. Thank you Peter

peterwoodworth commented 1 year ago

Sorry, what exactly is it that you mean by "remount" the file system?

ETisREAL commented 1 year ago

@peterwoodworth yeah sorry, it is vague. Basically I mean reusing (reattaching) an existing EFS file system (one that had RemovalPolicy.RETAIN set, for instance) when launching a stack, so that the ECS tasks that were using it can mount it again through the same access point and retrieve the data. Does that make sense?

peterwoodworth commented 1 year ago

Sorry for the delay @ETisREAL,

I'm not exactly sure what you mean. If you could provide clear repro steps, including the code deployed at each step, it would be really helpful, especially if it's a full reproduction that's as minimal as possible.

github-actions[bot] commented 1 year ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

ETisREAL commented 1 year ago

No worries @peterwoodworth Thanks for the help either way.

In order to reproduce it:

  1. Launch this stack first:

cdk deploy "StackName"

const qmmTasksEfsSecurityGroup = new ec2.SecurityGroup(this, `qmmTasksEfsSecurityGroup`, {
            vpc: props.vpc,
            securityGroupName: `qmmTasksEfsSecurityGroup`
        })

        let qmmTasksEfs: efs.IFileSystem
        let qmmRedisEfsAccessPoint: efs.IAccessPoint

        if (true) {
            qmmTasksEfs = new efs.FileSystem(this, `qmmTasksEfs`, {
                fileSystemName: `qmmTasksEfs`,
                vpc: props.vpc,
                removalPolicy: cdk.RemovalPolicy.RETAIN,
                securityGroup: qmmTasksEfsSecurityGroup,
                encrypted: true,
                lifecyclePolicy: efs.LifecyclePolicy.AFTER_30_DAYS,
                enableAutomaticBackups: true
            })

            new cdk.CfnOutput(this, 'QlashMainClusterEFSID', {
                exportName: 'QlashMainClusterEFSID',
                value: qmmTasksEfs.fileSystemId
            })

            qmmRedisEfsAccessPoint = new efs.AccessPoint(this, `qmmRedisAccessPoint`, {
                fileSystem: qmmTasksEfs,
                path: '/redis',
                createAcl: {
                    ownerGid: '1001',
                    ownerUid: '1001',
                    permissions: '750'
                },
                posixUser: {
                    uid: '1001',
                    gid: '1001'
                }
            })

            qmmRedisEfsAccessPoint.applyRemovalPolicy(cdk.RemovalPolicy.RETAIN)

            new cdk.CfnOutput(this, 'QlashMainClusterRedisAccessPointID', {
                exportName: 'QlashMainClusterRedisAccessPointID',
                value: qmmRedisEfsAccessPoint.accessPointId
            })

        } else {

            qmmTasksEfs = efs.FileSystem.fromFileSystemAttributes(this, `qmmTasksEfs`, {
                securityGroup: qmmTasksEfsSecurityGroup,
                fileSystemId: config.QlashMainClusterEFSID
            })

            qmmRedisEfsAccessPoint = efs.AccessPoint.fromAccessPointId(this, `qmmRedisAccessPoint`, config.QlashMainClusterRedisAccessPointID)
        }

        // Redis

        const qmmRedisServiceSecurityGroup = new ec2.SecurityGroup(this, `qmmRedisSecurityGroup`, {
            vpc: props.vpc,
            securityGroupName: `qmmRedisSecurityGroup`
        })

        qmmTasksEfsSecurityGroup.addIngressRule(
            ec2.Peer.securityGroupId(qmmRedisServiceSecurityGroup.securityGroupId),
            ec2.Port.tcp(2049),
            'Allow inbound traffic from qmm_redis to qmmTasksEfs'
        )

        const qmmRedisTaskDefinition = new ecs.FargateTaskDefinition(this, `qmmRedisTask`, {
            cpu: 2048,
            memoryLimitMiB: 8192,
            volumes: [
                {
                    name: `qmm_redis_volume`,
                    efsVolumeConfiguration: {
                        fileSystemId: qmmTasksEfs.fileSystemId,
                        transitEncryption: 'ENABLED',
                        authorizationConfig: {
                            accessPointId: qmmRedisEfsAccessPoint.accessPointId,
                            iam: 'ENABLED'
                        }
                    }
                }
            ]
        })

        qmmRedisTaskDefinition.addToTaskRolePolicy(
            new iam.PolicyStatement({
                actions: [
                    'elasticfilesystem:ClientWrite',
                    'elasticfilesystem:ClientMount',
                    'elasticfilesystem:ClientRootAccess',
                    'elasticfilesystem:DescribeMountTargets',
                    'elasticfilesystem:CreateAccessPoint',
                    'elasticfilesystem:DeleteAccessPoint'
                ],
                resources: [qmmTasksEfs.fileSystemArn],
            })
        )

        qmmRedisTaskDefinition.addToTaskRolePolicy(
            new iam.PolicyStatement({
                actions: [
                    'elasticfilesystem:DescribeAccessPoints',
                    'elasticfilesystem:DescribeFileSystems'
                ],
                resources: ["*"],
            })
        )

        qmmRedisTaskDefinition.addToTaskRolePolicy(
            new iam.PolicyStatement({
                actions: ['ec2:DescribeAvailabilityZones'],
                resources: ['*']
            })
        )

const qmmRedisContainer = qmmRedisTaskDefinition.addContainer(`qmm_redis`, {
            image: ecs.ContainerImage.fromAsset('redis'),
            containerName: `qmm_redis`,
            portMappings: [{ containerPort: 6379, name: `qmm_redis` }],
            healthCheck: {
                command: ["CMD", "redis-cli", "-h", "localhost", "-p", "6379", "ping"],
                interval: cdk.Duration.seconds(20),
                timeout: cdk.Duration.seconds(20),
                retries: 5
            },
            logging: ecs.LogDriver.awsLogs({streamPrefix: `qmm_redis`, logRetention: RetentionDays.ONE_DAY}),
            command: ["redis-server", "/usr/local/etc/redis/redis.conf"]
        })

        qmmRedisContainer.addMountPoints({
            sourceVolume: `qmm_redis_volume`,
            containerPath: '/redis/data',
            readOnly: false
        })

        const qmmRedisService = new ecs.FargateService(this, `qmmRedisService`, {
            serviceName: `qmmRedisService`,
            cluster: qlashMainCluster,
            desiredCount: 1,
            securityGroups: [qmmRedisServiceSecurityGroup],
            taskDefinition: qmmRedisTaskDefinition
        })

  2. Tear down the stack:

cdk destroy "StackName"

This will retain the EFS file system and the access point because of cdk.RemovalPolicy.RETAIN.

  3. Redeploy the stack:
    • Change the if (true) { to if (false) { to make sure it reuses the same EFS file system instead of creating a new one.
    • Hardcode (or import from the cdk.json file, as you prefer) the values of the file system ID and access point ID here.

cdk deploy "StackName"

qmmTasksEfs = efs.FileSystem.fromFileSystemAttributes(this, `qmmTasksEfs`, {
                securityGroup: qmmTasksEfsSecurityGroup,
                fileSystemId: config.QlashMainClusterEFSID
            })

            qmmRedisEfsAccessPoint = efs.AccessPoint.fromAccessPointId(this, `qmmRedisAccessPoint`, config.QlashMainClusterRedisAccessPointID)

cosbor11 commented 1 year ago

I'm having the same issue when importing an EFS file system.

pahud commented 10 months ago

Does it only happen when importing or re-using an existing EFS filesystem?

ResourceInitializationError: failed to invoke EFS utils commands to set up EFS volumes: stderr: Failed to resolve "fs-006afd6cee7891114.efs.eu-central-1.amazonaws.com" - check that your file system ID is correct, and ensure that the VPC has an EFS mount target for this file system ID. See https://docs.aws.amazon.com/console/efs/mount-dns-name for more detail. Attempting to lookup mount target ip address using botocore. Failed to import necessary dependency botocore, please install botocore first. : unsuccessful EFS utils command execution; code: 1

After you destroy the stack with removal policy as RETAIN, are you still able to see/list this filesystem ID in the EFS console? Not sure if this is a bug but looks like this filesystem ID is invalid when the resource is destroyed with retain removal policy?

ETisREAL commented 10 months ago

@pahud yes I can still see the filesystem from the console and list it with its id

vipulaSD commented 9 months ago

I'm having the same issue. One observation that has not been mentioned in the previous discussion:

When CDK destroys the current stack, it deletes the mount targets from the EFS file system.
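
The observation above is consistent with the error text. The mount helper tries to resolve the regional DNS name of the file system, and as far as the EFS documentation describes it, that name only resolves inside a VPC for Availability Zones that have a mount target. The following is a minimal diagnostic sketch, not part of the original report: the function names are hypothetical, and only the DNS name format is taken from the error message.

```typescript
// Hypothetical helpers for reasoning about the "Failed to resolve" error.
// Only the DNS name format comes from the error message in this thread.

/** Builds the regional DNS name that the EFS mount helper tries to resolve. */
function efsDnsName(fileSystemId: string, region: string): string {
  return `${fileSystemId}.efs.${region}.amazonaws.com`;
}

/**
 * Returns the Availability Zones where tasks may be placed but the file
 * system has no mount target. Duplicates in taskAzs are reported once.
 * Assumption: the caller has already looked up both lists, for example
 * from the "Network" tab of the EFS console and the task subnets.
 */
function azsWithoutMountTarget(
  taskAzs: string[],
  mountTargetAzs: string[],
): string[] {
  return taskAzs.filter(
    (az, i) => taskAzs.indexOf(az) === i && mountTargetAzs.indexOf(az) === -1,
  );
}
```

If `azsWithoutMountTarget` returns a non-empty list after the stack has been destroyed and redeployed, the retained file system has lost its mount targets in those zones, which would explain why the name built by `efsDnsName` no longer resolves for tasks placed there.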

ETisREAL commented 9 months ago

I'm having the same issue. One observation that has not been mentioned in the previous discussion:

When CDK destroys the current stack, it deletes the mount targets from the EFS file system.

Even when you explicitly set the removal policy to RETAIN? E.g.

qmmRedisEfsAccessPoint.applyRemovalPolicy(cdk.RemovalPolicy.RETAIN)

Because I haven't noticed it if I set the policy to RETAIN

vipulaSD commented 9 months ago

I'm having the same issue. One observation that has not been mentioned in the previous discussion: when CDK destroys the current stack, it deletes the mount targets from the EFS file system.

Even when you explicitly set the removal policy to RETAIN? E.g.

qmmRedisEfsAccessPoint.applyRemovalPolicy(cdk.RemovalPolicy.RETAIN)

Because I haven't noticed it if I set the policy to RETAIN

Yes, I have set the removal policy for the AccessPoint. The access point is retained, but the entries in the "Network" tab (the mount targets) get deleted.

UPDATE: I manually recreated the entries in the Network tab. After that, the service starts as expected.

srshi commented 8 months ago

I had the same issue. In my code I don't retain the VPC, so it makes sense: mount targets have ENIs, and deleting the VPC would fail while it still has ENIs. I solved it by creating new mount targets with CfnMountTarget when I reuse the file system, almost the same method that @vipulaSD described.
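
The workaround described above can be sketched as a small planning step. This is only an illustration under stated assumptions, not the code used by anyone in this thread: `SubnetInfo`, `MountTargetProps`, and `planMountTargets` are hypothetical names, and the three output fields are chosen to mirror the `fileSystemId`, `subnetId`, and `securityGroups` properties of `efs.CfnMountTarget`. It picks one subnet per Availability Zone, since EFS allows only one mount target per zone for a given file system.

```typescript
// Hypothetical shapes; the MountTargetProps fields mirror the fileSystemId,
// subnetId and securityGroups properties of efs.CfnMountTarget.
interface SubnetInfo {
  subnetId: string;
  availabilityZone: string;
}

interface MountTargetProps {
  fileSystemId: string;
  subnetId: string;
  securityGroups: string[];
}

/**
 * Plans one mount target per Availability Zone for an imported file system.
 * When several subnets share a zone, the first one listed wins.
 */
function planMountTargets(
  fileSystemId: string,
  subnets: SubnetInfo[],
  securityGroupId: string,
): MountTargetProps[] {
  const seenAzs: string[] = [];
  const plan: MountTargetProps[] = [];
  for (const subnet of subnets) {
    if (seenAzs.indexOf(subnet.availabilityZone) !== -1) {
      continue; // this zone already has a planned mount target
    }
    seenAzs.push(subnet.availabilityZone);
    plan.push({
      fileSystemId,
      subnetId: subnet.subnetId,
      securityGroups: [securityGroupId],
    });
  }
  return plan;
}
```

In a CDK stack, each planned entry could then be passed to `new efs.CfnMountTarget(this, someUniqueId, entry)` in the branch that imports the file system, with the subnet list taken from something like `props.vpc.privateSubnets` and the security group ID from `qmmTasksEfsSecurityGroup.securityGroupId`. Which subnets to use depends on where the Fargate tasks run, so treat that choice as an assumption to verify.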