aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws-batch-alpha: lambda not authorized to perform `batch:SubmitJob` after upgrading from 2.69.0 to 2.78.0 #25574

Closed suzhoum closed 1 year ago

suzhoum commented 1 year ago

Describe the bug

I'm trying to upgrade from 2.69.0 to the latest 2.78.0, and I hit an authorization error when calling batch:SubmitJob from a Lambda function. The error message is:

arn:aws:sts::xxx:assumed-role/ag-bench-test-batch-stack-agbenchtestbatchjobfunct-1AQCSFR51GLG7/ag-bench-test-batch-job-function is not authorized to perform: batch:SubmitJob on resource: arn:aws:batch:us-west-2:xxx:job-definition/jobdefinitionED9E5E04-dd5ddb78a49496b
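
To narrow this down, it can help to check whether any Resource entry in the Lambda role's policy actually covers the ARN named in the error. Below is a minimal sketch using only the Python standard library. It approximates IAM's wildcard matching (`*` and `?`), and the entries in `granted` are hypothetical placeholders, not values taken from this stack.

```python
import re


def arn_pattern_matches(pattern: str, arn: str) -> bool:
    """Approximate IAM resource matching: '*' matches any run, '?' one character."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, arn) is not None


def covering_patterns(policy_resources, denied_arn):
    """Return the policy Resource entries that cover the denied ARN."""
    return [p for p in policy_resources if arn_pattern_matches(p, denied_arn)]


# ARN copied from the error message (account ID redacted as in the report).
denied = "arn:aws:batch:us-west-2:xxx:job-definition/jobdefinitionED9E5E04-dd5ddb78a49496b"

# Hypothetical Resource entries; replace them with the ones in the Lambda role's policy.
granted = [
    "arn:aws:batch:us-west-2:xxx:job-definition/jobdefinitionED9E5E04-dd5ddb78a49496b:1",
    "arn:aws:batch:us-west-2:xxx:job-queue/example-queue",
]
wildcard = "arn:aws:batch:us-west-2:xxx:job-definition/*"

print(covering_patterns(granted, denied))               # []
print(covering_patterns(granted + [wildcard], denied))  # only the wildcard entry matches
```

If the first list comes back empty, no statement covers the denied ARN as written, and the policy's resources are worth comparing character by character with the ARN in the error.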

Expected Behavior

The Lambda function should still be able to perform batch:SubmitJob after the upgrade to v2.79.0.

Current Behavior

I tried my best to update my code to generate exactly the same CloudFormation template that was generated in 2.69.0, but there are still some major differences.

I'm posting the code snippet that we have changed in this project in order to upgrade:

in v2.69.0:

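import aws_cdk as core
from aws_cdk import aws_batch_alpha as batch
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_iam as iam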

container = batch.JobDefinitionContainer(
            image=docker_container_image,
            gpu_count=container_gpu,
            vcpus=container_vcpu,
            memory_limit_mib=container_memory,
            linux_params=ecs.LinuxParameters(self, f"{prefix}-linux_params", shared_memory_size=container_memory),
        )

job_definition = batch.JobDefinition(
            self,
            "job-definition",
            container=container,
            retry_attempts=3,
            timeout=core.Duration.minutes(1500),
        )

batch_instance_role = iam.Role(
            self,
            f"{prefix}-instance-role",
            assumed_by=iam.CompositePrincipal(
                iam.ServicePrincipal("ec2.amazonaws.com"),
                iam.ServicePrincipal("ecs.amazonaws.com"),
                iam.ServicePrincipal("ecs-tasks.amazonaws.com"),
            ),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AmazonEC2ContainerServiceforEC2Role"),
            ],
        )

batch_instance_profile = iam.CfnInstanceProfile(
            self, 
            f"{prefix}-instance-profile", 
            roles=[batch_instance_role.role_name]
        )

compute_environment = batch.ComputeEnvironment(
            self,
            f"{prefix}-compute-environment",
            compute_resources=batch.ComputeResources(
                allocation_strategy=batch.AllocationStrategy.BEST_FIT_PROGRESSIVE,
                vpc=vpc,
                vpc_subnets=ec2.SubnetSelection(subnets=vpc.private_subnets),
                maxv_cpus=compute_env_maxv_cpus,
                instance_role=batch_instance_profile.profile_arn,
                instance_types=instances,
                security_groups=[sg],
                type=batch.ComputeResourceType.ON_DEMAND,
                launch_template=batch.LaunchTemplateSpecification(
                    launch_template_name=batch_launch_template_name  # LaunchTemplate.launch_template_name returns None
                ),
            ),
        )

job_queue = batch.JobQueue(
            self,
            f"{prefix}-job-queue",
            priority=1,
            compute_environments=[batch.JobQueueComputeEnvironment(compute_environment=compute_environment, order=1)],
        )

in v2.79.0:

import os

import aws_cdk as core
from aws_cdk import aws_batch_alpha as batch
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_iam as iam

container = batch.EcsEc2ContainerDefinition(
                self, 
                f"{prefix}-container-definition",
                image=docker_container_image,
                memory=core.Size.mebibytes(container_memory),
                cpu=container_vcpu,
                gpu=container_gpu,
                environment={
                    "AWS_ACCOUNT": os.environ["CDK_DEPLOY_ACCOUNT"],
                    "AWS_REGION": os.environ["CDK_DEPLOY_REGION"],
                },
                execution_role=None,
                linux_parameters=batch.LinuxParameters(self, f"{prefix}-linux-params", shared_memory_size=core.Size.mebibytes(container_memory))
            )

job_definition = batch.EcsJobDefinition(
            self, 
            f"{prefix}-job-definition",
            container=container,
            retry_attempts=3,
            timeout=core.Duration.minutes(1500)
        )

batch_service_role = iam.Role(
            self,
            f"{prefix}-service-role",
            assumed_by=iam.CompositePrincipal(
                iam.ServicePrincipal("batch.amazonaws.com"),
            ),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AWSBatchServiceRole"),
            ],
        )

compute_environment = batch.ManagedEc2EcsComputeEnvironment(self, f"{prefix}-compute-environment",
            vpc=vpc,
            vpc_subnets=ec2.SubnetSelection(subnets=vpc.private_subnets),
            allocation_strategy=batch.AllocationStrategy.BEST_FIT_PROGRESSIVE,
            maxv_cpus=compute_env_maxv_cpus,
            instance_role=batch_instance_profile,
            instance_types=instances,
            security_groups=[sg],
            launch_template=launch_template,
            service_role=batch_service_role,
            use_optimal_instance_classes=False,
            update_to_latest_image_version=False,
            replace_compute_environment=True,
        )

The key difference I see in the CFN generated from the snippets above is that, in v2.79.0, a containerdefinitionExecutionRole and a containerdefinitionExecutionRoleDefaultPolicy are created:

"agbenchtestcontainerdefinitionExecutionRole0A25AAB3": {
   "Type": "AWS::IAM::Role",
   "Properties": {
    "AssumeRolePolicyDocument": {
     "Statement": [
      {
       "Action": "sts:AssumeRole",
       "Effect": "Allow",
       "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
       }
      }
     ],
     "Version": "2012-10-17"
    },
    "Tags": [
     {
      "Key": "ag-bench-test",
      "Value": "benchmark"
     }
    ]
   },
   "Metadata": {
    "aws:cdk:path": "ag-bench-test-batch-stack/ag-bench-test-container-definition/ExecutionRole/Resource"
   }
  },
  "agbenchtestcontainerdefinitionExecutionRoleDefaultPolicy2B49DF06": {
   "Type": "AWS::IAM::Policy",
   "Properties": {
    "PolicyDocument": {
     "Statement": [
      {
       "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
       ],
       "Effect": "Allow",
       "Resource": {
        "Fn::Join": [
         "",
         [
          "arn:",
          {
           "Ref": "AWS::Partition"
          },
          ":ecr:us-west-2:097403188315:repository/cdk-hnb659fds-container-assets-097403188315-us-west-2"
         ]
        ]
       }
      },
      {
       "Action": "ecr:GetAuthorizationToken",
       "Effect": "Allow",
       "Resource": "*"
      }
     ],
     "Version": "2012-10-17"
    },
    "PolicyName": "agbenchtestcontainerdefinitionExecutionRoleDefaultPolicy2B49DF06",
    "Roles": [
     {
      "Ref": "agbenchtestcontainerdefinitionExecutionRole0A25AAB3"
     }
    ]
   },
   "Metadata": {
    "aws:cdk:path": "ag-bench-test-batch-stack/ag-bench-test-container-definition/ExecutionRole/DefaultPolicy/Resource"
   }
  },

"AWS::Batch::ComputeEnvironment" also has two more properties in v2.79.0:

"ComputeResources": {
    "UpdateToLatestImageVersion": false
}
"ReplaceComputeEnvironment": true,

The Lambda function's CFN remained unchanged.
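
A comparison like the one above can be double-checked programmatically by diffing the two synthesized templates. The sketch below is standard-library Python and only looks at logical IDs and resource types. The inline templates are trimmed, illustrative stand-ins; in practice the two dicts would be loaded with `json.load` from the template files produced by `cdk synth` on each version. The logical ID `Fn` is a placeholder.

```python
def resources_by_type(template: dict) -> dict:
    """Map each CloudFormation resource type to the sorted logical IDs using it."""
    grouped = {}
    for logical_id, resource in template.get("Resources", {}).items():
        grouped.setdefault(resource["Type"], []).append(logical_id)
    return {rtype: sorted(ids) for rtype, ids in grouped.items()}


def added_resources(old: dict, new: dict) -> dict:
    """Return logical IDs present in `new` but not in `old`, grouped by type."""
    old_ids = set(old.get("Resources", {}))
    result = {}
    for rtype, ids in resources_by_type(new).items():
        extra = [i for i in ids if i not in old_ids]
        if extra:
            result[rtype] = extra
    return result


# Trimmed stand-ins for the two synthesized templates ("Fn" is a placeholder ID).
old_template = {"Resources": {"Fn": {"Type": "AWS::Lambda::Function"}}}
new_template = {
    "Resources": {
        "Fn": {"Type": "AWS::Lambda::Function"},
        "agbenchtestcontainerdefinitionExecutionRole0A25AAB3": {"Type": "AWS::IAM::Role"},
        "agbenchtestcontainerdefinitionExecutionRoleDefaultPolicy2B49DF06": {
            "Type": "AWS::IAM::Policy"
        },
    }
}

for rtype, ids in added_resources(old_template, new_template).items():
    print(rtype, ids)
```

Because logical IDs can change between CDK versions, a renamed resource shows up here as an addition, so the output is a starting point for inspection rather than a verdict.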

Reproduction Steps

See above

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.79.9

Framework Version

No response

Node.js Version

v18.13.0

OS

ubuntu

Language

Python

Language Version

No response

Other information

No response

comcalvi commented 1 year ago

Can you try passing the role instead of the instance profile? I'm surprised that even compiles. E.g., turn this:

compute_environment = batch.ManagedEc2EcsComputeEnvironment(self, f"{prefix}-compute-environment",
            instance_role=batch_instance_profile,
            # ...
        )

into this:

compute_environment = batch.ManagedEc2EcsComputeEnvironment(self, f"{prefix}-compute-environment",
            instance_role=batch_instance_role,
            # ...
        )

suzhoum commented 1 year ago

@comcalvi thanks for your response! I tried that but still got the same error. We used instance_role=batch_instance_profile.profile_arn in 2.69.0 and it worked, so we kept something similar in the new code.

comcalvi commented 1 year ago

@suzhoum can you share your lambda function's CDK definition on both versions? How does it relate to the CE?

suzhoum commented 1 year ago

Still facing the issue as of v2.89.0

comcalvi commented 1 year ago

This is another reason to add grant methods
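
For context, a grant method would bundle the action and the resource ARNs so that callers do not hand-write the IAM statement. The following is a rough, standard-library sketch of the statement such a helper might add, assuming batch:SubmitJob needs to be allowed on both the job definition and the job queue. `grant_submit_job` and `submit_job_statement` are hypothetical names used for illustration, not CDK APIs, and the ARNs are placeholders.

```python
import json


def submit_job_statement(job_definition_arn: str, job_queue_arn: str) -> dict:
    """Build an Allow statement for batch:SubmitJob on the two given resources."""
    return {
        "Effect": "Allow",
        "Action": "batch:SubmitJob",
        "Resource": [job_definition_arn, job_queue_arn],
    }


def grant_submit_job(policy_document: dict, job_definition_arn: str, job_queue_arn: str) -> dict:
    """Append the SubmitJob statement to a policy document (hypothetical helper)."""
    policy_document.setdefault("Version", "2012-10-17")
    policy_document.setdefault("Statement", []).append(
        submit_job_statement(job_definition_arn, job_queue_arn)
    )
    return policy_document


# Placeholder ARNs; substitute the values from the deployed stack.
policy = grant_submit_job(
    {},
    "arn:aws:batch:us-west-2:xxx:job-definition/example-definition",
    "arn:aws:batch:us-west-2:xxx:job-queue/example-queue",
)
print(json.dumps(policy, indent=2))
```

The point of such a helper is that the resource list is derived in one place, so the granted ARNs and the ARNs used at submit time are less likely to drift apart.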
