aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

ecs: "The provided launch template does not expose its user data" when trying to add a second capacity provider #30742

Open rantoniuk opened 3 days ago

rantoniuk commented 3 days ago

Describe the bug

The code below works fine up to the line marked `---------------- inf1`, that is, with only gpuCapacityProvider. When I add a second capacity provider, inf1CP, backed by a new LaunchTemplate that does not set userData at all, cdk diff fails with:

Error: The provided launch template does not expose its user data.
    at AutoScalingGroup.get userData [as userData] (infra/cdk/node_modules/aws-cdk-lib/aws-autoscaling/lib/auto-scaling-group.js:1:24056)
    at AutoScalingGroup.addUserData (infra/cdk/node_modules/aws-cdk-lib/aws-autoscaling/lib/auto-scaling-group.js:1:22335)
    at Cluster.configureAutoScalingGroup (infra/cdk/node_modules/aws-cdk-lib/aws-ecs/lib/cluster.js:1:11190)
    at Cluster.addAsgCapacityProvider (infra/cdk/node_modules/aws-cdk-lib/aws-ecs/lib/cluster.js:1:9915)
    at new EcsStack (infra/cdk/lib/ecs-stack.ts:130:18)
    at Object.<anonymous> (infra/cdk/bin/cdk.ts:35:13)
    at Module._compile (node:internal/modules/cjs/loader:1358:14)
    at Module.m._compile (infra/cdk/node_modules/ts-node/src/index.ts:1618:23)
    at Module._extensions..js (node:internal/modules/cjs/loader:1416:10)
    at Object.require.extensions.<computed> [as .ts] (infra/cdk/node_modules/ts-node/src/index.ts:1621:12)

Subprocess exited with error 1

which is specifically caused by this line:

    this.cluster.addAsgCapacityProvider(inf1CP);

import { Stack, StackProps } from 'aws-cdk-lib';
import { AutoScalingGroup, IAutoScalingGroup } from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { AsgCapacityProvider, Cluster } from 'aws-cdk-lib/aws-ecs';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';
import { IEnvironmentConfig } from './helpers/environment-config';

interface EcsStackProps extends StackProps {
  envv: IEnvironmentConfig;
  vpc: ec2.Vpc;
}

export class EcsStack extends Stack {
  readonly cluster: Cluster;
  readonly execRole: iam.IRole;
  readonly gpuAutoScalingGroup: IAutoScalingGroup;

  constructor(scope: Construct, id: string, props: EcsStackProps) {
    super(scope, id, props);

    this.cluster = new Cluster(this, 'EcsCluster', {
      clusterName: 'EcsCluster',
      vpc: props.vpc,
    });

    // Ec2 Security Group
    const gpuinstanceSecurityGroup = new ec2.SecurityGroup(this, 'EcsGpuInstanceSg', {
      securityGroupName: 'EcsGpuInstanceSg',
      description: ' security group for gpu instances for ecs tasks',
      vpc: props.vpc,
    });

    // EC2 Execution Role with access to ECS actions
    const ltRole = new iam.Role(this, 'EcsClusterRole', {
      roleName: 'ecs-cluster-role',
      assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('CloudWatchAgentServerPolicy'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryReadOnly'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2ContainerServiceforEC2Role'),
      ],
    });

    const rootVolume: ec2.BlockDevice = {
      deviceName: '/dev/xvda',
      volume: ec2.BlockDeviceVolume.ebs(100),
    };

    // set GPU as the default for Docker
    const userData = ec2.UserData.forLinux();
    userData.addCommands(
      'sudo rm /etc/sysconfig/docker',
      'echo DAEMON_MAXFILES=1048576 | sudo tee -a /etc/sysconfig/docker',
      'echo OPTIONS="--default-ulimit nofile=32768:65536 --default-runtime nvidia" | sudo tee -a /etc/sysconfig/docker',
      'echo DAEMON_PIDFILE_TIMEOUT=10 | sudo tee -a /etc/sysconfig/docker',
      'sudo systemctl restart docker',
    );

    // GPU EC2 Launch Template
    const launchTemplate = new ec2.LaunchTemplate(this, 'EcsClusterLt', {
      launchTemplateName: 'ecs-gpu-lt',
      machineImage: ec2.MachineImage.genericLinux({
        // ecs optimised image with gpu support
        'us-west-2': 'ami-027492973b111510a',
      }),
      instanceType: new ec2.InstanceType('g4dn.xlarge'),
      role: ltRole,
      userData: userData,
      securityGroup: gpuinstanceSecurityGroup,
      blockDevices: [rootVolume],
      requireImdsv2: true,
    });

    // Add GPU autoscaling capacity provider to the cluster
    const gpuAutoScalingGroup = new AutoScalingGroup(this, 'EcsGpuASG', {
      autoScalingGroupName: 'EcsGpuASG',
      vpc: props.vpc,
      launchTemplate,
      minCapacity: 0,
      maxCapacity: 1,
    });

    //Add the capacity to the cluster
    const gpuCapacityProvider = new AsgCapacityProvider(this, 'EcsGpuCapacityProvider', {
      autoScalingGroup: gpuAutoScalingGroup,
      capacityProviderName: 'gpuCapacityProvider',
    });

    this.cluster.addAsgCapacityProvider(gpuCapacityProvider);

    this.cluster.addDefaultCloudMapNamespace({
      name: 'local',
      useForServiceConnect: true,
    });

    // ---------------- inf1

    // Inf1 EC2 Launch Template
    const launchTemplateInf1 = new ec2.LaunchTemplate(this, 'EcsClusterInf1', {
      machineImage: ec2.MachineImage.genericLinux({
        // aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended
        'us-west-2': 'ami-00a3a4671e9889e76',
      }),
      instanceType: new ec2.InstanceType('inf1.2xlarge'),
      role: ltRole,
      securityGroup: gpuinstanceSecurityGroup,
      // blockDevices: [rootVolume],
      requireImdsv2: true,
    });

    const inf1ASG = new AutoScalingGroup(this, 'EcsInf1ASG', {
      autoScalingGroupName: 'EcsInf1ASG',
      vpc: props.vpc,
      launchTemplate: launchTemplateInf1,
      minCapacity: 0,
      maxCapacity: 1,
    });

    //Add the capacity to the cluster
    const inf1CP = new AsgCapacityProvider(this, 'EcsInf1CapacityProvider', {
      autoScalingGroup: inf1ASG,
      capacityProviderName: 'Inf1AsgCapacityProvider',
    });

    this.cluster.addAsgCapacityProvider(inf1CP);

    this.cluster.addDefaultCapacityProviderStrategy([
      { capacityProvider: gpuCapacityProvider.capacityProviderName, weight: 1 },
      { capacityProvider: inf1CP.capacityProviderName, weight: 0 },

    ]);
  }
}

Expected Behavior

-

Current Behavior

-

Reproduction Steps

-

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.146.0 (build b368c78)

Framework Version

No response

Node.js Version

v20.13.1

OS

MacOS

Language

TypeScript

Language Version

"typescript": "~5.2.0"

Other information

No response

ashishdhingra commented 2 days ago

@rantoniuk Good afternoon. Thanks for opening the issue. Judging by the stack trace, the error is thrown from the userData getter of AutoScalingGroup. Please refer to the Clusters section of the Amazon ECS Construct Library README, which says: "To use LaunchTemplate with AsgCapacityProvider, make sure to specify the userData in the LaunchTemplate." Does the error go away once you explicitly specify userData in the second LaunchTemplate, as you did in the first?

We also have an open issue, https://github.com/aws/aws-cdk/issues/26035#issuecomment-1600839939, about improving the error message when user data is missing from the launch template; however, we don't have an ETA as of now.

Thanks, Ashish

pahud commented 2 days ago

Yes.

If you look at the stack trace, it fails at this method:

AutoScalingGroup.addUserData

message: The provided launch template does not expose its user data.

And if you check here:

https://github.com/aws/aws-cdk/blob/b7f626b9c845dd4161c4d37fd32835438b05124a/packages/aws-cdk-lib/aws-autoscaling/lib/auto-scaling-group.ts#L1702-L1712

If a launchTemplate is provided, it must have a userData attribute.
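To make the failure mode concrete, here is a minimal, self-contained model of that check. It is a simplified sketch, not the actual aws-cdk-lib source: `UserDataModel`, `LaunchTemplateLike`, `AsgModel`, and `registerWithCluster` are hypothetical names used only for illustration, and the exact command the cluster appends is an assumption.

```typescript
// Simplified model of the behavior discussed above; NOT the real aws-cdk-lib code.
// All names here (UserDataModel, LaunchTemplateLike, AsgModel) are hypothetical.

class UserDataModel {
  private readonly lines: string[] = [];

  addCommands(...commands: string[]): void {
    this.lines.push(...commands);
  }

  render(): string {
    return this.lines.join('\n');
  }
}

interface LaunchTemplateLike {
  // Undefined when the template was created without a userData property.
  readonly userData?: UserDataModel;
}

class AsgModel {
  constructor(private readonly launchTemplate: LaunchTemplateLike) {}

  // Mirrors the reported error: the getter throws when the template
  // carries no user data object to append commands to.
  get userData(): UserDataModel {
    if (!this.launchTemplate.userData) {
      throw new Error('The provided launch template does not expose its user data.');
    }
    return this.launchTemplate.userData;
  }

  addUserData(...commands: string[]): void {
    this.userData.addCommands(...commands);
  }
}

// Stand-in for the cluster registration step seen in the stack trace
// (configureAutoScalingGroup -> addUserData). The command is an assumption.
function registerWithCluster(asg: AsgModel, clusterName: string): void {
  asg.addUserData(`echo ECS_CLUSTER=${clusterName} >> /etc/ecs/ecs.config`);
}
```

In this model, an empty user data object is enough to make registration succeed, which matches the advice to pass userData explicitly even when there are no custom commands to add.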

Looking at your launchTemplateInf1, it is missing userData:

    const launchTemplateInf1 = new ec2.LaunchTemplate(this, 'EcsClusterInf1', {
      machineImage: ec2.MachineImage.genericLinux({
        // aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended
        'us-west-2': 'ami-00a3a4671e9889e76',
      }),
      instanceType: new ec2.InstanceType('inf1.2xlarge'),
      role: ltRole,
      securityGroup: gpuinstanceSecurityGroup,
      // blockDevices: [rootVolume],
      requireImdsv2: true,
    });

rantoniuk commented 2 days ago

Yes, I confirm that fixes the issue:

    const userDataInf1 = ec2.UserData.forLinux();

    // Inf1 EC2 Launch Template
    const launchTemplateInf1 = new ec2.LaunchTemplate(this, 'EcsClusterInf1', {
      machineImage:
        ec2.MachineImage.fromSsmParameter(
          '/aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended/image_id',
        ),
      instanceType: new ec2.InstanceType('inf1.2xlarge'),
      role: ltRole,
      userData: userDataInf1,
      securityGroup: gpuinstanceSecurityGroup,
      // blockDevices: [rootVolume],
      requireImdsv2: true,
    });

However, let me ask two follow-up questions:

  1. Is this a CloudFormation requirement or a CDK requirement? If the latter, then rather than documenting it in the README, CDK should add ec2.UserData.forLinux() automatically unless userData is defined explicitly.

  2. Unrelated to the initial issue, but when I tried to use:

    machineImage: ec2.MachineImage.fromSsmParameter(
      '/aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended',
    ),

    then CloudFormation complained that it couldn't find imageId. I had to append a suffix that I could not find documented, giving '/aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended/image_id'. This may be worth adding to the documentation directly.
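On the second point, a likely explanation (an assumption, not confirmed in this thread) is that the `.../recommended` parameter stores a JSON document describing the AMI, while `.../recommended/image_id` stores only the AMI ID itself. The sketch below uses a made-up sample value to show why the parent parameter cannot be used directly as an image ID. The field names are assumed from the shape of the ECS-optimized AMI metadata, the values are placeholders, and `looksLikeAmiId` and `extractImageId` are hypothetical helpers.

```typescript
// Hypothetical sample of what the parent ".../recommended" parameter may contain.
// The values are placeholders, not real AMI metadata.
const recommendedValue = JSON.stringify({
  schema_version: 1,
  image_name: 'example-ecs-neuron-ami',
  image_id: 'ami-0123456789abcdef0',
  os: 'Amazon Linux 2023',
});

// An EC2 image ID looks like "ami-" followed by hex digits.
function looksLikeAmiId(value: string): boolean {
  return /^ami-[0-9a-f]{8,17}$/.test(value);
}

// Pulls image_id out of the JSON document. Under the assumption above, this is
// the value that ".../recommended/image_id" would hold directly.
function extractImageId(parameterValue: string): string {
  const parsed: { image_id?: unknown } = JSON.parse(parameterValue);
  if (typeof parsed.image_id !== 'string') {
    throw new Error('image_id missing from parameter value');
  }
  return parsed.image_id;
}

console.log(looksLikeAmiId(recommendedValue)); // false
console.log(looksLikeAmiId(extractImageId(recommendedValue))); // true
```

If that assumption holds, pointing fromSsmParameter at the `image_id` sub-parameter is the right choice, because the image ID property of a launch template expects a bare AMI ID rather than a JSON string.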