aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.6k stars 3.9k forks source link

aws-step-function-tasks: Allow removing launchType from RunEcsEc2Task #7967

Open mb-dev opened 4 years ago

mb-dev commented 4 years ago

capacity providers allow using ECS with dynamic capacity. https://aws.amazon.com/tw/about-aws/whats-new/2019/12/amazon-ecs-capacity-providers-now-available/ , yet CloudFormation and CDK do not yet support this feature https://github.com/aws/aws-cdk/issues/5471.

I propose as a stop gap solution to allow users to submit EC2 tasks without launchType: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-launchType

if launchType is provided, there is no way to get the task to use the capacity providers that are created outside CDK.

Use Case

I want to create capacity providers using boto3 and use them when launching EC2 tasks using step functions.

Proposed Solution

Provider a way to make launchType optional here: https://github.com/aws/aws-cdk/blob/master/packages/%40aws-cdk/aws-stepfunctions-tasks/lib/ecs/run-ecs-ec2-task.ts#L62

Or to change it after construction.

Other


This is a :rocket: Feature Request

shivlaks commented 4 years ago

since RunEcsEc2Task and RunEcsFargateTask are going to be marked as @deprecated, this feature should be introduced in RunTask

shivlaks commented 4 years ago

@mb-dev PR incoming once I merge in #8451 to make launchType optional. Step Functions does not support specifying capacityProviderStrategy as a parameter.

As per documentation you've referenced

If a capacityProviderStrategy is specified, the launchType parameter must be omitted. If no capacityProviderStrategy or launchType is specified, the defaultCapacityProviderStrategy for the cluster is used.

Although this unlocks some usage, it's still incomplete until users can supply a capacity provider strategy. Since we're limited I'm inclined to agree with your suggested stop-gap. It will still be compatible when capacityProviderStrategy can be supplied.

edit: added following question:

What happens if a cluster does not have a default capacity provider strategy? I believe that's optional too?

shivlaks commented 4 years ago

per @mb-dev in comment

@shivlaks what alarmed me is that the PR makes the definition depend more on launch type than before. launchTarget sounds like launchType, and it's tied to either EC2/Fargate. I am not sure if it will be clear how to express a undefined launchType while still specifying parameters to be used at creation.

Currently no launchType might look like:

const runTask = new tasks.EcsRunTask(stack, 'Run', {
integrationPattern: sfn.IntegrationPattern.RUN_JOB,
cluster,
taskDefinition,
launchTarget: new tasks.EcsEc2LaunchTarget({
launchType: undefined,
}),
});

If launchTarget would be named something like clusterConfiguration it might be easier to parse.

shivlaks commented 4 years ago

@mb-dev launchType was never really exposed as a property (with the old classes or with the new implementation) The thinking was to associate launchTarget closely to launchType.

I was thinking of introducing something like IEcsCapacityProvider to add the capability to specify capacity providers. Users would specify either capacity providers or a launch target (which would become optional).

const runTask = new tasks.EcsRunTask(stack, 'Run', {
    integrationPattern: sfn.IntegrationPattern.RUN_JOB,
    cluster,
    taskDefinition,
    capacityProviders: [new tasks.EcsEc2CapacityProvider({
      ...
    })],
  });

and adding types for EcsFargateCapacityProvider and EcsEc2CapacityProvider classes to help populate that list of capacity providers.

what do you think?

mb-dev commented 4 years ago

@shivlaks thanks for moving the discussion here. The capacity providers are a property of ECS cluster, while the task level property is choosing one of the providers specified in the cluster definition. Given that users might create ECS cluster or capacity providers outside of CDK, and that there's an option for default capacity provider (as this stop gap suggests)

there are actually 3 scenarios for using runTask in step functions: 1) users specifying launchType 2) users specifying capacityProvider 3) users specifying neither capacity provider or launchtype - in this case the default capacity provider is used. See: https://docs.aws.amazon.com/AmazonECS/latest/userguide/cluster-capacity-providers.html

Default capacity provider strategy
A default capacity provider strategy is associated with each Amazon ECS cluster. This determines the capacity provider strategy the cluster will use if no other capacity provider strategy or launch type is specified when running a task or creating a service.

One idea is to expose DEFAULT_CAPACITY_PROVIDER constant that can be assigned to launchType or launchTarget that will make it clear we want default capacity provider

mb-dev commented 4 years ago

Capacity Providers were added to Cloudformation: https://github.com/aws/aws-cdk/commit/4ce27f4195c70bd9e365ec0e0df5c0ede863bc8a#diff-4a4e15a081904ee16b4d84e2f5cf5aee

So now it's even more important to be able to omit launchType.

Related: https://github.com/aws/aws-cdk/issues/5471

shivlaks commented 4 years ago

@mb-dev

The recommended workaround for coverage that isn't quite supported in the aws-stepfunctions-tasks module is to supply the needed ASL as a custom state. You can create your task, and supply the .toStateJson() as input (you'd need to adjust for any parameters you want to add/remove/modify)

We're still discussing options, but are not convinced that the stop-gap needs to be introduced directly into the task or folded into the EcsEc2LaunchTarget / EcsFargateLaunchTarget classes yet.

launchtarget as a whole should likely be something we can omit. Can you help me better understand the EC2 / Fargate specific parameters that would need to be supplied (with the omission of launch target). I don't mind the constant idea but will want to better understand what the ASL looks like for these use cases.

mb-dev commented 4 years ago

Ooo, i totally missed the escape hatch in the changelog. I will try that tomorrow and maybe that will be enough until full capacity providers support is added.

My pre-refactor EC2 integration is straightforward (in Python):

sft.RunEcsEc2Task(
            cluster=cluster,
            task_definition=task_definition,
            integration_pattern=sfn.ServiceIntegrationPattern.SYNC,
            container_overrides={...},
        )

So I personally won't need launchTarget, just not launchType.

shivlaks commented 4 years ago

@mb-dev let me know if I can help with checking out the implementation of the custom state or if it would help for me to whip up an example!

PierreKiwi commented 3 years ago

Hello! I have been using the CustomState to avoid this problem but now I am facing the problem of losing ResultPath during the "translation" (cf. this issue #8754).

Not really sure there is an easy way to avoid the problem (I can wrap my step in a parallel step) but annoying...

trobert2 commented 2 years ago

is there any progress here? what does the roadmap look like? One year later, I still can't get the task to run on the correct capacity provider.

ferrarijefferson commented 1 year ago

is there any progress here? what does the roadmap look like? One year later, I still can't get the task to run on the correct capacity provider.

You can use FARGATE_SPOT with Step Functions, as shown in my tests. You just need to omit the LaunchType field in your state machine definition.

image

Remove the field "LaunchType": "FARGATE" from the parameters.

{
  "Comment": "State machine integrated with ECS",
  "StartAt": "Initial",
  "States": {
    "Initial": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "Parameters": {
        "Cluster": "${ECS_CLUSTER_ARN}",
        "PlatformVersion": "LATEST",
        "TaskDefinition": "${ECS_TASK_DEFINITION_ARN}",
        "NetworkConfiguration": {
          "AwsvpcConfiguration": {
            "Subnets": ${SUBNETS},
            "AssignPublicIp": "ENABLED"
          }
        },
        "Overrides": {
          "ContainerOverrides": ${ECS_CONTAINER_OVERRIDES}
        }
      },
      "TimeoutSeconds": 3600,
      "End": true
    }
  }
}
kevinbader commented 1 year ago

Just bumped into this as well. My use case is spawning long-running tasks on (expensive) GPU instances that are scaled to zero most of the time. When the StepFunction workflow is at EcsRunTask, the expectation is that the cluster spins up the GPU instance to do the work. But that only works if the default capacity provider is used; currently, the execution always fails with

No Container Instances were found in your cluster. (Service: AmazonECS; Status Code: 400; Error Code: InvalidParameterException; Request ID: 5652eb66-58cf-4720-a4d5-40bedf7ddf22; Proxy: null)

danw-mpl commented 10 months ago

I created a class which implements IEcsLaunchTarget so this is a drop-in workaround:

export class EcsFargateSpotLaunchTarget implements sfnTasks.IEcsLaunchTarget {
    /**
     * Launch the ECS task using Fargate Spot.
     * 
     * This functionality is not built into CDK yet.
     */
    constructor(private readonly cluster: ecs.Cluster, private readonly options?: sfnTasks.EcsFargateLaunchTargetOptions) { }

    /**
     * Called when the Fargate launch type configured on RunTask
     */
    public bind(_task: sfnTasks.EcsRunTask, launchTargetOptions: sfnTasks.LaunchTargetBindOptions): sfnTasks.EcsLaunchTargetConfig {
        this.cluster.enableFargateCapacityProviders();

        if (!launchTargetOptions.taskDefinition.isFargateCompatible) {
            throw new Error('Supplied TaskDefinition is not compatible with Fargate');
        }

        return {
            parameters: {
                PlatformVersion: this.options?.platformVersion,
                "CapacityProviderStrategy": [
                    {
                        "CapacityProvider": "FARGATE_SPOT",
                        "Weight": 1
                    }
                ],
            },
        };
    }
}
const task = new sfnTasks.EcsRunTask(this, 'RunTask', {
    ...
    launchTarget: new EcsFargateSpotLaunchTarget(ecsCluster)
});
Muppets commented 10 months ago

I like @danw-mpl solution. I went for a slightly different approach and removed the LaunchType property entirely, allowing the ECS task to take the default capacity provider settings from the cluster.

C# example:

    public class EcsRunTaskWithoutLaunchType : EcsRunTask
    {
        public EcsRunTaskWithoutLaunchType(Construct scope, string id, IEcsRunTaskProps props) : base(scope, id, props)
        {
        }

        protected EcsRunTaskWithoutLaunchType(ByRefValue reference) : base(reference)
        {
        }

        protected EcsRunTaskWithoutLaunchType(DeputyProps props) : base(props)
        {
        }

        public override JObject ToStateJson()
        {
            var stateJson = base.ToStateJson();

            ((JObject)stateJson["Parameters"]!).Remove("LaunchType");

            return stateJson;
        }
    }
danw-mpl commented 10 months ago

@Muppets I tried that initially and ran into an error from Step Functions - something like 'no container instances found in your cluster'.

I'd assume that's a problem with my cluster config though. Even though I had fargate capacity providers enabled.

pwrmiller commented 3 months ago

I've also run into this issue. Here is an implementation for Python that can be used, and is compatible with clusters that contain a default capacity provider and those that do not:

class EcsRunTakWithoutLaunchType(EcsRunTask):
    """Workaround for https://github.com/aws/aws-cdk/issues/7967"""

    def __init__(self, *args, **kwargs):
        self.cluster = kwargs.get('cluster')
        if not self.cluster:
            raise ValueError('cluster is required')
        super().__init__(*args, **kwargs)

    def to_state_json(self) -> typing.Mapping[typing.Any, typing.Any]:
        # if the cluster has a default capacity provider, we pop the LaunchType parameter
        # to use it by default (rather than attempting to use EC2 directly)
        state_json = super().to_state_json()

        if self.cluster.default_capacity_provider_strategy:
            state_json['Parameters'].pop('LaunchType')
            return state_json

        return state_json

The advantage here is that we can keep the EcsEc2LaunchTarget if the cluster has no default capacity provider defined.