aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.47k stars 3.82k forks source link

batch: Support multi-container job definitions #29326

Open ramblingenzyme opened 5 months ago

ramblingenzyme commented 5 months ago

Describe the feature

https://docs.aws.amazon.com/batch/latest/userguide/multi-container-jobs.html This indicates that both the AWS CLI/API and Cloudformation support this feature.

The README for the Batch CDK module also seems to indicate that this is possible. If it only means to talk about multi-node job definitions, that isn't clear.

image

However, I've also noticed the modified CloudFormation properties aren't documented, so I've also raised https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/1941, which I assume would be a blocker.

Use Case

I want to run the Datadog Agent in a sidecar container so we can instrument our Batch Jobs with APM and ship logs to Datadog from our Batch jobs via the agent rather than through a subscription on the Log Group

Proposed Solution

No response

Other Information

No response

Acknowledgements

CDK version used

any

Environment details (OS name and version, etc.)

N/A

pahud commented 5 months ago

Yes looks like CFN only accepts single container properties https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-batch-jobdefinition-containerproperties.html

I will create an internal ticket for clarifying.

Internal tracking: V1279131720

ramblingenzyme commented 5 months ago

Thanks so much @pahud for being on top of this. My coworker figured out this feature was released the day before or just a few days before I raised the issue, so the quick response is appreciated. 😅

pahud commented 5 months ago

Self-assigned. We're still investigating.

pahud commented 5 months ago

Hi

EcsProperties is now available in the CFN document. We can define multiple-contianer for ECS in that prop now.

It should technically work but I guess we probably need to add some samples in the CDK doc.

Let me know if it works for you.

github-actions[bot] commented 5 months ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

luxaritas commented 5 months ago

This still needs to be exposed on the L2 construct, doesn't it?

ramblingenzyme commented 5 months ago

@pahud can you confirm if there should be another ticket opened to track L2 support for multi-container Batch Job Definitions, or if we can track it here?

We're working on implementing this with the L1 constructs at the moment, so we'll update you on whether they work.

phcyso commented 4 months ago

This does work ok with the L1 constructs. with a few annoyances. Particularly around fargate resources and environment vars.

for example, to add a datadog-agent container you must specify an amount of CPU and memory, but the totals on all containers must add up to a valid fargate vcpu/memory amount. This means that for a job definition you end up with something long like:


...

    const totalCpu = props.batchJobCpu || FARGATE_DEFAULT_CPU;
    const totalMem = props.batchJobMemory || FARGATE_DEFAULT_MEMORY;

    const ddCpu = 0.1;
    const ddMem = 128;

    const appCpu = totalCpu - ddCpu;
    const appMem = totalMem - ddMem;

  new batch.CfnJobDefinition(this,`JobDefinition`, {
        type: "container",
        platformCapabilities: ["FARGATE"],
        ecsProperties: {
          taskProperties: [
            runtimePlatform: {
                operatingSystemFamily: "LINUX",
                cpuArchitecture: "X86_64",
              },
             containers: [
                {
                  name: "datadog-agent",
                  image: "public.ecr.aws/datadog/agent:latest",
                  secrets: [
                    {
                      name: "DD_API_KEY",
                      valueFrom: ddApiKeySecret.secretArn,
                    },
                  ],
                  environment: [
                      {
                        name: "DD_SERVICE",
                        value: `service-name`,
                      },
                      {
                        name: "DD_ENV",
                        value: props.environment,
                      },
                      {
                        name: "ECS_FARGATE",
                        value: "true",
                      },
                      {
                        name: "DD_APM_ENABLED",
                        value: "true",
                      },
                ],
                  resourceRequirements: [
                    {
                      type: "MEMORY",
                      value: ddMem.toString(),
                    },
                    {
                      type: "VCPU",
                      value: ddCpu.toString(),
                    },
                  ],
            } ,{
                  name: "batchJob",
                  essential: true,
                  dependsOn: [
                    { condition: "START", containerName: "datadog-agent" },
                  ],
                  resourceRequirements: [
                    {
                      type: "MEMORY",
                      value: appMem.toString(),
                    },
                    {
                      type: "VCPU",
                      value: appCpu.toString(),
                    },
                  ],
                  image: "node:bookworm-slim",
                  },
                },
            }
       ]
       ...
})
pahud commented 2 months ago

This still needs to be exposed on the L2 construct, doesn't it?

Yes it would be great if we can expose that on the L2. I am making this a feature request and we welcome any PRs.