aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.59k stars 3.89k forks source link

(glue): Timeout and workerType missing validation for Ray job types #29612

Open haljarrett opened 6 months ago

haljarrett commented 6 months ago

Describe the bug

When developing Ray jobs via CDK, I bumped into the fact that specifying a timeout or not specifying a worker type on Job for Ray job types will succeed in synthesis, but fail deployment.

Expected Behavior

I'd expect an error to be thrown if workerType is not provided or timeout is provided, as both of these will cause deployment failures for Ray job types.

Current Behavior

Synthesis passes and failure occurs on deployment:

 UPDATE_FAILED [...] Timeout not supported for Ray jobs

or

CREATE_FAILED [...] Worker type cannot be null and only [Z.2X] worker types are supported for glueray jobs

Reproduction Steps

#!/usr/bin/env node
import * as cdk from 'aws-cdk-lib';
import { Code, GlueVersion, Job, JobExecutable, PythonVersion, Runtime, WorkerType } from '@aws-cdk/aws-glue-alpha';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'glue-job-issue-stack');
const job = new Job(stack, 'MyJob', {
  jobName: 'glue-issue-test-job',
  executable: JobExecutable.pythonRay({
    glueVersion: GlueVersion.V4_0,
    pythonVersion: PythonVersion.THREE_NINE,
    runtime: Runtime.RAY_TWO_FOUR,
    script: Code.fromAsset('lib/test-job.py')
  }),
  workerType: WorkerType.G_025X,
  workerCount: 1,
  timeout: cdk.Duration.minutes(10),
});

Possible Solution

Additional validation in packages/@aws-cdk/aws-glue-alpha/lib/job.ts would address this and provide a slightly smoother developer experience when working with ray job types. I would be interested in contributing a fix if triage determines this is valid bug.

Additional Information/Context

No response

CDK CLI Version

2.133.0

Framework Version

2.133.0

Node.js Version

20.10.0

OS

macOS

Language

TypeScript

Language Version

5.3.3

Other information

No response

tim-finnigan commented 6 months ago

Thanks for creating this issue as well as the PR. I think adding validation as you described makes sense if the deployment is failing. Linking Ray job documentation for reference:

Did not see much on the topics in this issue, maybe more needs to be added on that.

haljarrett commented 6 months ago

Hey Tim, thanks for attaching the docs there - in the same vein, this is also not noted in:

The worker type constraint is mentioned, but for both, timeout is just listed as an optional parameter, no mention that it must be omitted in some circumstances. Having the only documentation be in the request failure message is not ideal, but not sure if it's something that even can be enforced at the L1 / Cfn level, so validating at the L2 construct level should hopefully avoid some frustration.