aws / aws-cdk-rfcs

RFCs for the AWS CDK
Apache License 2.0

AWS Glue L2 CDK Construct #497

Closed natalie-white-aws closed 1 year ago

natalie-white-aws commented 1 year ago

Overview

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Glue was released in August 2017.

Today, customers define Glue data sources, connections, jobs, and workflows for their data and ETL solutions via the AWS console, the AWS CLI, and Infrastructure as Code tools like CloudFormation and the CDK. However, they face challenges specifying the required and optional parameters for each job type, the networking constraints for data source connections, the secrets for JDBC connections, and least-privilege IAM Roles and Policies. We will build convenience methods working backwards from common use cases and default to recommended best practices.

This RFC proposes updates to the L2 construct for Glue which will provide convenience features and abstractions for the existing L1 (CloudFormation) constructs, building on the functionality already supported in the @aws-cdk/aws-glue-alpha module.

Full RFC in the PR here

Roles

| Role | User |
|------|------|
| Proposed by | @natalie-white-aws, @mjanardhan, @parag-shah-aws |
| Author(s) | @natalie-white-aws, @mjanardhan, @parag-shah-aws |
| API Bar Raiser | @TheRealAmazonKendra |

See RFC Process for details

Workflow


The author is responsible for progressing the RFC according to this checklist and applying the relevant labels to this issue so that the RFC table in the README stays up to date.

Create a Glue Job

The glue-alpha-module already supports three of the four common types of Glue Jobs: Spark (ETL and Streaming), Python Shell, and Ray. This RFC will add the more recent Flex Job. The construct also implements AWS best practice recommendations, such as enabling job metrics and continuous logging by default and sourcing JDBC connection credentials from Secrets Manager.

This RFC will introduce breaking changes to the existing glue-alpha-module to streamline the developer experience and introduce new constants and validations. As an opinionated construct, the Glue L2 construct will enforce best practices and not allow developers to create resources that use deprecated libraries and tool sets (e.g. deprecated versions of Python).

Optional and required parameters for each job will be enforced via interface rather than validation; see Glue's public documentation for more granular details.
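As a rough sketch of what enforcing parameters via interface could look like (the property names below are drawn from the examples in this RFC; the exact shape is illustrative, not a final API):

export interface ScalaSparkEtlJobProps {
    // Required props have no '?', so the TypeScript compiler rejects a job
    // definition that omits them; no runtime validation is needed.
    readonly script: glue.Code;        // job script location
    readonly className: string;        // Scala entry point class
    readonly role: iam.IRole;          // execution role
    // Optional props carry '?' and fall back to the defaults described below.
    readonly glueVersion?: glue.GlueVersion;
    readonly numberOfWorkers?: number;
    readonly description?: string;
}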

Spark Jobs

  1. ETL Jobs

ETL jobs support Python and Scala. The ETL job type supports the G.1X, G.2X, G.4X, and G.8X worker types, defaulting to G.2X, which customers can override. It will default to the best-practice Glue version 4.0 but allow developers to override to 3.0. We will also enable the following ETL features by default: --enable-metrics, --enable-spark-ui, and --enable-continuous-cloudwatch-log. You can find more details about versions, worker types, and other features in Glue's public documentation.

new glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', {
    script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
    className: 'com.example.HelloWorld',
    role: iam.IRole,
});

new glue.pySparkEtlJob(this, 'pySparkEtlJob', {
    script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
    role: iam.IRole,
});

Optionally, developers can override the glueVersion and add extra jars and a description:

new glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', {
   glueVersion: glue.GlueVersion.V3_0,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
   className: 'com.example.HelloWorld',
   extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar')],
   description: 'an example Scala Spark ETL job',
   numberOfWorkers: 20,
   role: iam.IRole,
});

new glue.pySparkEtlJob(this, 'pySparkEtlJob', {
   jobType: glue.JobType.ETL,
   glueVersion: glue.GlueVersion.V3_0,
   pythonVersion: glue.PythonVersion.THREE_NINE,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   description: 'an example pySpark ETL job',
   numberOfWorkers: 20,
   role: iam.IRole,
});
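
Under the hood, the flags enabled by default above would be passed through the job's default arguments; a rough sketch of that mapping (the exact keys and values the construct would set are assumptions):

const defaultArguments = {
    // Assumed mapping of the default-enabled features to Glue job arguments;
    // the construct's final behavior may differ.
    '--enable-metrics': 'true',
    '--enable-spark-ui': 'true',
    '--enable-continuous-cloudwatch-log': 'true',
};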
  2. Streaming Jobs

A Streaming job is similar to an ETL job, except that it performs ETL on data streams using the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs. These jobs will default to Python 3.9.

Like ETL jobs, streaming jobs support the Scala and Python languages. They support the G.1X and G.2X worker types and Glue versions 2.0, 3.0, and 4.0. We'll default to the G.2X worker type and Glue 4.0 for streaming jobs, which developers can override. We will enable --enable-metrics, --enable-spark-ui, and --enable-continuous-cloudwatch-log by default.

new glue.pySparkStreamingJob(this, 'pySparkStreamingJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   role: iam.IRole,
});

new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
   className: 'com.example.HelloWorld',
   role: iam.IRole,
});

Optionally, developers can override the glueVersion and add extraJars and a description:

new glue.pySparkStreamingJob(this, 'pySparkStreamingJob', {
   glueVersion: glue.GlueVersion.V3_0,
   pythonVersion: glue.PythonVersion.THREE_NINE,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   description: 'an example Python Streaming job',
   numberOfWorkers: 20,
   role: iam.IRole,
});

new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', {
   glueVersion: glue.GlueVersion.V3_0,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
   extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar')],
   className: 'com.example.HelloWorld',
   description: 'an example Scala Streaming job',
   numberOfWorkers: 20,
   role: iam.IRole,
});
  3. Flex Jobs

The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible job runs are supported for jobs using AWS Glue version 3.0 or later and the G.1X or G.2X worker types, and will default to the latest version of Glue (currently Glue 3.0). Similar to ETL jobs, we'll enable the following features by default: --enable-metrics, --enable-spark-ui, and --enable-continuous-cloudwatch-log.

new glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
   className: 'com.example.HelloWorld',
   role: iam.IRole,
});

new glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   role: iam.IRole,
});

Optionally, developers can override the Glue version and Python version, provide extra jars, and add a description:

new glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', {
   glueVersion: glue.GlueVersion.V3_0,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
   className: 'com.example.HelloWorld',
   extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar')],
   description: 'an example Scala Spark Flex ETL job',
   numberOfWorkers: 20,
   role: iam.IRole,
});

new glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', {
   glueVersion: glue.GlueVersion.V3_0,
   pythonVersion: glue.PythonVersion.THREE_NINE,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   description: 'an example Flex job',
   numberOfWorkers: 20,
   role: iam.IRole,
});

Python Shell Jobs

A Python shell job runs Python scripts as a shell and supports a Python version that depends on the AWS Glue version you are using. This can be used to schedule and run tasks that don't require an Apache Spark environment.

We'll default to Python 3.9. Python shell jobs have a MaxCapacity setting: developers can choose MaxCapacity = 0.0625 or MaxCapacity = 1, and by default MaxCapacity will be set to 0.0625. Python 3.9 supports preloaded analytics libraries via the library-set=analytics flag, and this feature will be enabled by default.

new glue.PythonShellJob(this, 'PythonShellJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   role: iam.IRole,
});

Optional overrides:

new glue.PythonShellJob(this, 'PythonShellJob', {
    glueVersion: glue.GlueVersion.V1_0,
    pythonVersion: glue.PythonVersion.THREE_NINE,
    script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
    description: 'an example Python Shell job',
    numberOfWorkers: 20,
    role: iam.IRole,
});

Ray Jobs

Glue Ray jobs only support the Z.2X worker type and Glue version 4.0. The runtime will default to Ray 2.3, and the minimum number of workers will default to 3.

new glue.GlueRayJob(this, 'GlueRayJob', {
    script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
    role: iam.IRole,
});

Developers can override the minimum number of workers and other Glue job fields:

new glue.GlueRayJob(this, 'GlueRayJob', {
  runtime: glue.Runtime.RAY_2_2,
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
  numberOfWorkers: 50,
  role: iam.IRole,
});

Uploading scripts from the same repo to S3

Similar to other L2 constructs, the Glue L2 will automate uploading / updating scripts to S3 via an optional fromAsset parameter pointing to a script in the local file structure. Developers will provide an existing S3 bucket and the path to which they'd like the script to be uploaded.

new glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', {
    script: glue.Code.fromAsset('bucket-name', 'local/path/to/scala-jar'),
    className: 'com.example.HelloWorld',
});

Workflow Triggers

In AWS Glue, developers can use workflows to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers. Standalone triggers are an anti-pattern, so we will only create triggers from within a workflow.

Within the workflow object, there will be functions to create different types of triggers with actions and predicates. Those triggers can then be added to jobs.

For all trigger types, the StartOnCreation property will be set to true by default, but developers will have the option to override it.

  1. On Demand Triggers

On-demand triggers can start Glue jobs or crawlers. We'll add convenience functions to create on-demand crawler or job triggers. The trigger method will take an optional description and will abstract away the required actions list by accepting job or crawler objects via conditional types.

const myWorkflow = new glue.Workflow(this, 'GlueWorkflow', {
    name: 'MyWorkflow',
    description: 'New Workflow',
    properties: { key: 'value' },
});

myWorkflow.onDemandTrigger(this, 'TriggerJobOnDemand', {
    description: 'On demand run for ' + glue.JobExecutable.name,
    actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
});
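
For example, the StartOnCreation default could be overridden per trigger (the startOnCreation property name is an assumption derived from that setting):

myWorkflow.onDemandTrigger(this, 'TriggerJobOnDemandPaused', {
    description: 'On demand run for ' + glue.JobExecutable.name,
    startOnCreation: false,   // assumed property; defaults to true
    actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
});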
  2. Scheduled Triggers

Scheduled triggers let developers run jobs on a cron schedule. We'll provide daily, weekly, and monthly convenience functions, as well as a custom function that allows developers to supply their own timing using the existing events Schedule object, without having to build their own cron expressions. (The L2 will extract the expression that Glue requires from the Schedule object.) The trigger method will take an optional description and a list of Actions, which can refer to jobs or crawlers via conditional types.

// Create Daily Schedule at 00 UTC
myWorkflow.dailyScheduleTrigger(this, 'TriggerCrawlerOnDailySchedule', {
    description: 'Scheduled run for ' + glue.JobExecutable.name,
    actions: [ jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
});

// Create Weekly schedule at 00 UTC on Sunday
myWorkflow.weeklyScheduleTrigger(this, 'TriggerJobOnWeeklySchedule', {
    description: 'Scheduled run for ' + glue.JobExecutable.name,
    actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
});

// Create Custom schedule, e.g. Monthly on the 7th day at 15:30 UTC
myWorkflow.customScheduleJobTrigger(this, 'TriggerCrawlerOnCustomSchedule', {
    description: 'Scheduled run for ' + glue.JobExecutable.name,
    actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...],
    schedule: events.Schedule.cron({ day: '7', hour: '15', minute: '30' })
});

3. Notify Event Triggers

Workflows are mandatory for this trigger type. There are two types of notify event triggers: batching and non-batching. For batching triggers, developers must specify BatchSize; for non-batching triggers, BatchSize will be set to 1. For both trigger types, BatchWindow will default to 900 seconds.

myWorkflow.notifyEventTrigger(this, 'MyNotifyTriggerBatching', {
    batchSize: int,
    jobActions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...],
    actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ... ]
});

myWorkflow.notifyEventTrigger(this, 'MyNotifyTriggerNonBatching', {
    actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
});

4. Conditional Triggers

Conditional triggers have a predicate and actions associated with them. When the predicateCondition is true, the trigger actions will be executed.

// Triggers on Job and Crawler status
myWorkflow.conditionalTrigger(this, 'conditionalTrigger', {
    description: 'Conditional trigger for ' + myGlueJob.name,
    actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...],
    predicateCondition: glue.TriggerPredicateCondition.AND,
    jobPredicates: [{ job: JobExecutable, state: glue.JobState.FAILED },
                    { job: JobExecutable, state: glue.JobState.SUCCEEDED }]
});

Connection Properties

A Connection allows Glue jobs, crawlers and development endpoints to access certain types of data stores.

* Secrets Management: the user specifies JDBC connection credentials in Secrets Manager and provides the Secrets Manager secret name as a property of the Job's connection configuration.
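
A minimal sketch of how this could look with the Connection construct already in aws-glue-alpha (the secret name, JDBC URL, and networking values are placeholders):

// Hypothetical example: JDBC credentials stay in Secrets Manager and only
// the secret name is passed to the Glue connection.
declare const vpc: ec2.IVpc;
declare const securityGroup: ec2.ISecurityGroup;

new glue.Connection(this, 'JdbcConnection', {
    type: glue.ConnectionType.JDBC,
    properties: {
        JDBC_CONNECTION_URL: 'jdbc:postgresql://db.example.com:5432/mydb',
        SECRET_ID: 'my-jdbc-credentials',   // Secrets Manager secret name
    },
    securityGroups: [securityGroup],
    subnet: vpc.privateSubnets[0],
});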

Public FAQ

What are we launching today?

We’re launching new features to an AWS CDK Glue L2 Construct to provide best-practice defaults and convenience methods to create Glue Jobs, Connections, Triggers, Workflows, and the underlying permissions and configuration.

Why should I use this Construct?

Developers should use this Construct to reduce the amount of boilerplate code and complexity each individual has to navigate, and make it easier to create best-practice Glue resources.

What’s not in scope?

Glue Crawlers and other resources that are now managed by the AWS Lake Formation team are not in scope for this effort. Developers should use existing methods to create these resources, and the new Glue L2 construct assumes they already exist as inputs. While best practice is for application and infrastructure code to live as close together as possible for teams using fully implemented DevOps mechanisms, in practice these ETL scripts will likely be managed by a data science team that knows Python or Scala and doesn't necessarily own or manage its own infrastructure deployments. We want to meet developers where they are and not assume that all of the code resides in the same repository; developers who do own both can automate this themselves via the CDK.

Validating Glue version and feature availability per AWS region at synth time is also not in scope. AWS' intention is for all features to eventually propagate to all global regions, so the complexity of creating and maintaining region-specific configuration to match shifting feature sets does not outweigh the low likelihood that a developer will use this construct to deploy a feature to a region that doesn't yet support it without first researching or manually trying that feature. The developer will, of course, still get feedback from the underlying Glue APIs as CloudFormation deploys the resources, similar to the current CDK L1 Glue experience.

mrpackethead commented 1 year ago

Hey, great to see this underway. :-) I'm really interested in this. I've been working on some L2 constructs for Lake Formation.

markusl commented 1 year ago

Hey! I was wondering if / how is this RFC going to address the issue of being able to configure CSV separator and header skip in a convenient way?

Related issue https://github.com/aws/aws-cdk/issues/23132

natalie-white-aws commented 1 year ago

> Hey! I was wondering if / how is this RFC going to address the issue of being able to configure CSV separator and header skip in a convenient way?
>
> Related issue aws/aws-cdk#23132

Hi @markusl - We're focused on the ETL side of things (Jobs, Workflows, and Triggers) for this RFC, rather than the data consumption side (Tables, Crawlers, Catalogs, etc).

mrgrain commented 1 year ago

alpha has been released for a while: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-glue-alpha-readme.html