Closed natalie-white-aws closed 1 year ago
Hey, great to see this underway. :-) I'm really interested in this. I've been working on some L2 constructs for Lake Formation.
Hey! I was wondering if / how is this RFC going to address the issue of being able to configure CSV separator and header skip in a convenient way?
Related issue https://github.com/aws/aws-cdk/issues/23132
Hi @markusl - We're focused on the ETL side of things (Jobs, Workflows, and Triggers) for this RFC, rather than the data consumption side (Tables, Crawlers, Catalogs, etc).
alpha has been released for a while: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-glue-alpha-readme.html
Overview
AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Glue launched in August 2017.
Today, customers define Glue data sources, connections, jobs, and workflows to define their data and ETL solutions via the AWS console, the AWS CLI, and Infrastructure as Code tools like CloudFormation and the CDK. However, they have challenges defining the required and optional parameters depending on job type, networking constraints for data source connections, secrets for JDBC connections, and least-privilege IAM Roles and Policies. We will build convenience methods working backwards from common use cases and default to recommended best practices.
This RFC proposes updates to the L2 construct for Glue which will provide convenience features and abstractions for the existing L1 (CloudFormation) Constructs building on the functionality already supported in the @aws-cdk/aws-glue-alpha module.
Full RFC in the PR here
Create a Glue Job
The glue-alpha-module already supports three of the four common types of Glue Jobs: Spark (ETL and Streaming), Python Shell, and Ray. This RFC will add the more recent Flex job type. The construct also implements AWS best-practice recommendations, such as:
This RFC will introduce breaking changes to the existing glue-alpha-module to streamline the developer experience and introduce new constants and validations. As an opinionated construct, the Glue L2 construct will enforce best practices and not allow developers to create resources that use deprecated libraries and tool sets (e.g. deprecated versions of Python).
Optional and required parameters for each job will be enforced via interface rather than validation; see Glue's public documentation for more granular details.
Spark Jobs
ETL jobs support the Python and Scala languages. The ETL job type supports the G.1X, G.2X, G.4X, and G.8X worker types, with G.2X as the default, which customers can override. It will default to the best-practice Glue version 4.0, but allow developers to override it to 3.0. We will also default to enabling the following best-practice ETL features:
--enable-metrics, --enable-spark-ui, --enable-continuous-cloudwatch-log.
You can find more details about versions, worker types, and other features in Glue's public documentation. Optionally, developers can override the glueVersion and add extra jars and a description:
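A sketch of how defining an ETL job might look under this proposal (the construct and property names here are illustrative assumptions, not the final API):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as glue from '@aws-cdk/aws-glue-alpha';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'GlueEtlStack');

// Hypothetical ETL job construct; assumed defaults: Glue 4.0, G.2X workers,
// metrics, Spark UI, and continuous CloudWatch logging enabled.
new glue.PySparkEtlJob(stack, 'EtlJob', {
  script: glue.Code.fromAsset('scripts/process_events.py'),
  // Optional overrides:
  glueVersion: glue.GlueVersion.V3_0,
  extraJars: [glue.Code.fromAsset('jars/deps.jar')],
  description: 'Nightly ETL over raw events',
});
```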
A streaming job is similar to an ETL job, except that it performs ETL on data streams using the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs. These jobs will default to Python 3.9.
Similar to ETL jobs, streaming jobs support the Scala and Python languages, the G.1X and G.2X worker types, and Glue versions 2.0, 3.0, and 4.0. We'll default to a G.2X worker and Glue version 4.0 for streaming jobs, which developers can override. We will enable
--enable-metrics, --enable-spark-ui, --enable-continuous-cloudwatch-log.
Optionally, developers can override the glueVersion and add extraJars and a description:
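For example, a streaming job definition might look like this (names are assumptions, not the final API):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as glue from '@aws-cdk/aws-glue-alpha';

const stack = new cdk.Stack(new cdk.App(), 'GlueStreamingStack');

// Hypothetical streaming job construct; assumed defaults: Glue 4.0, G.2X worker.
new glue.PySparkStreamingJob(stack, 'StreamingJob', {
  script: glue.Code.fromAsset('scripts/consume_stream.py'),
  glueVersion: glue.GlueVersion.V3_0, // override the assumed 4.0 default
  description: 'Consume clickstream events',
});
```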
Flex Jobs
The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible job runs are supported for jobs using AWS Glue version 3.0 or later and the G.1X or G.2X worker types, and will default to the latest version of Glue (currently Glue 3.0). Similar to ETL, we'll enable these features: --enable-metrics, --enable-spark-ui, --enable-continuous-cloudwatch-log.
Optionally, developers can override the Glue version, the Python version, provide extra jars, and a description:
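An illustrative sketch of a Flex job definition (construct name and properties are assumptions):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as glue from '@aws-cdk/aws-glue-alpha';

const stack = new cdk.Stack(new cdk.App(), 'GlueFlexStack');

// Hypothetical Flex ETL job construct; runs on the FLEX execution class
// instead of STANDARD, trading startup latency for lower cost.
new glue.PySparkFlexEtlJob(stack, 'FlexJob', {
  script: glue.Code.fromAsset('scripts/backfill.py'),
  description: 'One-time, non-urgent backfill',
});
```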
Python Shell Jobs
A Python shell job runs Python scripts as a shell and supports a Python version that depends on the AWS Glue version you are using. This can be used to schedule and run tasks that don't require an Apache Spark environment.
We'll default to `PythonVersion.3_9`. Python shell jobs have a `MaxCapacity` feature. Developers can choose `MaxCapacity = 0.0625` or `MaxCapacity = 1`. By default, `MaxCapacity` will be set to `0.0625`. Python 3.9 supports preloaded analytics libraries via the `library-set=analytics` flag, and this feature will be enabled by default.
Optional overrides:
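A sketch of a Python shell job with the overrides described above (names are illustrative assumptions):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as glue from '@aws-cdk/aws-glue-alpha';

const stack = new cdk.Stack(new cdk.App(), 'GlueShellStack');

// Hypothetical Python shell job; assumed defaults: Python 3.9,
// MaxCapacity 0.0625, library-set=analytics enabled.
new glue.PythonShellJob(stack, 'ShellJob', {
  script: glue.Code.fromAsset('scripts/small_task.py'),
  // Optional override: use a full DPU instead of the 0.0625 default.
  maxCapacity: 1,
});
```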
Ray Jobs
Glue Ray jobs only support the Z.2X worker type and Glue version 4.0. The runtime will default to `Ray2.3`, and minimum workers will default to 3. Developers can override the minimum workers and other Glue job fields:
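An illustrative Ray job sketch (construct name and properties are assumptions):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as glue from '@aws-cdk/aws-glue-alpha';

const stack = new cdk.Stack(new cdk.App(), 'GlueRayStack');

// Hypothetical Ray job; assumed defaults: Z.2X workers, Glue 4.0,
// Ray2.3 runtime, 3 minimum workers.
new glue.RayJob(stack, 'RayJob', {
  script: glue.Code.fromAsset('scripts/ray_task.py'),
  numberOfWorkers: 5, // override the assumed default of 3
});
```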
Uploading scripts from the same repo to S3
Similar to other L2 constructs, the Glue L2 will automate uploading / updating scripts to S3 via an optional fromAsset parameter pointing to a script in the local file structure. Developers will provide an existing S3 bucket and the path to which they'd like the script to be uploaded.
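A sketch of the described flow; `scriptBucket` and `scriptKey` are hypothetical properties invented here to illustrate the "existing bucket plus destination path" idea:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as glue from '@aws-cdk/aws-glue-alpha';

const stack = new cdk.Stack(new cdk.App(), 'GlueAssetStack');

// An existing bucket the team already owns.
const scriptBucket = s3.Bucket.fromBucketName(stack, 'ScriptBucket', 'my-glue-scripts');

new glue.PySparkEtlJob(stack, 'AssetJob', {
  // fromAsset uploads/updates the local script on each deploy (illustrative).
  script: glue.Code.fromAsset('glue/etl.py'),
  scriptBucket,             // hypothetical: target bucket
  scriptKey: 'jobs/etl.py', // hypothetical: destination path in that bucket
});
```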
Workflow Triggers
In AWS Glue, developers can use workflows to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers. Standalone triggers are an anti-pattern, so we will only create triggers from within a workflow.
Within the workflow object, there will be functions to create different types of triggers with actions and predicates. Those triggers can then be added to jobs.
For all trigger types, the StartOnCreation property will be set to true by default, but developers will have the option to override it.
1. On-Demand Triggers
On-demand triggers can start Glue jobs or crawlers. We'll add convenience functions to create on-demand crawler or job triggers. The trigger method will take an optional description, but abstract away the required actions list by accepting the job or crawler objects directly via conditional types.
2. Scheduled Triggers
Scheduled triggers let developers trigger jobs using cron expressions. We'll provide daily, weekly, and monthly convenience functions, as well as a custom function that allows developers to define their own timing using the existing event `Schedule` object without having to build their own cron expressions. (The L2 will extract the expression that Glue requires from the `Schedule` object.) The trigger method will take an optional description and a list of actions, which can refer to jobs or crawlers via conditional types.
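At their core, the daily/weekly/monthly conveniences reduce to building the `cron(...)` expression string that Glue schedule triggers expect. A minimal sketch (helper names are assumptions, not the proposed API):

```typescript
// Glue schedule triggers expect expressions of the form "cron(<six fields>)".
// These hypothetical helpers mirror the daily/weekly/monthly conveniences.
function dailySchedule(hourUtc: number, minute: number = 0): string {
  return `cron(${minute} ${hourUtc} * * ? *)`;
}

function weeklySchedule(dayOfWeek: string, hourUtc: number, minute: number = 0): string {
  return `cron(${minute} ${hourUtc} ? * ${dayOfWeek} *)`;
}

function monthlySchedule(dayOfMonth: number, hourUtc: number, minute: number = 0): string {
  return `cron(${minute} ${hourUtc} ${dayOfMonth} * ? *)`;
}

// e.g. dailySchedule(1) → "cron(0 1 * * ? *)"
```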
3. Notify Event Triggers
Workflows are mandatory for this trigger type. There are two types of notify event triggers: batching and non-batching. For batching triggers, developers must specify `BatchSize`, but for non-batching triggers, `BatchSize` will be set to 1. For both trigger types, `BatchWindow` will default to 900 seconds.
4. Conditional Triggers
Conditional triggers have a predicate and actions associated with them. When the predicateCondition is true, the trigger actions will be executed.
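A sketch of a conditional trigger wired through a workflow (all names here are illustrative assumptions about the proposed API):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as glue from '@aws-cdk/aws-glue-alpha';

const stack = new cdk.Stack(new cdk.App(), 'GlueWorkflowStack');

const extractJob = new glue.PySparkEtlJob(stack, 'Extract', {
  script: glue.Code.fromAsset('scripts/extract.py'),
});
const transformJob = new glue.PySparkEtlJob(stack, 'Transform', {
  script: glue.Code.fromAsset('scripts/transform.py'),
});

// Hypothetical workflow API: run Transform only after Extract succeeds.
const workflow = new glue.Workflow(stack, 'EtlWorkflow');
workflow.addConditionalTrigger({
  predicate: { jobs: [{ job: extractJob, state: glue.JobState.SUCCEEDED }] },
  actions: [{ job: transformJob }],
});
```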
Connection Properties
A `Connection` allows Glue jobs, crawlers, and development endpoints to access certain types of data stores.
* Secrets Management: Users must store JDBC connection credentials in Secrets Manager and provide the Secrets Manager key name as a property on the job's connection properties.
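A sketch of a JDBC connection that references credentials by secret name rather than embedding them (the URL and secret name are placeholder assumptions):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import * as glue from '@aws-cdk/aws-glue-alpha';

const stack = new cdk.Stack(new cdk.App(), 'GlueConnectionStack');

// Credentials live in Secrets Manager; only the secret's name is handed to Glue.
const jdbcSecret = secretsmanager.Secret.fromSecretNameV2(
  stack, 'JdbcSecret', 'prod/warehouse-jdbc');

new glue.Connection(stack, 'JdbcConnection', {
  type: glue.ConnectionType.JDBC,
  properties: {
    JDBC_CONNECTION_URL: 'jdbc:postgresql://warehouse.example.com:5432/analytics',
    SECRET_ID: jdbcSecret.secretName, // Glue resolves credentials at run time
  },
});
```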
Public FAQ
What are we launching today?
We’re launching new features to an AWS CDK Glue L2 Construct to provide best-practice defaults and convenience methods to create Glue Jobs, Connections, Triggers, Workflows, and the underlying permissions and configuration.
Why should I use this Construct?
Developers should use this Construct to reduce the amount of boilerplate code and complexity each individual has to navigate, and make it easier to create best-practice Glue resources.
What’s not in scope?
Glue Crawlers and other resources that are now managed by the AWS Lake Formation team are not in scope for this effort. Developers should use existing methods to create these resources, and the new Glue L2 construct assumes they already exist as inputs. While best practice is for application and infrastructure code to live as close together as possible for teams using fully implemented DevOps mechanisms, in practice these ETL scripts will likely be managed by a data science team who know Python or Scala and don't necessarily own or manage their own infrastructure deployments. We want to meet developers where they are and not assume that all of the code resides in the same repository; developers who do own both can still automate this themselves via the CDK.
Validating Glue version and feature availability per AWS region at synth time is also not in scope. AWS's intention is for all features to eventually propagate to all global regions, so the complexity of creating and maintaining region-specific configuration to match shifting feature sets does not outweigh the small likelihood that a developer will use this construct to deploy a feature to a region that doesn't yet support it, without first researching or manually trying that feature. The developer will, of course, still get feedback from the underlying Glue APIs as CloudFormation deploys the resources, similar to the current CDK L1 Glue experience.