cloudera-labs / envelope

Build configuration-driven ETL pipelines on Apache Spark
Apache License 2.0

Envelope

Envelope is a configuration-driven framework for Apache Spark that makes it easy to develop Spark-based data processing pipelines.

Envelope is simply a pre-made Spark application that implements many of the tasks commonly found in ETL pipelines. In many cases, Envelope allows large pipelines to be developed on Spark with no coding required. When custom code is needed, there are pluggable points in Envelope for core functionality to be extended. Envelope works in batch and streaming modes.

Some examples of what you can easily do with Envelope:

Get started

Requirements

Envelope requires Apache Spark 2.1.0 or above.

Additionally, if using these components, Envelope requires:

For Cloudera CDH 5, Kafka requires Cloudera's Kafka 2.1.0 or above, HBase and ZooKeeper require CDH 5.7 or above, and Impala requires CDH 5.9 or above. For Cloudera CDH 6, CDH 6.0 or above is required.

Downloading Envelope

Envelope and its dependencies can be downloaded as a single jar file from the GitHub repository Releases page.

Compiling Envelope

Alternatively, you can build the Envelope application from the top-level directory of the source code by running the Maven command:

mvn clean install

This will create envelope-0.7.0.jar in the build/envelope/target directory.
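
As a rough sketch of the build step, the helper functions below wrap the Maven command and check that the jar landed where expected. The function names are hypothetical, and `-DskipTests` is a standard Maven option added here only to shorten the build; it is not part of the instructions above, so drop it if you want the test suite to run.

```shell
# Hypothetical helper: path where the build is expected to place the jar.
# The version defaults to 0.7.0, as stated above.
envelope_jar_path() {
  echo "build/envelope/target/envelope-${1:-0.7.0}.jar"
}

# Hypothetical helper: build from the top-level source directory,
# then confirm that the jar exists.
build_envelope() {
  # -DskipTests is an assumption for a faster build, not a requirement.
  mvn clean install -DskipTests || return 1
  if [ -f "$(envelope_jar_path)" ]; then
    echo "Built $(envelope_jar_path)"
  else
    echo "Expected jar not found: $(envelope_jar_path)" >&2
    return 1
  fi
}
```

Calling `build_envelope` from the top-level directory of the source code runs the build and reports the jar location.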

Finding examples

Envelope provides these example pipelines that you can run for yourself:

Running Envelope

You can run Envelope by submitting it to Spark with the configuration file for your pipeline:

spark-submit envelope-0.7.0.jar your_pipeline.conf

Note: CDH 5 uses spark2-submit instead of spark-submit for Spark 2 applications such as Envelope.
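
If the same script has to work on both CDH 5 and other environments, one possible approach is to pick the launcher at run time. The sketch below assumes that the presence of `spark2-submit` on the `PATH` means it should be preferred. The function names and the `ENVELOPE_JAR` and `PIPELINE_CONF` variables are hypothetical, not part of Envelope.

```shell
# Hypothetical helper: prefer spark2-submit when it is on the PATH
# (as on CDH 5), otherwise fall back to spark-submit.
pick_spark_submit() {
  if command -v spark2-submit >/dev/null 2>&1; then
    echo "spark2-submit"
  else
    echo "spark-submit"
  fi
}

# Placeholder defaults; override them with your own jar and configuration file.
ENVELOPE_JAR="${ENVELOPE_JAR:-envelope-0.7.0.jar}"
PIPELINE_CONF="${PIPELINE_CONF:-your_pipeline.conf}"

# Hypothetical helper: submit the pipeline with whichever launcher was found.
run_envelope() {
  "$(pick_spark_submit)" "$ENVELOPE_JAR" "$PIPELINE_CONF"
}
```

Calling `run_envelope` then submits the pipeline with the selected launcher.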

A helpful place to monitor your running pipeline is the Spark UI for the job, which you can reach from the YARN ResourceManager UI.
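
When a browser is not at hand, the YARN command-line client can also list running applications. Its output includes a tracking URL column, which for Spark jobs typically leads to the Spark UI. This is a minimal sketch that assumes the `yarn` client is installed and configured for your cluster; the function name is hypothetical.

```shell
# Hypothetical helper: list applications currently running on YARN.
# Look for your pipeline's application name in the output and follow
# its tracking URL.
list_running_yarn_apps() {
  yarn application -list -appStates RUNNING
}
```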

More information

If you are ready for more, dive in: