Azure / config-driven-data-pipeline

Creative Commons Attribution 4.0 International
Config-Driven Data Pipeline

Why this solution

This repository illustrates the basic concept and implementation of a config-driven data pipeline. The configuration is a JSON file that describes the data sources, the data transformations, and the data curation. It is the only file that needs to be modified to change the data pipeline, so business users or an operations team can modify the pipeline without the help of a developer.
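As a rough illustration of the idea, the sketch below loads a small pipeline configuration from JSON and lists what each stage declares. The key names (`staging`, `standard`, `serving`, `name`, `format`, `location`, `sql`, `target`) and all values are assumptions chosen for this example and may not match the schema used in this repository.

```python
import json
from typing import Dict, List

# Hypothetical configuration: key names and values are illustrative
# assumptions, not necessarily the schema used by this repository.
CONFIG_TEXT = """
{
  "name": "fruit_data_app",
  "staging": [
    {"name": "sales_ingestion", "format": "csv", "location": "sales/", "target": "raw_sales"},
    {"name": "price_ingestion", "format": "csv", "location": "price/", "target": "raw_price"}
  ],
  "standard": [
    {"name": "fruit_sales_transform",
     "sql": "SELECT p.id, p.fruit, p.price, s.amount FROM raw_sales s JOIN raw_price p ON s.id = p.id",
     "target": "fruit_sales"}
  ],
  "serving": [
    {"name": "fruit_sales_total",
     "sql": "SELECT id, fruit, SUM(price * amount) AS total FROM fruit_sales GROUP BY id, fruit",
     "target": "fruit_sales_total"}
  ]
}
"""


def summarize(config: Dict) -> List[str]:
    """Return one line per task, grouped by stage, in pipeline order."""
    lines = []
    for stage in ("staging", "standard", "serving"):
        for task in config.get(stage, []):
            lines.append(f"{stage}: {task['name']} -> {task['target']}")
    return lines


config = json.loads(CONFIG_TEXT)
summary = summarize(config)
for line in summary:
    print(line)
```

Changing the pipeline in this model means editing only the JSON text: adding a source is one more entry under `staging`, and changing business logic is an edit to a `sql` string.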

This repository shows a simplified version of this solution based on Azure Databricks, Apache Spark, and Delta Lake. The configuration file is converted into an Azure Databricks job, which serves as the runtime of the data pipeline. The aim is to provide a low-code/no-code data application solution for business or operations teams.
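One way to picture the conversion from configuration to job is as a mapping from configured tasks to job tasks with dependencies. The sketch below builds a dictionary shaped like a Databricks Jobs API 2.1 job definition (`tasks`, `task_key`, `depends_on`, `notebook_task`). The configuration keys, the notebook path, the parameter names, and the rule that each task depends on every task of the previous stage are assumptions for illustration, not this repository's actual conversion logic.

```python
import json
from typing import Dict, List

STAGES = ("staging", "standard", "serving")


def build_job_payload(config: Dict, notebook_path: str) -> Dict:
    """Turn a stage-based config into a job definition with task dependencies."""
    tasks: List[Dict] = []
    previous_keys: List[str] = []
    for stage in STAGES:
        current_keys: List[str] = []
        for task in config.get(stage, []):
            key = f"{stage}_{task['name']}"
            tasks.append({
                "task_key": key,
                # Assumed rule: wait for all tasks of the previous non-empty stage.
                "depends_on": [{"task_key": k} for k in previous_keys],
                "notebook_task": {
                    "notebook_path": notebook_path,  # hypothetical runner notebook
                    "base_parameters": {"stage": stage, "task": task["name"]},
                },
                # A real job also needs compute settings per task; omitted here.
            })
            current_keys.append(key)
        if current_keys:
            previous_keys = current_keys
    return {"name": config["name"], "tasks": tasks}


# Hypothetical minimal config (names only) to exercise the function.
example_config = {
    "name": "fruit_data_app",
    "staging": [{"name": "sales_ingestion"}, {"name": "price_ingestion"}],
    "standard": [{"name": "fruit_sales_transform"}],
    "serving": [{"name": "fruit_sales_total"}],
}

payload = build_job_payload(example_config, "/Shared/pipeline_runner")
print(json.dumps(payload, indent=2))
```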

Background

medallion architecture

The medallion architecture, introduced by Databricks, describes a data pipeline with three stages: Bronze, Silver, and Gold. In many data platform projects, these stages are instead named Staging, Standard, and Serving.

Azure

The above shows a typical way to implement a data pipeline and data platform based on Azure Databricks.

Architecture

Inspired by Data Mesh, this solution aims to accelerate data pipeline implementation and reduce the response time to changing business needs. The goal is for business teams to own their data applications, while data engineers focus on the infrastructure and frameworks that support the business logic more efficiently.

The configurable data pipeline includes two parts

Example

The following example shows how to use the framework and a configuration file to build a data pipeline.

The goal is to build a data pipeline that calculates the total revenue of each fruit.

PoC

There are 2 data sources:

The configuration file describes the pipeline.

The pipeline includes three blocks: staging, standardization, and serving.

The staging block defines the data sources. The standardization block defines the transformation logic. The serving block defines the aggregation logic. Spark SQL is used in both the standardization block and the serving block: the former merges the price and sales data, and the latter aggregates the sales data.
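The three blocks can be sketched end to end with toy data. In the example below, SQLite (from the Python standard library) stands in for Spark SQL so that it runs without a Spark installation. The table names, column names, and rows are invented for illustration and are not the repository's sample data or queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Staging: land the two raw sources (toy rows, invented for this sketch).
conn.execute("CREATE TABLE raw_price (id INTEGER, fruit TEXT, price REAL)")
conn.execute("CREATE TABLE raw_sales (id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO raw_price VALUES (?, ?, ?)",
    [(1, "Red Grape", 2.0), (2, "Peach", 3.0)],
)
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [(1, 3), (1, 4), (2, 2)],
)

# Standardization: merge the price and sales data into one table.
conn.execute(
    """
    CREATE TABLE fruit_sales AS
    SELECT p.id, p.fruit, p.price, s.amount
    FROM raw_sales AS s
    JOIN raw_price AS p ON s.id = p.id
    """
)

# Serving: aggregate the merged sales data into revenue per fruit.
rows = conn.execute(
    """
    SELECT id, fruit, SUM(price * amount) AS total
    FROM fruit_sales
    GROUP BY id, fruit
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

In a config-driven setup, the two SQL statements would live in the configuration file rather than in code, which is what lets non-developers change the transformation or the aggregation.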

Run the batch-mode pipeline in a local PySpark environment:

python src/main.py --config-path ./example/pipeline_fruit_batch.json --working-dir ./tmp --show-result True --build-landing-zone True --cleanup-database True

Here is another example, this time a streaming-based data pipeline.

Run the streaming-mode pipeline in a local PySpark environment:

python src/main.py --config-path ./example/pipeline_fruit_streaming.json --working-dir ./tmp --await-termination 60 --show-result True --build-landing-zone True --cleanup-database True
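The flags used in the two commands above could be parsed roughly as follows. This is a hypothetical sketch of a command-line entry point, not the repository's `src/main.py`: the defaults, the `str_to_bool` helper, and the reading of `--await-termination` as a number of seconds are all assumptions. A custom boolean converter is used because the commands pass literal values such as `True`, which `argparse` would otherwise keep as non-empty strings that are always truthy.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert 'True'/'False' style command-line values to bool."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Flag names come from the commands in this README; defaults are assumed.
    parser = argparse.ArgumentParser(description="Config-driven pipeline runner (sketch)")
    parser.add_argument("--config-path", required=True)
    parser.add_argument("--working-dir", default="./tmp")
    parser.add_argument("--show-result", type=str_to_bool, default=False)
    parser.add_argument("--build-landing-zone", type=str_to_bool, default=False)
    parser.add_argument("--cleanup-database", type=str_to_bool, default=False)
    # Assumed to be a wait time in seconds for streaming runs.
    parser.add_argument("--await-termination", type=int, default=None)
    return parser


# Parse the streaming command's arguments from an explicit list.
args = build_parser().parse_args([
    "--config-path", "./example/pipeline_fruit_streaming.json",
    "--working-dir", "./tmp",
    "--await-termination", "60",
    "--show-result", "True",
    "--build-landing-zone", "True",
    "--cleanup-database", "True",
])
print(args)
```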

After the pipeline runs, the result is shown in the console.

id  fruit        total
4   Green Apple  45.0
7   Green Grape  36.0
5   Fiji Apple   56.0
1   Red Grape    24.0
3   Orange       28.0
6   Banana       17.0
2   Peach        39.0

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Legal Notices

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.