dimajix / flowman

Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.
https://flowman.io
Apache License 2.0

Flowman

The declarative data build tool based on Apache Spark.


🤔 What is Flowman?

Flowman is a data build tool based on Apache Spark that simplifies implementing data transformation logic as part of complex data pipelines. Flowman follows a strict "everything-as-code" approach, where the whole transformation logic is specified in purely declarative YAML files. These describe all details of the data sources, sinks, and data transformations. This is much simpler and more efficient than writing Spark jobs in Scala or Python. Flowman takes care of the technical details of a correct and robust implementation, so developers can concentrate on the data transformations themselves.
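To make the declarative approach more concrete, the following is a minimal sketch of what such a YAML specification might look like, loosely modeled on the weather example shipped with Flowman. All entity names (measurements_raw, measurements_extracted, measurements, main), file paths, and column offsets are illustrative assumptions, and the exact keys available for each kind should be checked against the official documentation.

```yaml
# Hypothetical sketch: names, paths, and column offsets are illustrative assumptions.

relations:
  # Data source: raw text files, one record per line
  measurements_raw:
    kind: file
    format: text
    location: "/tmp/weather/raw/"            # assumed input path
    schema:
      kind: inline
      fields:
        - name: raw_data
          type: string

  # Data sink: Parquet files holding the extracted columns
  measurements:
    kind: file
    format: parquet
    location: "/tmp/weather/measurements/"   # assumed output path

mappings:
  # Read all records from the source relation
  measurements_raw:
    kind: relation
    relation: measurements_raw

  # Transformation: derive typed columns using SQL expressions
  measurements_extracted:
    kind: select
    input: measurements_raw
    columns:
      station: "SUBSTR(raw_data, 5, 6)"
      air_temperature: "CAST(SUBSTR(raw_data, 88, 5) AS FLOAT) / 10"

targets:
  # Write the result of the mapping into the sink relation
  measurements:
    kind: relation
    mapping: measurements_extracted
    relation: measurements

jobs:
  # A job bundles the targets that should be built together
  main:
    targets:
      - measurements
```

In this sketch, relations describe where data lives, mappings describe how it is transformed, targets connect a mapping to a sink, and a job groups the targets to execute. No Spark code is written by hand.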

In addition to writing and executing data transformations, Flowman can also be used for managing physical data models, i.e. Hive or SQL tables. Flowman can create such tables from a specification with the correct schema and also automatically perform migrations. This helps to keep all aspects (like transformations and schema information) in a single place managed by a single tool.

Flowman Diagram

💪 Notable Features

💾 Supported Data Sources and Sinks

Flowman supports a wide range of data sources and sinks.

For file-based sources and sinks, Flowman supports commonly used file formats such as CSV, JSON, Parquet, and many more. The official documentation provides an overview of supported connectors.

📚 Documentation

You can find the official homepage at Flowman.io and comprehensive documentation at Read the Docs.

🤓 How do I use Flowman?

1. Install Flowman

You can set up Flowman by following our step-by-step instructions for local installations or by starting a Docker container.

2. Create a Project

Flowman provides some example projects in the examples subdirectory, which you can use as a starting point.

3. Execute the Project

You can execute the project interactively by starting the Flowman Shell.

🚀 Installation

You can simply grab an appropriate pre-built package from GitHub, or you can use a Docker image, which is available on Docker Hub. More details are described in the Quickstart Guide and in the official Flowman documentation.

🏗 Building

You can build your own Flowman version with Maven:

mvn clean install

Please also read BUILDING.md for detailed instructions, specifically on build profiles.

💙 Community

😍 Contributing

Want to contribute to Flowman? Welcome! Please read CONTRIBUTING.md to understand how you can contribute to the project.

📄 License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.