
Realtime Data Streaming | End-to-End Data Engineering Project

Table of Contents

- Introduction
- System Architecture
- What You'll Learn
- Technologies
- Getting Started
- Watch the Video Tutorial

Introduction

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. Everything is containerized using Docker for ease of deployment and scalability.
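
To make the ingestion stage concrete, the sketch below shows how an Airflow task can fetch a record from a public API and publish it to Kafka. This is a minimal illustration, not the project's exact code: the API endpoint, broker address, topic name, and DAG id are all assumptions.

```python
# Sketch of the ingestion stage: an Airflow DAG that fetches a record from a
# public API and publishes it to Kafka. The endpoint, broker address, topic
# name, and DAG id are illustrative assumptions, not the project's exact code.
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from kafka import KafkaProducer


def stream_user_to_kafka():
    # Pull one random user record from the (assumed) source API.
    response = requests.get("https://randomuser.me/api/", timeout=10)
    user = response.json()["results"][0]

    # Publish the raw JSON record to the topic the Spark job will read.
    producer = KafkaProducer(
        bootstrap_servers=["broker:29092"],  # assumed Kafka service name/port
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("users_created", user)  # assumed topic name
    producer.flush()


with DAG(
    dag_id="user_automation",  # assumed DAG id
    start_date=datetime(2023, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_user_to_kafka,
    )
```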

System Architecture

*(System architecture diagram)*

The project is designed with the following components:

- Apache Airflow: orchestrates the pipeline and stores the ingested data in a PostgreSQL database.
- Apache Kafka and Zookeeper: stream the data to the processing engine in real time.
- Apache Spark: processes the streamed data with its master and worker nodes.
- Cassandra: stores the processed data.
- Docker: containerizes every component for easy deployment and scaling.
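
To show how the processing and storage stages fit together, here is a minimal sketch of a Spark Structured Streaming job that consumes the Kafka topic and writes parsed records into Cassandra. The topic, message schema, keyspace, and table names are assumptions for illustration, and the keyspace and table are assumed to already exist.

```python
# Sketch of the processing stage: Spark Structured Streaming reads the Kafka
# topic and writes parsed records to Cassandra. Topic, schema, keyspace, and
# table names are assumptions; connector jars are supplied via --packages.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("SparkDataStreaming")
    .config("spark.cassandra.connection.host", "cassandra")  # assumed host
    .getOrCreate()
)

# Expected shape of the JSON messages produced upstream (assumed fields).
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
])

users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")  # assumed broker
    .option("subscribe", "users_created")               # assumed topic
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Stream micro-batches into Cassandra via the Spark-Cassandra connector;
# the spark_streams keyspace and created_users table are assumed to exist.
query = (
    users.writeStream.format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
)
query.awaitTermination()
```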

What You'll Learn

- Setting up a data pipeline with Apache Airflow
- Real-time data streaming with Apache Kafka
- Distributed synchronization with Apache Zookeeper
- Data processing techniques with Apache Spark
- Data storage solutions with Cassandra and PostgreSQL
- Containerizing an entire data engineering setup with Docker

Technologies

- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker

Getting Started

  1. Clone the repository:

    git clone https://github.com/airscholar/e2e-data-engineering.git
  2. Navigate to the project directory:

    cd e2e-data-engineering
  3. Run Docker Compose to spin up the services (a sanity-check snippet follows this list):

    docker-compose up
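
Once the containers are running, a quick way to confirm that data is flowing end to end is to query Cassandra directly. This snippet is a sketch that assumes the keyspace and table names used in the examples above, and that Cassandra's default port 9042 is mapped to the host:

```python
# Optional sanity check: count the rows Spark has streamed into Cassandra.
# Assumes the spark_streams.created_users keyspace/table from the sketches
# above, and that Cassandra's port 9042 is exposed on localhost.
from cassandra.cluster import Cluster

cluster = Cluster(["localhost"])
session = cluster.connect()

row = session.execute("SELECT count(*) FROM spark_streams.created_users").one()
print(f"Rows streamed into Cassandra so far: {row[0]}")
```

If the count grows as the Airflow DAG runs, the whole chain from the API through Kafka and Spark into Cassandra is working.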

For more detailed instructions, please check out the video tutorial linked below.

Watch the Video Tutorial

For a complete walkthrough and practical demonstration, check out our YouTube Video Tutorial: https://www.youtube.com/watch?v=GqAcTrqKcrY