aimanamri / yellow-taxi-trips-etl-data-engineering-project

Yellow Taxi Trips Data Analytics | Data Engineering Azure Project

Introduction

The "Yellow Taxi Trips Data Analytics" project builds an ETL pipeline over New York City's yellow taxi trip records and analyzes the results. I use Python, SQL, Azure services, and Power BI to process, analyze, and visualize the data.

Architecture

Technologies Used

  1. Python
  2. SQL
  3. Azure Data Factory
  4. Azure Databricks
  5. Azure Synapse Analytics
  6. Power BI

Dataset Used

  1. Source: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
  2. Data Dictionary: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

The data is published as one file per month for each year, so I wrote a simple Python script to download all the Parquet files and combine them by year. The combined dataset is stored in .parquet.gzip format to keep storage costs low. Because the full files are too large to store on GitHub (without Git LFS), the rows are filtered down for this side project and the reduced data is kept in CSV/Parquet format. Here, 20,000 randomly selected rows from each month are used.
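
The download, sample, and combine steps described above could look roughly like the sketch below. This is not the script from this repository. The URL pattern in `BASE_URL`, the helper names `monthly_urls` and `sample_year`, the fixed random seed, and the output file name are all assumptions for illustration; check the TLC source page for the current download links. It assumes pandas with pyarrow is installed.

```python
from pathlib import Path

# Assumption: monthly files follow this URL pattern (verify on the TLC page).
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"
ROWS_PER_MONTH = 20_000  # sample size per month, as described above


def monthly_urls(year: int) -> list[str]:
    """Return the 12 monthly Parquet URLs for one year (hypothetical helper)."""
    return [
        f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"
        for month in range(1, 13)
    ]


def sample_year(year: int, out_dir: Path, seed: int = 42) -> Path:
    """Sample rows from each month, combine them, and write one gzip Parquet file."""
    # Imported here so the URL helper above works without pandas installed.
    import pandas as pd  # requires pandas plus pyarrow

    frames = []
    for url in monthly_urls(year):
        month_df = pd.read_parquet(url)
        # Guard against months with fewer rows than the sample size.
        n = min(ROWS_PER_MONTH, len(month_df))
        frames.append(month_df.sample(n=n, random_state=seed))

    combined = pd.concat(frames, ignore_index=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"yellow_tripdata_{year}.parquet.gzip"
    combined.to_parquet(out_path, compression="gzip", index=False)
    return out_path
```

Calling `sample_year(2023, Path("data"))` would download the twelve monthly files for 2023 and write a single sampled file under `data/`. Each month is read fully into memory before sampling, which is simple but not the most memory-efficient approach.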

Data Model

Insights