
DEC Project 1 - Chicago Crime

Project Context and Goals

The goal of this project was to create an end-to-end pipeline that extracts data from a constantly updating (live) dataset and loads it into a relational database, imitating a real-life data warehouse. Transformations are performed within the pipeline prior to loading (ETL) and also after loading (ELT), to simulate the data manipulation requirements of real-time applications.

The data should be loaded in an easy-to-use format so that data analysts (DA) and data scientists (DS) can immediately begin working with it without committing a sizeable amount of time to data manipulation. This means the data should be loaded into several distinct tables that can easily be merged with each other and that contain intuitive column names and data types for each record.

The DA/DS users should not need to be concerned with the accuracy of each data record or the process by which each record is inserted or updated within the database. The pipeline should abstract and automate these processes as much as possible.

Business Objective

The objective of our project is to provide analytical datasets built from the Chicago crime API and the supporting police station, ward office, and holiday datasets. The datasets are clean, regularly updated, and supplemented with several custom views generated from these datasets.

Consumers

The users of our data are Chicago city budget and policy analysts and officials. Our data is useful for getting up-to-date information on crimes that occur throughout Chicago. Our users would primarily access the data through SQL queries and the custom SQL views in the database. This data can also be used to create dashboards with maps and metrics.

Questions

There are three general question categories our data addresses: broad questions about crime in the city of Chicago, questions about crime incidents during holidays, and crime incidents within police districts.

Examples of these questions include but are not limited to:

Our data will allow our users to make resourcing and policy decisions based on historical and newly generated data.

Source datasets

The chosen live dataset for the project was the Chicago Crimes One Year Prior to Present dataset, which stores all reported crime incidents in the city of Chicago, Illinois. This public dataset provides the following advantages:

In addition to the Chicago Crimes data, we used static datasets for 2023 and 2024 holidays, Chicago police stations and Chicago ward offices.

Solution Architecture Diagram

Below is the solution architecture description and diagram, illustrating the key components and their interactions in our project.

DEC Project 1 Architecture

ELT/ETL Techniques Applied

We employed Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) techniques to efficiently process and transform raw data into a format suitable for analysis.

Extraction Pattern

We are using a live dataset that updates periodically (6 days/week). Our pipeline first checks whether the database exists. If it does not (i.e., the first time the code runs), the pipeline extracts the data one week at a time, based on the date_of_occurrence field, until all data has completed the ETL process and has been loaded into our database, which is hosted and managed on AWS RDS. This serves as a backfill of the database. If the database does exist, the pipeline identifies the maximum updated_at value in the database and extracts data from that date through today. The extraction pipeline is scheduled to run daily to check whether the data has been updated.
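To make the pattern concrete, here is a minimal sketch of the backfill/incremental decision, assuming a Socrata-style endpoint queried with requests and pandas. The function names, endpoint URL, field names, and window logic are illustrative rather than the project's actual code.

```python
from datetime import date, datetime, timedelta
from typing import Optional

import pandas as pd
import requests

# Assumed Socrata-style endpoint for the Chicago crime dataset (illustrative).
API_URL = "https://data.cityofchicago.org/resource/x2n5-8w5q.json"


def extract_window(start: datetime, end: datetime, app_token: str,
                   date_field: str = "date_of_occurrence") -> pd.DataFrame:
    """Pull one window of crime records filtered on the given date field."""
    params = {
        "$where": (
            f"{date_field} >= '{start.isoformat()}' "
            f"AND {date_field} < '{end.isoformat()}'"
        ),
        "$limit": 50000,
        "$$app_token": app_token,
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def extract(database_exists: bool, max_updated_at: Optional[datetime],
            app_token: str) -> pd.DataFrame:
    if not database_exists:
        # Backfill: walk forward one week at a time until reaching today.
        frames = []
        start = datetime.now() - timedelta(days=365)
        while start.date() <= date.today():
            frames.append(extract_window(start, start + timedelta(days=7), app_token))
            start += timedelta(days=7)
        return pd.concat(frames, ignore_index=True)
    # Incremental: pull everything updated since the stored high-water mark
    # (the "updated_at" field name here is an assumption).
    return extract_window(max_updated_at, datetime.now(), app_token,
                          date_field="updated_at")
```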

Data Transformation Patterns

ETL

There are two places where we apply data transformation patterns. The first is after the extraction of the crime data and the import of the .csv data. We used Pandas to drop columns, rename columns, and generate the holiday dataframe, as sketched below.
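A hedged sketch of this Pandas step follows; the specific column names and .csv layout are assumptions for illustration, not the project's exact schema.

```python
import pandas as pd


def transform_crimes(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop unneeded columns and rename the rest to intuitive names."""
    df = raw.drop(columns=["x_coordinate", "y_coordinate", "location"], errors="ignore")
    return df.rename(columns={"case_": "case_number", "date_of_occurrence": "occurred_at"})


def build_holidays(csv_path: str) -> pd.DataFrame:
    """Load the static 2023/2024 holiday .csv into a dataframe with typed dates."""
    holidays = pd.read_csv(csv_path, parse_dates=["date"])
    return holidays.rename(columns={"name": "holiday_name"})
```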

ELT

The second set of transformations happens after the data has been loaded into the database. We use SQL templates to generate views in the database. These transformations include CTEs, joins, grouping, sorting, and aggregation functions, and they result in several views in the database; a sketch of this step follows the ERD. Our ERD diagram for the tables and views can be seen below:

DEC Project 1 Architecture
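As an illustration of the ELT step, here is a minimal sketch that renders a SQL template and creates a view, assuming Jinja2 for templating and SQLAlchemy for execution. The view name, table names, and columns are assumptions for illustration only.

```python
from jinja2 import Template
from sqlalchemy import create_engine, text

# Illustrative view definition: a CTE, an aggregation, a join, and sorting.
VIEW_TEMPLATE = Template(
    """
    create or replace view {{ view_name }} as
    with district_counts as (
        select district, primary_description, count(*) as incident_count
        from crimes
        group by district, primary_description
    )
    select p.district_name, d.primary_description, d.incident_count
    from district_counts d
    join police_stations p on p.district = d.district
    order by d.incident_count desc
    """
)


def create_view(connection_url: str) -> None:
    """Render the SQL template and execute it against the warehouse."""
    engine = create_engine(connection_url)
    with engine.begin() as conn:
        conn.execute(text(VIEW_TEMPLATE.render(view_name="crimes_by_district")))
```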

Data Loading Patterns

Our pipeline extracts, transforms, and loads one week's worth of data at a time until the database has been completely backfilled. On subsequent runs, new and updated records are upserted, as sketched below.
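Here is a hedged sketch of the chunked load with an upsert, assuming PostgreSQL and SQLAlchemy; the table name and key handling are illustrative rather than the project's actual code.

```python
import pandas as pd
from sqlalchemy import MetaData, Table, create_engine
from sqlalchemy.dialects.postgresql import insert


def upsert_chunk(df: pd.DataFrame, connection_url: str, table_name: str = "crimes") -> None:
    """Insert one chunk of rows, updating existing rows that share a primary key."""
    engine = create_engine(connection_url)
    table = Table(table_name, MetaData(), autoload_with=engine)

    stmt = insert(table).values(df.to_dict(orient="records"))
    # On conflict with the primary key, overwrite every non-key column.
    update_cols = {c.name: stmt.excluded[c.name] for c in table.columns if not c.primary_key}
    stmt = stmt.on_conflict_do_update(
        index_elements=[c.name for c in table.primary_key.columns],
        set_=update_cols,
    )
    with engine.begin() as conn:
        conn.execute(stmt)
```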

Data Flow Chart

For more details on project data flow, please see the Chicago Crime Project Flowchart pdf.

Limitations and Lessons Learned

Throughout the course of the project, we encountered various challenges and gained valuable insights. Here are some key lessons learned during the project:

Installation and Running Instructions

1. Install PostgreSQL and pgAdmin

Download PostgreSQL and pgAdmin by going to https://www.postgresql.org/download/ and selecting the installer for your Operating System. Download the latest version of PostgreSQL.

After downloading the installer, run the installer on your computer.

Select all options when prompted for the components you wish to install.

images/postgres-component.png

Enter a password:

images/postgres-password.png

Enter a port number. Note: PostgreSQL typically uses the default port 5432.

When installed successfully, you should be able to find pgAdmin 4 in your applications.

images/postgres-success.png

2. Clone Repository

Clone the GitHub repository using the options from the < > Code drop-down menu here:

clone repo button

3. Get App Token

In order to execute the pipeline, an App Token is needed to send requests through the API. To obtain the App Token, please follow the link here.

4. Create .env file

Create a .env file in the main directory with the following format:

# City of Chicago App Token
APP_TOKEN=<App Token>

# db
DB_USERNAME=<postgresql db username>
DB_PASSWORD=<postgresql db password>
SERVER_NAME=<servername>  # usually `localhost`
DATABASE_NAME=<name of db>
PORT=<port>  # usually 5432
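For reference, here is a minimal sketch of how these variables might be consumed, assuming python-dotenv and SQLAlchemy 2.x; the project's actual configuration code may differ.

```python
import os

from dotenv import load_dotenv
from sqlalchemy import URL, create_engine

load_dotenv()  # reads the .env file from the current working directory

app_token = os.environ["APP_TOKEN"]  # passed to the API requests

connection_url = URL.create(
    drivername="postgresql+psycopg2",
    username=os.environ["DB_USERNAME"],
    password=os.environ["DB_PASSWORD"],
    host=os.environ["SERVER_NAME"],
    port=int(os.environ["PORT"]),
    database=os.environ["DATABASE_NAME"],
)
engine = create_engine(connection_url)
```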

5. Run pipeline

Once the .env file is created with the relevant App Token and database information, the pipeline can be run using Python. Assuming Python is installed on your system, run the following command in the terminal from the project directory:

python -m etl_project.pipeline

Depending on your system, it will take at least 5 minutes to run. On the first run, the pipeline will incrementally backfill the database with all the available crime records from the prior year. On subsequent runs, it will upsert data based on the currently stored maximum updated_at date.


AWS Screenshots

Docker was used to containerize the pipeline, which was then pushed to AWS ECR and run on AWS ECS. Below are screenshots of this:

AWS ECS

Screenshot of task running:

ECS task running

Screenshot of logs:

ECS logs

AWS ECR

Screenshot of image in ECR repository:

ECR

AWS RDS

Screenshot of database in RDS:

RDS

S3

Screenshot of the .env file in S3, which was used in the task definition that ran the pipeline in ECS:

S3