NihaRGali / niha_606

This is my final project

PROJECT REPORT :

Topic :

Understanding the contributing factors in the New York City vehicle crashes dataset

Historical Background :

Every day, people are gravely injured in car accidents. According to the US Department of Transportation, 943 people were killed in car accidents in New York State in 2018, which equates to nearly three people per day. Many of these accidents could have been avoided entirely. The personal injury lawyers at Rosenbaum & Rosenbaum, P.C. hold that understanding New York City vehicle accident data is critical to keeping people safe on the roadways. According to the New York Police Department, there were 7,456 motor vehicle accidents in June 2020 alone, at a time when the entire city remained largely shut down in response to the COVID-19 pandemic.

NYPD data provides a monthly statistical analysis of the most common causes of NYC car accidents. Some of the common causes of NYC car accidents for June 2020 include:

Problem Statement :

Understand the New York vehicle crashes dataset and develop a multiclass classifier that predicts the contributing factor for each crash.

About the data :

The Motor Vehicle Collisions vehicle table contains details on each vehicle involved in a crash. Each row represents a motor vehicle involved in a crash. The data in this table goes back to April 2016 (covering 2016-2021), when crash reporting switched to an electronic system.

The Motor Vehicle Collisions data tables contain information from all police-reported motor vehicle collisions in NYC. The police report (MV104-AN) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1,000 worth of damage.

Data Source :

Data Description and Overview :

Column Name Description

Questions for Analysis :

Analysis :

Data Cleaning and Engineering :

Observations :

EDA :

Observations :

Plan for ML :

Models and Approach :

Selecting the right model is a big challenge in machine learning. Since my problem is a classification task, predicting the contributing factor from the input variables, I am interested in using the algorithms below to achieve good performance.

Evaluating performance of models :

Depending on the type of machine learning problem (classification, clustering, or regression), various statistics and visualizations can be generated, including accuracy, the confusion matrix, the receiver operating characteristic (ROC) curve, cluster distortion, and mean squared error (MSE). I would like to implement the models stated above and, based on accuracy, recall, and overall performance, pick the model that best suits the data.
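To make the classification metrics named above concrete, here is a minimal sketch of accuracy, confusion counts, and per-class recall for a multiclass problem. It uses only the Python standard library. The labels `factor_a`, `factor_b`, and `factor_c` and the predictions are hypothetical placeholders, not values or results from the crash dataset.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its rows predicted correctly."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}

def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so rare classes count equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Hypothetical labels and predictions, for illustration only.
y_true = ["factor_a", "factor_a", "factor_a", "factor_b", "factor_b", "factor_c"]
y_pred = ["factor_a", "factor_a", "factor_a", "factor_a", "factor_b", "factor_a"]

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
print(macro_recall(y_true, y_pred))
```

In this toy example the accuracy looks reasonable while the rarest class is never predicted correctly, which is why recall is worth tracking alongside accuracy when the classes are imbalanced.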

Summary and Future Scope :

I would choose the model that most accurately identifies the less frequently occurring classes, rather than defaulting to the more complex algorithms. For the future of the project, I would like to identify additional data sources that give a bigger picture of the major contributing factors of accidents and merge them with the existing dataset. I would also like to build machine learning models that classify the contributing factor for an accident more accurately.
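One possible way to apply the selection rule described above is to score each candidate model by its mean recall on the rarest classes and keep the highest scorer. The sketch below assumes each model's predictions are already available as a list. The names `pick_model`, `rare_classes`, `simple`, and `complex`, along with the labels and predictions, are hypothetical and chosen only for illustration.

```python
from collections import Counter

def rare_classes(y_true, k=2):
    """Return the k least frequent labels (ties broken alphabetically)."""
    counts = Counter(y_true)
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return [label for label, _ in ranked[:k]]

def mean_recall_on(y_true, y_pred, labels):
    """Average recall over the given labels only."""
    recalls = []
    for label in labels:
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

def pick_model(y_true, predictions_by_model, k=2):
    """Pick the model with the best mean recall on the k rarest classes."""
    labels = rare_classes(y_true, k)
    scores = {
        name: mean_recall_on(y_true, preds, labels)
        for name, preds in predictions_by_model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical data: class "a" is common, "b" and "c" are rare.
y_true = ["a", "a", "a", "a", "b", "b", "c"]
predictions_by_model = {
    "simple": ["a", "a", "b", "c", "b", "b", "c"],   # lower accuracy, finds rare classes
    "complex": ["a", "a", "a", "a", "b", "b", "a"],  # higher accuracy, misses class "c"
}

best, scores = pick_model(y_true, predictions_by_model)
print(best, scores)
```

Here the hypothetical `complex` model has the higher overall accuracy, but `simple` is selected because it recovers both rare classes. The choice of `k` is a judgment call that depends on how imbalanced the contributing-factor classes turn out to be.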