NihaRGali/niha_606 - Githubissues

PROJECT REPORT :

Link to my presentation : https://drive.google.com/file/d/1-ZGlZQWmUhD38TFY8Y-CtWKrDBViLWyG/view?usp=sharing
Link to my video : https://youtu.be/CyxAzJt6Yls

Topic :

Understanding the contributing factors in the New York City vehicle crashes dataset

Historical Background :

Every day, people are gravely injured in car accidents. According to the US Department of Transportation, 943 people were killed in car accidents in New York State in 2018. This equates to about three people per day. Many of these accidents could have been avoided entirely. Personal injury lawyers at Rosenbaum & Rosenbaum, P.C. argue that understanding New York City vehicle accident data is critical to keeping people safe on the roads. According to the New York Police Department, there were 7,456 motor vehicle accidents in June 2020 alone, at a time when the entire city remained largely shut down in response to the COVID-19 pandemic.

NYPD data provides a monthly statistical analysis of the most common causes of NYC car accidents. Some of the common causes of NYC car accidents for June 2020 include:

Driver distraction (2,127 crashes)
Following too closely (577 crashes)
Failure to yield the right of way (484 crashes)
Improper passing or lane usage (311 crashes)
Speeding (293 crashes)
Drunk driving or illegal drug use (138 crashes)

Problem Statement :

Understand the New York vehicle crashes dataset and develop a multiclass classifier that predicts the contributing factor for each crash.

About the data :

The Motor Vehicle Collisions vehicle table contains details on each vehicle involved in a crash. Each row represents a motor vehicle involved in a crash. The data in this table goes back to April 2016, when crash reporting switched to an electronic system, and covers 2016-2021.

The Motor Vehicle Collisions data tables contain information from all police reported motor vehicle collisions in NYC. The police report (MV104-AN) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1000 worth of damage.

Data Source :

https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Vehicles/bm4k-52h4
Size : 114.4MB
Rows : 3.7M
Columns : 25
Each row represents a motor vehicle involved in a crash.
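
A minimal sketch of loading the vehicles table with pandas, assuming the dataset has been exported as CSV from the link above. The helper `load_vehicles` is hypothetical, the `MM/DD/YYYY` date format is an assumption about the export, and the in-memory sample rows are made up so the snippet runs without the 114.4MB file.

```python
import io

import pandas as pd


def load_vehicles(source):
    """Read the vehicles table from a CSV path or file-like object."""
    # Read everything as strings first so IDs and free-text fields are not mangled.
    df = pd.read_csv(source, dtype=str)
    # Assumption: dates are exported as MM/DD/YYYY; unparseable values become NaT.
    df["CRASH_DATE"] = pd.to_datetime(
        df["CRASH_DATE"], format="%m/%d/%Y", errors="coerce"
    )
    return df


# Tiny made-up sample standing in for the real export.
sample = io.StringIO(
    "UNIQUE_ID,COLLISION_ID,CRASH_DATE,CRASH_TIME,CONTRIBUTING_FACTOR_1\n"
    "1,100,06/01/2020,14:30,Unspecified\n"
    "2,100,06/01/2020,14:30,Following Too Closely\n"
)
vehicles = load_vehicles(sample)
print(vehicles.shape)
```

With the real export, the same function could be called with the path of the downloaded CSV instead of the sample.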

Data Description and Overview :

| Column Name | Description |
| --- | --- |
| UNIQUE_ID | Unique record code generated by system. |
| COLLISION_ID | Crash identification code. Foreign Key, matches unique_id from the Crash table. |
| CRASH_DATE | Occurrence date of collision |
| CRASH_TIME | Occurrence time of collision |
| VEHICLE_ID | Vehicle identification code assigned by system |
| STATE_REGISTRATION | State where vehicle is registered. |
| VEHICLE_TYPE | Type of vehicle based on the selected vehicle category |
| VEHICLE_MAKE | Vehicle make |
| VEHICLE_MODEL | Vehicle model |
| VEHICLE_YEAR | Year the vehicle was manufactured |
| TRAVEL_DIRECTION | Direction vehicle was traveling |
| VEHICLE_OCCUPANTS | Number of vehicle occupants |
| DRIVER_SEX | Gender of driver |
| DRIVER_LICENSE_STATUS | License, permit, unlicensed |
| DRIVER_LICENSE_JURISDICTION | State where driver license was issued. |
| PRE_CRASH | Pre-crash action: Going straight, making right turn, passing, backing, etc. |
| POINT_OF_IMPACT | Location on the vehicle of the initial point of impact |
| VEHICLE_DAMAGE | Location on the vehicle where most of the damage occurred |
| VEHICLE_DAMAGE_1 | Additional damage locations on the vehicle |
| VEHICLE_DAMAGE_2 | Additional damage locations on the vehicle |
| VEHICLE_DAMAGE_3 | Additional damage locations on the vehicle |
| PUBLIC_PROPERTY_DAMAGE | Public property damaged (Yes or No) |
| PUBLIC_PROPERTY_DAMAGE_TYPE | Type of public property damaged (ex. Sign, fence, light post, etc.) |
| CONTRIBUTING_FACTOR_1 | Factors contributing to the collision for designated vehicle |
| CONTRIBUTING_FACTOR_2 | Factors contributing to the collision for designated vehicle |

Questions for Analysis :

Understanding the contributing factor
- The most frequent categories are:
- Unspecified 2121223
- Driver Inattention/Distraction 441483
- Failure to Yield Right-of-Way 122919
- Following Too Closely 113854
- Other Vehicular
To find the extent of property damage
- More than 90% of the records have 'No', meaning that in more than 90% of crashes there was no public property damage.
What part of the vehicle was damaged most?
- Most of the damage occurred at the front end of the car compared to other parts.
What vehicles were involved in the accident?
Identify driver's license and sex distribution
- About 95% of the people involved in an accident are licensed.
Accident distribution over time
- I observed that during the years 2017 and 2018 there was an average of 175,000 accidents.
- There is a decrease in the number of accidents in 2020 and 2021 due to the pandemic.
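
Answers like the ones above can be obtained with simple frequency counts. The following is a rough sketch, not the notebook code: the helper `category_share` is hypothetical and the rows are made up for illustration, so the printed numbers do not match the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the real table has about 3.7M rows.
df = pd.DataFrame({
    "CONTRIBUTING_FACTOR_1": [
        "Unspecified",
        "Driver Inattention/Distraction",
        "Unspecified",
        "Following Too Closely",
    ],
    "PUBLIC_PROPERTY_DAMAGE": ["No", "No", "No", "Yes"],
})


def category_share(frame, column):
    """Return the share of each category in `column`, ignoring missing values."""
    return frame[column].value_counts(normalize=True)


# Absolute counts per contributing factor, most frequent first.
factor_counts = df["CONTRIBUTING_FACTOR_1"].value_counts()
# Fraction of records with and without public property damage.
damage_share = category_share(df, "PUBLIC_PROPERTY_DAMAGE")

print(factor_counts)
print(damage_share)
```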

Analysis :

Data Cleaning and Engineering :

Observations :

Very few columns, such as UNIQUE_ID, COLLISION_ID, CRASH_DATE, and CRASH_TIME, have no missing data.
Columns such as VEHICLE_MODEL, VEHICLE_DAMAGE_1, VEHICLE_DAMAGE_2, VEHICLE_DAMAGE_3, and PUBLIC_PROPERTY_DAMAGE_TYPE have more than 65% missing values.
Reduced the number of rows with the value 'Unspecified' by 49.8%.
More than 90% of the records have 'No', meaning that in more than 90% of crashes there was no public property damage.
Most of the damage occurred at the front end of the car compared to other parts.
A large share of the cars involved in an accident were either going straight or parked.
About 95% of the people involved in an accident are licensed.
The majority of accidents were caused by males.
In 80% of the cases there was only one person in the vehicle involved in the accident.
Dropping the unnecessary columns
- UNIQUE_ID, COLLISION_ID, and VEHICLE_ID as they are unique and don't add value for this analysis.
- CRASH_DATE, CRASH_TIME, VEHICLE_YEAR, CONTRIBUTING_FACTOR_2, and VEHICLE_MAKE as we extracted the main information from them.
- VEHICLE_MODEL, PUBLIC_PROPERTY_DAMAGE, PUBLIC_PROPERTY_DAMAGE_TYPE as most of the data is missing.
- STATE_REGISTRATION, DRIVER_LICENSE_STATUS, TRAVEL_DIRECTION, and DRIVER_LICENSE_JURISDICTION as data is highly skewed
- VEHICLE_DAMAGE_cleaned, as POINT_OF_IMPACT and VEHICLE_DAMAGE_cleaned are very highly correlated.
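
The missing-value check and the column drops described above could look roughly like this. It is a sketch under assumptions: the helper `drop_sparse_columns` and the 0.65 threshold (taken from the "more than 65%" observation) are illustrative, and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, threshold=0.65):
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = frame.isna().mean()
    sparse = missing_fraction[missing_fraction > threshold].index.tolist()
    return frame.drop(columns=sparse), missing_fraction


# Made-up rows for illustration only.
df = pd.DataFrame({
    "UNIQUE_ID": [1, 2, 3, 4],
    "VEHICLE_MODEL": [np.nan, np.nan, np.nan, "CIVIC"],
    "POINT_OF_IMPACT": [
        "Center Front End", None, "Center Back End", "Center Front End",
    ],
})

# Identifier columns listed in the report as not adding value to the analysis.
ID_COLUMNS = ["UNIQUE_ID", "COLLISION_ID", "VEHICLE_ID"]

cleaned, missing_fraction = drop_sparse_columns(df)
# errors="ignore" skips identifier columns that are absent from this small sample.
cleaned = cleaned.drop(columns=ID_COLUMNS, errors="ignore")

print(missing_fraction)
print(list(cleaned.columns))
```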

EDA :

Observations :

I observed that during the years 2017 and 2018 there was an average of 175,000 accidents.
There is a decrease in the number of accidents in 2020 and 2021 due to the pandemic.
I observed that there was a sudden spike in the number of accidents in the months of May and June.
Contrary to common belief, most of the accidents took place in the afternoon.
Lack of attention was the major contributor to accidents, followed by following too closely, rather than DUI or DWI.
For accidents with damage at the front end, the driver was going straight.
For accidents with damage at the back end, the driver was mostly backing the car.
When there was only one person in the vehicle, lack of attention and speeding were the causes of accidents.
Compared to women, men cause many more accidents due to distraction.
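
Observations such as the time-of-day pattern and the link between point of impact and pre-crash action can be checked with a count per hour and a cross-tabulation. This is a minimal sketch on made-up rows; it assumes CRASH_TIME is stored as `H:MM` or `HH:MM` text.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "CRASH_TIME": ["14:30", "15:05", "8:15", "14:59"],
    "POINT_OF_IMPACT": [
        "Center Front End", "Center Back End",
        "Center Front End", "Center Front End",
    ],
    "PRE_CRASH": [
        "Going Straight Ahead", "Backing",
        "Going Straight Ahead", "Making Left Turn",
    ],
})

# Take the hour as the part of the time string before the colon.
hour = df["CRASH_TIME"].str.split(":").str[0].astype(int)
accidents_per_hour = hour.value_counts().sort_index()

# Rows: where the vehicle was hit; columns: what the driver was doing.
impact_by_action = pd.crosstab(df["POINT_OF_IMPACT"], df["PRE_CRASH"])

print(accidents_per_hour)
print(impact_by_action)
```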

Plan for ML :

Categorical columns : 'VEHICLE_TYPE','DRIVER_SEX','POINT_OF_IMPACT','MAKE','Month','Week','Hour'
Numerical columns : 'how_old'
x(independent variables) : 'VEHICLE_TYPE','DRIVER_SEX','POINT_OF_IMPACT','MAKE','Month','Week','Hour','how_old'
y(dependent variable) : 'CONTRIBUTING_FACTOR_1'
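
The derived columns in this plan are not defined in the report, so the sketch below rests on loudly stated assumptions: `Month` and `Hour` come from CRASH_DATE and CRASH_TIME, `Week` is taken to be the day of the week (0 = Monday), `how_old` is the crash year minus VEHICLE_YEAR, and `MAKE` is a trimmed, upper-cased VEHICLE_MAKE. The helper `add_features` is hypothetical and the rows are made up.

```python
import pandas as pd


def add_features(frame):
    """Derive the model inputs named in the plan (definitions are assumptions)."""
    out = frame.copy()
    crash_date = pd.to_datetime(out["CRASH_DATE"], format="%m/%d/%Y")
    out["Month"] = crash_date.dt.month
    # Assumption: 'Week' means day of the week, with Monday = 0.
    out["Week"] = crash_date.dt.dayofweek
    out["Hour"] = out["CRASH_TIME"].str.split(":").str[0].astype(int)
    # Assumption: vehicle age in years at the time of the crash.
    out["how_old"] = crash_date.dt.year - out["VEHICLE_YEAR"]
    # Assumption: normalise free-text makes so 'toyota' and 'TOYOTA' match.
    out["MAKE"] = out["VEHICLE_MAKE"].str.strip().str.upper()
    return out


# Made-up rows for illustration only.
df = pd.DataFrame({
    "CRASH_DATE": ["06/01/2020", "12/25/2018"],
    "CRASH_TIME": ["14:30", "8:05"],
    "VEHICLE_YEAR": [2015, 2018],
    "VEHICLE_MAKE": [" toyota", "Honda "],
})
features = add_features(df)
print(features[["Month", "Week", "Hour", "how_old", "MAKE"]])
```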

Models and Approach :

Selecting the right model is a big challenge in machine learning. Since my problem is a multiclass classification task, predicting the contributing factor from the input variables, I am interested in using the algorithms below to achieve good performance.

Logistic regression
Decision tree
Random forest
Gradient Boosting Classifier
Ada Boost Classifier
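
One way to set up the five candidate models with scikit-learn is to wrap each in the same preprocessing pipeline. This is a sketch, not the notebook code: one-hot encoding, scaling, default hyperparameters, and the helper `build_pipelines` are all assumptions, and the training rows are made up. The sample has no missing values; the real data would also need imputation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

CATEGORICAL = ["VEHICLE_TYPE", "DRIVER_SEX", "POINT_OF_IMPACT",
               "MAKE", "Month", "Week", "Hour"]
NUMERICAL = ["how_old"]


def build_pipelines(random_state=0):
    """Return one preprocessing + model pipeline per candidate algorithm."""
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=random_state),
        "random_forest": RandomForestClassifier(random_state=random_state),
        "gradient_boosting": GradientBoostingClassifier(random_state=random_state),
        "ada_boost": AdaBoostClassifier(random_state=random_state),
    }
    pipelines = {}
    for name, model in models.items():
        preprocess = ColumnTransformer([
            # Categories unseen during fitting are encoded as all zeros.
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
            ("num", StandardScaler(), NUMERICAL),
        ])
        pipelines[name] = Pipeline([("preprocess", preprocess), ("model", model)])
    return pipelines


# Made-up rows for illustration only.
X = pd.DataFrame({
    "VEHICLE_TYPE": ["Sedan", "Taxi", "Sedan", "Bus", "Taxi", "Sedan"],
    "DRIVER_SEX": ["M", "F", "M", "M", "F", "F"],
    "POINT_OF_IMPACT": ["Center Front End", "Center Back End"] * 3,
    "MAKE": ["TOYOTA", "FORD", "HONDA", "FORD", "TOYOTA", "HONDA"],
    "Month": [1, 6, 6, 12, 5, 5],
    "Week": [0, 1, 2, 3, 4, 5],
    "Hour": [8, 14, 15, 14, 23, 9],
    "how_old": [5, 2, 10, 0, 7, 3],
})
y = pd.Series([
    "Unspecified", "Following Too Closely", "Unspecified",
    "Driver Inattention/Distraction", "Following Too Closely",
    "Driver Inattention/Distraction",
], name="CONTRIBUTING_FACTOR_1")

pipelines = build_pipelines()
predictions = {name: pipe.fit(X, y).predict(X) for name, pipe in pipelines.items()}
```

Keeping the preprocessing inside each pipeline means the encoder is fitted only on the training rows when the pipelines are later used with a train/test split.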

Evaluating performance of models :

Depending on the type of machine learning problem (classification, clustering, or regression), different statistics and visualizations are used, including accuracy, the confusion matrix, the receiver operating characteristic (ROC) curve, cluster distortion, and mean squared error (MSE). Because this is a classification problem, I would implement the models stated above and pick the one that best suits the data based on accuracy, recall, and overall performance.
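
A rough sketch of how the candidate models could be compared on accuracy and recall. Using macro-averaged recall as the selection criterion is an assumption, chosen because it weights rare classes equally; the helper `evaluate`, the label list, and both prediction lists are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score


def evaluate(y_true, y_pred, labels):
    """Collect accuracy, macro-averaged recall, and the confusion matrix."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Macro averaging gives every class the same weight, however rare it is.
        "macro_recall": recall_score(
            y_true, y_pred, labels=labels, average="macro", zero_division=0
        ),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels),
    }


labels = ["Unspecified", "Driver Inattention/Distraction", "Following Too Closely"]
y_true = ["Unspecified", "Unspecified", "Unspecified",
          "Driver Inattention/Distraction", "Following Too Closely"]

# Made-up predictions: a majority-class baseline and a hypothetical model.
predictions = {
    "majority_baseline": ["Unspecified"] * 5,
    "hypothetical_model": ["Unspecified", "Unspecified",
                           "Driver Inattention/Distraction",
                           "Driver Inattention/Distraction",
                           "Following Too Closely"],
}

scores = {name: evaluate(y_true, y_pred, labels)
          for name, y_pred in predictions.items()}
best = max(scores, key=lambda name: scores[name]["macro_recall"])

for name, result in scores.items():
    print(name, round(result["accuracy"], 3), round(result["macro_recall"], 3))
print("best by macro recall:", best)
```

In this toy comparison the baseline reaches reasonable accuracy purely by predicting the dominant class, which is why recall is reported alongside accuracy.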

Summary and Future Scope

I would choose the model that identifies the less frequently occurring classes most accurately, rather than defaulting to the most complex algorithm. For the future of the project, I would like to identify additional data sources that give a bigger picture of the major contributing factors of accidents and merge them with the existing dataset. I would also like to build machine learning models that classify the contributing factor for an accident more accurately.