UBC-MDS / data-analysis-review-2021


Submission: Group 21: Motor Vehicle Collision Fatality Predictor #29


danfke commented 2 years ago

Submitting authors: @iamMoid @SiqiTao @gn385x @danfke

Repository: https://github.com/UBC-MDS/Collision_Prediction

Report link: https://github.com/UBC-MDS/Collision_Prediction/blob/main/doc/collision_prediction_report.md

Abstract/executive summary:

In this project we attempt to build a classification model using the logistic regression algorithm and data obtained from police-reported motor vehicle collisions on public roads in Canada to predict whether a motor vehicle collision would result in a fatality or not. The final model performed poorly on both the training set and the test set, returning a high recall of 0.698, but a very low precision of 0.048, resulting in a low f1-score of 0.09. The impact of the low precision can be seen in the results of the prediction of the test set, where the model incorrectly predicts fatalities around 20 times more than it correctly predicts fatalities.

The data set that was used in this project came from the National Collision Database, published by Transport Canada. The National Collision Database contains data on all of the police-reported motor vehicle collisions on public roads in Canada from 1999 to the most recent available data from 2017. We ran our analysis using the data collected from collisions that occurred in 2017. This data set contains information licensed under the Open Government Licence – Canada.

Editor: @flor14

Reviewers: PUGHAZHENDHI_GAUTHAM, Ahn_Kyle, Fairbrother_Gabriel, Wang_Joyce

gfairbro commented 2 years ago

Data analysis review checklist

Reviewer: @gfairbro

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Particularly Well:

  1. I thought the data set selected was interesting and more ambitious than most. Many groups stuck to the UCI repository, so branching out and finding something different is a real plus. I also liked that they used a technique we haven't covered in class (undersampling).

  2. Their report was clear and easy to read and understand.

  3. Their code is also quite well written, neat and easy to follow.

Improvements:

  1. The inclusion of an environment file is a nice touch, but using the --from-history flag when generating it (i.e., conda env export --from-history) would allow people to use it across different operating systems. As far as I can tell, this file was exported from a Windows environment. Directions in the Usage section on how to install the environment would be helpful as well.

  2. It appears the source data is no longer available (at least temporarily). Adding it to the repo or providing a source and making note of that would be helpful.

  3. Not having a data directory skeleton means the scripts failed, so including those directories (even empty) would be a good idea. I see that the script should try to create them, but it didn't work for me until I created the path manually; maybe because it is two levels deep?

  4. The environment is missing something for the EDA on Windows. I know this because our project is too! In order to output PNGs from Altair on Windows you need to run the following, and it doesn't seem to get captured properly when you export an environment.yaml file: npm install -g vega vega-cli vega-lite canvas (see lecture 2 from DSCI 531 for details; scroll down to the warning): https://pages.github.ubc.ca/mds-2021-22/DSCI_531_viz-1_students/lectures/2-data-types_graphical-marks_visual-encondings.html#global-development-data

I had this error and can confirm that this fix worked for your project.
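On the directory point above: a minimal sketch of how a script can create a nested output path before writing, assuming the project's scripts write to a two-level path like data/raw/ (the actual path and file names here are illustrative). Path.mkdir with parents=True creates all missing intermediate levels, which a plain single-level mkdir would not:

```python
from pathlib import Path


def ensure_output_dir(out_file: str) -> Path:
    """Create any missing parent directories for an output file path."""
    out_path = Path(out_file)
    # parents=True creates intermediate levels (e.g. data/ and then data/raw/);
    # exist_ok=True makes the call a no-op when the directories already exist.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    return out_path


ensure_output_dir("data/raw/collisions_2017.csv")
```

Calling this at the top of each script would make the pipeline robust to a missing directory skeleton in a fresh clone.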

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

gauthampughazhendhi commented 2 years ago

Data analysis review checklist

Reviewer: @gauthampughaz

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Positives:

Improvements:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

jo4356 commented 2 years ago

Data analysis review checklist

Reviewer: @jo4356

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. I wasn't able to download the data. It seems the data has been removed?
  2. The EDA is very detailed and nicely explained. You clearly explained the thought process of going from one part of the EDA to the next.
  3. The eda.py script has everything inside if __name__ == "__main__":. It might be a bit more readable, and consistent with your other scripts, to wrap it in a main function, since all of your other scripts have one.
  4. I agree with the previous reviewer that the code can be broken down into smaller functions, maybe based on different steps of the analysis.
  5. Adding to the previous point, having docstrings for the smaller functions, and moving the inline comments into those docstrings, would make the code much more readable.
  6. The final report is well organized, and I like that you described how you can further improve on your model.
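On point 3 above: a hypothetical sketch of the suggested restructuring (the function names and paths are illustrative, not the project's actual code). The EDA logic moves out of the __main__ guard into a main() that delegates to small, documented helpers:

```python
def make_plots(data_path: str, out_dir: str) -> str:
    """Placeholder for the EDA plotting logic (illustrative only)."""
    return f"plots for {data_path} saved to {out_dir}"


def main(data_path: str = "data/processed/train.csv",
         out_dir: str = "results/") -> str:
    """Run the EDA end to end, one helper call per step."""
    # Each step of the EDA becomes a call to a small, documented function,
    # mirroring the structure of the project's other scripts.
    return make_plots(data_path, out_dir)


if __name__ == "__main__":
    main()
```

This also makes the script importable for testing, since nothing runs as a side effect of import.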

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

AraiYuno commented 2 years ago

Data analysis review checklist

Reviewer: @AraiYuno

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Positives
Improvements
Overall, I would give 9.8 out of 10!! Near-perfect repo! Great work team 21!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

danfke commented 2 years ago

Thank you everyone for the great feedback! We've addressed many of the points provided, including the following:

  1. In regards to comment number 2 of @gfairbro's review (an issue which was also mentioned in subsequent reviews), we addressed this by linking to another URL containing the same data, as per Tiffany’s directions: 503c540.

  2. In regards to comment number 4 of @gfairbro's review, we added a line in the Usage section of our README explaining that Windows users may have to run an additional command prior to running the Makefile in order to properly render PNGs: 630613e

We also added it to our Dockerfile so that it runs properly: 4d287a5

  3. In regards to comment number 3 of @gauthampughaz's review, we added assert statements to all of the scripts that did not already have some form of testing: c1731a3 677fad9 45414ce 3465f96

  4. In regards to comment number 3 of @jo4356's review, we addressed this here: 75d1262
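As an illustration of the kind of assert-based checks mentioned above (the actual checks live in the linked commits; the column name here follows the National Collision Database's C_SEV severity field, but is used only as an example):

```python
def check_download(rows: list) -> bool:
    """Lightweight sanity checks on a freshly downloaded data set."""
    # Fail fast if the download produced no rows at all.
    assert len(rows) > 0, "Downloaded data set is empty"
    # Every row should carry the severity column used to derive the target.
    assert all("C_SEV" in row for row in rows), "Missing severity column"
    return True


check_download([{"C_SEV": 1}, {"C_SEV": 2}])
```

Placed at the end of a download or preprocessing script, checks like these catch a removed or reshaped upstream data source before it silently breaks the rest of the pipeline.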