Submission: Group 10: Predict and Classify the appearance of criminal incidents based on historical incident reports

Submitting authors: Cassandra Zhang, Ethan Kenny, James He, Pragya Singhal

Repository: https://github.com/DSCI-310-2024/DSCI310-group10-project/releases/tag/v3.0.0

Abstract/executive summary:

Law enforcement agencies worldwide prioritize crime prevention and public safety, traditionally relying on experience and intuition for resource allocation. However, advancements in data analysis now enable a more data-driven approach. This analysis aims to predict the appearance of criminal incidents from time period, day of the week, and police district based on data from San Francisco 2023. Understanding time-related crime patterns can inform proactive policing strategies. By associating time periods, police districts, and days of the week with the appearance of criminal incidents, this study aims to provide a forecasting tool for police patrol scheduling and resource allocation, ultimately enhancing law enforcement activities and public safety.

Editor: @ttimbers

Reviewers: Sri Chaitanya Bonthula, Kevin Yu, Shahrukh Islam Prithibi, Viet Ngo

[ ] I agree to abide by DSCI 310's Code of Conduct during the review process.

Data analysis review checklist

Reviewer: KevinatorYu

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1

Review Comments:

The report is good. The research question is intriguing and is a clear example of statistics being applied to policing, whether or not that application is controversial. The data is also very current, so the analysis results are as up to date as possible. The source is reliable as well, since the data comes directly from the government of San Francisco. For the analysis, the test coverage looks good, and the code is easy to follow and uses many good practices. (Perhaps the people who wrote the code are taking, or recently took, CPSC 330? 😄)

However, there are a number of important issues that should be addressed.

There are several errors in the README file. I followed the steps to create a Docker container. In those steps, the <> surrounding the URL is not required. The README also does not recommend running "make clean" before "make all" to remove existing output files so that the Makefile can produce new ones.

The Makefile pipeline does not work. Running src/analysis.py fails with a "too many values to unpack" error, which prevents the analysis from being reproduced.
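For reference, this class of error usually means that a call returns more values than the assignment on the left-hand side expects. The sketch below is only a minimal illustration of that mechanism, not the actual code in src/analysis.py; run_analysis and its three return values are hypothetical names made up for the example.

```python
def run_analysis():
    """Hypothetical stand-in for an analysis step that returns three values."""
    return "model", "cv_scores", "coefficients"


def unpack_two():
    # The caller expects two values but the function returns three,
    # so Python raises ValueError at the assignment.
    model, scores = run_analysis()
    return model, scores


def unpack_three():
    # The number of targets matches the number of returned values.
    model, scores, coefs = run_analysis()
    return model, scores, coefs


try:
    unpack_two()
    message = ""
except ValueError as err:
    message = str(err)

print(message)
print(unpack_three())
```

If the cause is similar, checking that the function's return statement and every call site agree on the number of values should resolve it.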

In the finalized report, no authors are specified. The report also lacks detail on the choice of Logistic Regression (as in, why did your team decide to use Logistic Regression rather than something else, such as a KNN classifier?) and does not communicate any assumptions or limitations of the results. Some references also appear to be missing DOIs (e.g., the very first reference). The styling of the report starts to feel "funky" near the end, especially in the Future Questions section, which reads more like a list than a cohesive paragraph. The impact section could be expanded too.

In summary:

In the README file, the git command for cloning the repository to a local machine has a typo (there should not be a <> surrounding the URL of the .git repo).
The README file does not suggest running "make clean" prior to "make all".
The Makefile does not work. Running "python src/analysis.py" fails with a "too many values to unpack" error.
No authors are listed in the Quarto document.
The report does not justify using Logistic Regression over other potential models, such as KNN classifiers.
The report does not communicate any assumptions or limitations of its results.
References that have a DOI do not list it (e.g., 10.1093/acprof:oso/9780195341966.001.0001 for the first reference).
The report style feels more "funky" by the end, especially in the Future Questions section.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: ShahrukhP15

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

The report is easy to follow and the data comes from a reliable source; however, it is good practice to state the license of the data used so that its terms of use are respected. The Quarto document is easy to follow and includes valuable EDA that gives insight into the data. The authors also present the coefficient selection process with both a table and a figure, which gives valuable information on why the coefficients were selected. Possible improvements include mentioning the limitations, suggesting more advanced modelling techniques to explore performance further, and addressing the issue of false positives.

However:

Makefile: The make clean target works, but make all exits with a ValueError related to unpacking values in analysis.py, which stops the Makefile from running to completion.
Function: The function get_time_period checks that its inputs are integers, but it also needs to check for invalid hour or minute values, such as an hour of -1 or a minute value of 61. Both are integers but fall outside the valid range of a time, so the function should raise an error for any value outside that range.
Test: There should be a test for the issue mentioned above. Right now the function simply reports night for these values.
Warnings: pytest reports warnings for test_perform_analysis.py that can be addressed. While these warnings are not errors, it is good practice to address them to ensure the future compatibility and maintainability of your code.
Although model performance from a set of cross-validations is included, adding further evaluation metrics such as accuracy, precision, recall, and F1 score would give a more comprehensive assessment of the model and make its predictive capabilities and potential limitations easier to understand.
The report also needs to address how bias toward a certain time period might cause more false positives. A confusion matrix would help here, since it gives a clear overview of the model's performance and helps identify imbalances or biases in the predictions. The authors should also explicitly discuss potential biases and false positives in their analysis.
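To make the get_time_period suggestion concrete, here is a rough sketch of the kind of range check and test I have in mind. I have not copied the project's function: get_time_period_checked, its (hour, minute) signature, the period boundaries, and the returned labels are all assumptions made for illustration, and the choice of ValueError for out-of-range integers is only one reasonable option.

```python
def get_time_period_checked(hour, minute):
    """Hypothetical variant of get_time_period with explicit range checks."""
    for name, value in (("hour", hour), ("minute", minute)):
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer, got {type(value).__name__}")
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be between 0 and 23, got {hour}")
    if not 0 <= minute <= 59:
        raise ValueError(f"minute must be between 0 and 59, got {minute}")

    # Assumed boundaries; the project's real cut-offs may differ.
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 24:
        return "Evening"
    return "Night"


def raises(exc_type, func, *args):
    """Small helper: True if calling func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


# Checks in the spirit of the missing tests.
print(raises(ValueError, get_time_period_checked, -1, 0))
print(raises(ValueError, get_time_period_checked, 10, 61))
print(raises(TypeError, get_time_period_checked, "10", 0))
print(get_time_period_checked(2, 30))
```

In the project's test suite the same checks could be written with pytest.raises instead of the small helper used here.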

In short, this is a really exciting project with a really intriguing research question. There are a few coding errors that might have been missed. I also hope you put some emphasis on discussing bias and false positives, as they are serious potential problems that can waste resources and money. All in all, it was a really exciting report to read.
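As a small illustration of the metrics and confusion matrix mentioned above, the sketch below computes them by hand for a binary target. The labels are made-up toy values, not results from this project, and the function names are hypothetical; in practice scikit-learn's confusion_matrix and classification_report would likely be the more convenient route.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary target."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # predicted an incident that did not occur
        elif truth == positive:
            fn += 1  # missed an incident that did occur
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from the four counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only (1 = incident, 0 = no incident).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
metrics = summarize(*counts)
print(counts)
print(metrics)
```

Reporting the false positive count alongside precision would make it easier to see how often patrol resources would be sent where no incident occurs.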

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: actually-arri

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[ ] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1

Review Comments:

The project overall seems pretty well put together. It follows a lot of the principles and guidelines we were taught in class, and I find your research question quite interesting. I can see potential areas where you could further strengthen the project, such as accounting for data bias (potential historical law enforcement bias), seasonal changes in crime, and ethical issues around privacy and possibly perpetuating discrimination. I do understand that this is at a very early stage, and as such it is a strong base for the project. Another potential area for improvement would be better commit messages. Some of them were very vague and might cause confusion down the line.

Below are the minor issues I came across:

Community Guidelines: The contributing document was well made, but there could be clearer resources for seeking support. There is currently no easy way of reaching the authors other than making changes or adding something to the repository.
Lacks References: I see a couple of references in the references.bib file but none in the Quarto document.
Missing conclusion: There doesn't seem to be any formal conclusion, closing thoughts, or limitations presented in the Quarto doc.
Missing Authors: Authors and their affiliations aren't mentioned in the Quarto doc.
Methods: The Quarto report has about 2 figures and 3 tables. There is room for more figures and tables, for example ones giving more insight into variable selection or into how different models performed (if multiple were tried). There could also be more figures exploring the selected variables.
Automation issue: I was not able to generate the reports using the Makefile, so I had to rely on the reports provided in the repo.

In summary, very well done project with a motivating research question. Apart from minor nitpicks I think you have applied what you have learned well and will go on to lead some amazing projects and teams. Good luck for the final :)

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

DSCI-310-2024 / data-analysis-review-2024