UBC-MDS / data-analysis-review-2021


Submission: Group 24: Crime Prediction in Vancouver #8

jasmineortega opened this issue 2 years ago

jasmineortega commented 2 years ago

Submitting authors: @thomassiu, @sy25wang, @RamiroMejia, @jasmineortega

Repository: https://github.com/UBC-MDS/DSCI_522_Crime_Prediction_Vancouver Report link: https://github.com/UBC-MDS/DSCI_522_Crime_Prediction_Vancouver/blob/main/doc/vancouver_crime_predict_report.md

Abstract/executive summary: In this project, we attempted to create a classification model to predict the types of crimes that happen in Vancouver, BC, based on the neighborhood location and time of the crime. Based on our EDA results and model tuning, including the necessary data cleaning tasks, we identified that the Logistic Regression model performed best among all the models tested, based on f1 score. The model's performance on unseen data was not satisfactory, which we believe is due to the weak association between the features (time and location) and crime type. We proposed further improvements for future iterations of model optimization, such as adding relevant data from outside sources (e.g., Vancouver weather and housing data).

Editor: @thomassiu, @sy25wang, @RamiroMejia, @jasmineortega
Reviewer: @lipcai, @arijc76, @zzhzoe, @junrongz

arijeetchatterjee commented 2 years ago

Data analysis review checklist

Reviewer: @arijc76

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

Nice work. I liked this analysis, and it's unfortunate that the quality of the available data is not conducive to getting the desired results. I can think of the following as improvements on the work done so far:

  • Add tests for the script files.
  • Specify the version numbers for tidyverse and knitr (based on personal experience, the version numbers will be useful for any user with an older version to diagnose potential issues in running the analysis).
  • I was unable to run the automation script to reproduce the analysis. I got the error below after following the instructions to install the conda environment and run the script to execute the data analysis pipeline.

Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).
Execution halted

As per Stack Overflow, this error could be solved by inserting the line below into the R script prior to the render command; the placeholder should be replaced with the local pandoc directory (e.g., the path reported by rmarkdown::find_pandoc()). You can investigate this further and incorporate any changes into the installation or data analysis pipeline running instructions.

# point R at the local pandoc installation before calling rmarkdown::render()
Sys.setenv(RSTUDIO_PANDOC="--- insert directory here ---")

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

lipcai commented 2 years ago

Reviewer: @lipcai

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

Good job! The structure of the repo is clean and organized. The research topic is interesting! The scripts and the report are very well structured. Please find my comments below.

Points being done well:

  1. The model could be very useful for predicting the types of crimes in Vancouver and could potentially help guide policing resources in a realistic context as needed.
  2. The data visualizations from the EDA part are formatted beautifully. The scripts, EDA, and coding are well designed and fully described.
  3. The source code was well separated into meaningful functions / modules.

Points could be improved:

  1. The README file is pretty clear, so I could reproduce the entire data analysis. I would suggest adding a flow chart / workflow diagram to the README so readers can get a better understanding of the overall data flow through an adequate graphical representation.
  2. You have a very detailed EDA, but for those interested in the raw data, I would hope that a Data section can be added to the final report to give a summary of the dataset used for this project's analysis (e.g., the number of columns, the type or description of the columns, and the total number of observations).
  3. I think you still need to add tests for the script files; it is one of the standard review criteria (a sketch of what such tests could look like follows this list).
  4. In the src folder, there are too many files at the root level; it would be better to separate and arrange those files into more appropriate subfolders.
  5. As we were taught in 531 (data visualization), you could add a narrative so that it is easy for others to follow along with what you have done here. I am sorry, this one is picky (I really can't find another one, but the review demands at least five points of constructive feedback).
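
On the testing point above, a minimal sketch of what a script-level test could look like, assuming pytest (the helper function and file names are hypothetical, not the authors' actual API):

    # test_preprocess.py -- hypothetical example, run with `pytest`
    import pandas as pd

    from pre_process import filter_crime_types  # hypothetical helper from the pre-processing script

    def test_filter_crime_types_keeps_only_requested_types():
        df = pd.DataFrame({"TYPE": ["Theft from Vehicle", "Mischief", "Theft from Vehicle"]})
        result = filter_crime_types(df, types=["Theft from Vehicle"])
        assert set(result["TYPE"]) == {"Theft from Vehicle"}
        assert len(result) == 2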

Again, great work! It's really hard for me to pick out other points that need to be improved and I got some great ideas for our project after reading yours! Thank you! Linhan

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

zjr-mds commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Well done, guys! Thanks for the impressive project; I really enjoyed reviewing it :) Here are some detailed suggestions, and I hope they are helpful for any improvements!

  1. EDA Figure 2 summary: considering the timeline of this figure (crime evolution from 2016 to 2020), it might be inaccurate to say the steep increase in theft from vehicles in 2018 is related to the start of COVID. --- relevant text from the EDA for potential editing: "This may be due to the start of Covid that causes a series of social problems."
  2. Prediction Report Figure 2: I would suggest switching the x- and y-axis variables so that it is easier for the audience to read the names of these neighbourhoods (see the first sketch after this list).
  3. I noticed that you have multiple versions of the prediction report, EDA (.ipynb, .Rmd, etc.), and README; it should be okay to remove some of the versions to avoid having too many repetitive files. For example, keep only the .html and .md versions of the prediction report, and keep the most informative EDA (I noticed that the .ipynb EDA actually has additional and fancier data visualizations than the .Rmd and .md files).
  4. The name of 'crime_vancouver_eda.py' might need a change; it looks like it is the plot-generating script rather than the EDA report, and this could be confusing since its name is the same as the EDA report's (except for the suffix).
  5. Inconsistent defensive checks across functions: exception handling was implemented for the data pre-processing script but is missing for some of the functions in the modelling script (see the second sketch after this list).
  6. Also, I'm not sure if you're still working on the Makefile since it's not the due date yet, but I ran into the issue below when I ran make all from the terminal:

    $ make all
    make: Nothing to be done for `all'.

    I had the same error before, and it was fixed after I corrected the indentation (Make recipes must be indented with a tab rather than spaces). Just a reminder to double-check this.
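
On point 2 above, a minimal sketch of flipping the encodings so that the neighbourhood names sit on the y-axis, assuming the figures are built with Altair (the data frame and column name are hypothetical stand-ins for the processed crime data):

    import altair as alt
    import pandas as pd

    # hypothetical stand-in for the processed crime data
    crimes = pd.DataFrame({"NEIGHBOURHOOD": ["West End", "Fairview", "Kitsilano", "Fairview"]})

    # horizontal bars: counts on x, neighbourhood names on y, so the labels stay legible
    chart = alt.Chart(crimes).mark_bar().encode(
        x="count()",
        y=alt.Y("NEIGHBOURHOOD:N", sort="-x"),
    )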
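
On point 5 above, a sketch of the kind of input validation that could be mirrored in the modelling script (the function and its checks are hypothetical, not the authors' actual code):

    import os
    import pandas as pd

    # hypothetical helper; the modelling script could guard its inputs the same way
    def read_input(path):
        if not isinstance(path, str):
            raise TypeError(f"path must be a string, got {type(path).__name__}")
        if not os.path.exists(path):
            raise FileNotFoundError(f"input file not found: {path}")
        return pd.read_csv(path)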

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

zzhzoe commented 2 years ago

Data analysis review checklist

Reviewer: @zzhzoe

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Very well done, guys. The research topic is well introduced, and your report has a clear structure to follow along. I was very engaged reading your project overall. Please find my comments below.

Strengths:

  1. Great presentation of your data and explanation of your model, demonstrating that they are tailored to your goal, which is to predict types of crimes based on neighborhood location and time.
  2. A good variety of data visualizations was carefully chosen to clearly deliver the takeaways from the data analysis.

Suggestions:

  1. Your conda environment is good, but when I ran the Makefile, the following error message appeared:
    Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).
    Execution halted.
    make: *** [src/crime_vancouver_eda.md] Error 1
  2. I would suggest adding tests for the script files. It is a small but important step before you run the code.
  3. It would be clearer if you added to the EDA section some information on which dataset and which characteristics you used to produce the EDA visualizations. The EDA section is really great; there is just not enough information on the dataset used.
  4. I would suggest organizing your repo folders further. The current layout is a bit confusing, with different versions of files left unlabeled. I would suggest keeping only the necessary files that can be used to trace the EDA presented in the report.
  5. It would be helpful if you could show your Makefile commands in the Usage section. It's not a red flag by any means, just a nice-to-have quick improvement. Likewise, the dataset link could be added to the report's Data section.
  6. Add npm install -g vega vega-cli vega-lite canvas to the environment instructions, because users may otherwise encounter a JSON decoder error; one reviewer of my group's project ran into this problem (see the sketch after this list).
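
For context on point 6, those npm packages are what the altair_saver "node" export method shells out to when saving Altair charts as static images; a minimal sketch of the export step, assuming altair_saver is the save path used (the chart and filename are hypothetical):

    import altair as alt
    from altair_saver import save  # needs the npm packages above for method="node"

    # hypothetical chart standing in for the EDA figures
    data = alt.Data(values=[{"x": 1, "y": 2}, {"x": 2, "y": 3}])
    chart = alt.Chart(data).mark_point().encode(x="x:Q", y="y:Q")

    # without vega-cli/canvas installed, the export step is where errors tend to surface
    save(chart, "figure.png", method="node")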

Overall, very well done! Such an interesting project and you clearly delivered well. Minor suggestions and lots of good things to learn from.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

RamiroMejia commented 2 years ago

Thank you for your comments!

We really appreciate your feedback. We made the following changes regarding your comments:

  1. Regarding Arijeet's feedback on the issue, we added test.py, which contains tests for the functions used in the script files. Here is the commit: c4d25b3

  2. We added instructions to the README.md file on how to solve the pandoc version issue. The changes are here: cf275a3

  3. As per Li Cai's suggestion, we reorganized the /src folder: f2c116e

  4. Regarding the issue, we moved all the notebooks to the /raw folder: e8fd231

  5. All the files are now in the repository; regarding the issue, the Makefile was added: 71dcef7

sy25wang commented 2 years ago

Data analysis review checklist

Reviewer: @arijc76

Conflict of interest

  • [x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • [x] Installation instructions: Is there a clearly stated list of dependencies?
  • [x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • [x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • [x] Style guidelines: Does the code adhere to well-known language style guides?
  • [x] Modularity: Is the code suitably abstracted into scripts and functions?
  • [ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • [x] Data: Is the raw data archived somewhere? Is it accessible?
  • [x] Computational methods: Is all the source code required for the data analysis available?
  • [x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • [ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • [x] Authors: Does the report include a list of authors with their affiliations?
  • [x] What is the question: Do the authors clearly state the research question being asked?
  • [x] Importance: Do the authors clearly state the importance for this research question?
  • [x] Background: Do the authors provide sufficient background information so that readers can understand the report?
  • [x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • [x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • [x] Conclusions: Are the conclusions presented by the authors correct?
  • [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • [x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

Nice work. I liked this analysis and it's unfortunate that the quality of the available data is not conducive to getting the desired results. I can think of the following as improvements on the work done so far:

  • Add tests for the script files.
  • Specify the version numbers for tidyverse and knitr (based on personal experience, the version numbers will be useful for any user with an older version to diagnose potential issues in running the analysis).
  • I was unable to run the automation script to reproduce the analysis. I got the error below after following the instructions to install the conda environment and run the script to execute the data analysis pipeline.
Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).
Execution halted

As per Stack Overflow, this error could be solved by inserting the line below into the R script prior to the render command; the placeholder should be replaced with the local pandoc directory (e.g., the path reported by rmarkdown::find_pandoc()). You can investigate this further and incorporate any changes into the installation or data analysis pipeline running instructions.

# point R at the local pandoc installation before calling rmarkdown::render()
Sys.setenv(RSTUDIO_PANDOC="--- insert directory here ---")

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Hi Arijeet,

Thank you for your review! We improved our project based on your feedback.

If you have more questions or concerns, please kindly let us know.

sy25wang commented 2 years ago

Reviewer: @lipcai

Conflict of interest

  • [x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • [x] Installation instructions: Is there a clearly stated list of dependencies?
  • [x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • [x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • [x] Style guidelines: Does the code adhere to well-known language style guides?
  • [x] Modularity: Is the code suitably abstracted into scripts and functions?
  • [ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • [x] Data: Is the raw data archived somewhere? Is it accessible?
  • [x] Computational methods: Is all the source code required for the data analysis available?
  • [x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • [x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • [x] Authors: Does the report include a list of authors with their affiliations?
  • [x] What is the question: Do the authors clearly state the research question being asked?
  • [x] Importance: Do the authors clearly state the importance for this research question?
  • [x] Background: Do the authors provide sufficient background information so that readers can understand the report?
  • [x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • [x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • [x] Conclusions: Are the conclusions presented by the authors correct?
  • [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • [x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:

Good job! The structure of the repo is clean and organized. The research topic is interesting! The scripts and the report are very well structured. Please find my comments below.

Points being done well:

  1. The model could be very useful for predicting the types of crimes in Vancouver and could potentially help guide policing resources in a realistic context as needed.
  2. The data visualizations from the EDA part are formatted beautifully. The scripts, EDA, and coding are well designed and fully described.
  3. The source code was well separated into meaningful functions / modules.

Points could be improved:

  1. The README file is pretty clear, so I could reproduce the entire data analysis. I would suggest adding a flow chart / workflow diagram to the README so readers can get a better understanding of the overall data flow through an adequate graphical representation.
  2. You have a very detailed EDA, but for those interested in the raw data, I would hope that a Data section can be added to the final report to give a summary of the dataset used for this project's analysis (e.g., the number of columns, the type or description of the columns, and the total number of observations).
  3. I think you still need to add tests for the script files; it is one of the standard review criteria.
  4. In the src folder, there are too many files at the root level; it would be better to separate and arrange those files into more appropriate subfolders.
  5. As we were taught in 531 (data visualization), you could add a narrative so that it is easy for others to follow along with what you have done here. I am sorry, this one is picky (I really can't find another one, but the review demands at least five points of constructive feedback).

Again, great work! It's really hard for me to pick out other points that need to be improved and I got some great ideas for our project after reading yours! Thank you! Linhan

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Hi @lipcai,

Thank you for your review! We have addressed your comments.

Thanks again for your comments. Please let us know if you have any further questions or concerns.