Submission: Group 1: Predicting Fatalities from Tornado Data

Submitting authors: Erika Delorme, Marcela Flaherty, Riddha Tuladhar, Edwin Yeung

Repository: https://github.com/DSCI-310-2024/DSCI-310-Group-1-Predict-Fatalities-From-Tornado-Data/tree/0.0.3

Abstract/executive summary:

In our project, we attempt to build a multiple linear regression model that predicts the number of fatalities from tornadoes using two features: the width (yards) and length (miles) of the tornado. We tested our multiple linear regression model with and without outliers and compared the differences in coefficients and RMSPE scores. Both models had low positive coefficients, suggesting a minimal yet positive impact on the prediction of tornado fatalities, and both had low RMSPE scores, suggesting a low amount of error in their predictions. The model without outliers had a lower RMSPE score, which is partly explained by the removal of outliers: the model makes predictions over a smaller range, which reduces the error. Despite the limitations of our model, we believe that it can still have some utility in predicting tornado fatalities with little error. However, the model should be improved before being deployed, to increase the size of the coefficients and its predictive power. In the future, we may consider exploring other features for predicting fatalities, predicting the number of injuries from the same features, or even predicting the number of casualties (injuries and fatalities) from the same and additional features.

The data set that was used in this project is from the US NOAA's National Weather Service Storm Prediction Center Severe Weather Maps, Graphics, and Data Page. It was tidied and sourced from TidyTuesday and can be found here. Each row represents a tornado, along with various features, including width, length, date, time, state in the US, magnitude, financial losses, number of fatalities, number of injuries, etc.

Editor: @ttimbers

Reviewers: Andrea Jackman, James He, Neha Menon

[ ] I agree to abide by DSCI 310's Code of Conduct during the review process.

Data analysis review checklist

Reviewer:

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[ ] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 3

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Good job on this repository; everyone contributed and engaged in committing and pushing to the repository. Applause to everyone’s effort. The README.md is original and easy to read, but there is one issue: the link to the final report is missing. I think this is because analysis.ipynb has been deleted from the repository.
- Since the analysis.ipynb file is missing, I had some trouble finding the group’s research topic and question. I was only able to find them after cloning the repository and opening tornado_fatalities_predictory.html to access the report that way.
The usage instructions for running with docker-compose.yml are very detailed, but there are no instructions for using the environment.yml, even though the file still exists in the root of the repository. Anyone who wants to create an environment for this project would run into trouble if they don’t know how to create one with renv.
I noticed an inconsistency in the script names 01_download_data.r, 02_clean_preprocess_data.R, and 03_eda.R: the extension of 01_download_data.r is lowercase (.r) rather than uppercase (.R) like the other scripts.
Inside the docs folder, there is only a tornado_fatalities_predictor.html but no tornado_fatalities_predictor.pdf.

Just out of curiosity: I wonder whether linear regression is the most suitable method for predicting tornado fatalities compared to other prediction models, e.g. decision trees or an SVM with an RBF kernel. Factors such as population density and infrastructure across states can also influence the number of fatalities, which could potentially explain the outliers when predicting with length and width.

This is a great repository, and I thoroughly enjoyed reading your analysis. It's evident that a lot of effort and time has been put into structuring the repository and setting up instructions to make it easier for users to follow. Well done!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: ajackman2

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[X] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[ ] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[X] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[X] Importance: Do the authors clearly state the importance for this research question?
[ ] Background: Do the authors provide sufficient background information so that readers can understand the report?
[X] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:

Overall, the report is well laid out and explained. The repository is very well-organized and labeled, good job! You obviously worked very hard on the report and made a great final product. Below I have made some comments about things I think you can improve.

In the abstract I think it would be beneficial to explain what RMSPE is in a little more depth so that anyone unfamiliar with it can understand what your RMSPE values mean. This will help people who read the report understand what your models are doing better.
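As a rough illustration of the kind of explanation I mean, here is a minimal sketch, assuming RMSPE refers to the root mean squared prediction error: the square root of the average squared difference between observed and predicted fatalities on held-out data. The function name `rmspe` and all of the numbers are made up for illustration and are not taken from your analysis; it is written in Python rather than R only to keep it self-contained.

```python
import math


def rmspe(actual, predicted):
    """Root mean squared prediction error: sqrt(mean((y - y_hat)^2))."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fatality counts and model predictions, for illustration only.
actual_fatalities = [0, 1, 0, 3]
predicted_fatalities = [0.5, 0.5, 0.5, 2.5]

score = rmspe(actual_fatalities, predicted_fatalities)
print(score)
```

Because the score is in the same units as the response, a sentence in the abstract could say that an RMSPE of a given size means predictions are typically off by roughly that many fatalities, which makes "low RMSPE" easier to interpret.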

It would be nice to link to the license in the README.md file, so users can easily navigate to the license if they wish to view it.

I noticed that in the 'tests' folder there is a file 'vdiffr.Rout.fail'. I'm not sure what this file is or what its purpose is.

I don't see a PDF version of the final report rendered in the docs folder. The steps for creating a PDF version seem to be missing from your Makefile and qmd.

I also tried to follow the link in your report to the 'tornado_fatalities_predictory.ipynb', and it is no longer a valid link as you have split the code into multiple scripts. Consider changing this link to point to the 'src' folder.

When I run your tests, one of them fails: "Failure (test-accuracy_plot.R:6:3): refactoring our code should not change our plot Snapshot of testcase to 'accuracy_plot/accuracy-plot.svg' has changed".

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: nehamenon704

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[ ] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:

Overall, I found the topic very interesting and unique, and the analysis was conducted very well! I was able to follow along with all steps and navigate through the repository easily. In addition, putting author and date information on function files was very helpful, and provides an accessible record of when each file was written, arguably easier than parsing through the commit history.

Here are my items of feedback on parts that could be improved:

Location of final report: After running the analysis, I expected the report to be in the results directory, but found it in the docs directory. I would recommend either merging these two directories or renaming the docs directory to something like reports. This would help to enhance repository organization and avoid confusion with anyone running your analysis.
Tag the version of quarto: While going through the repository and the Dockerfile, I noticed that the version of quarto was not pinned. To ensure that the container can be built/run on other people’s systems, I would recommend tagging this version.
Usage instructions: These instructions are very well-written, and I like the level of detail. However, to ensure that all the information is provided, I would recommend providing the code needed for someone to clone the repository (git clone …). All the other instructions have the required code below, and to help with consistency, adding it for this one would help.

Again, a really good effort and coherent analysis, good job!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.
