DSCI-310-2024 / data-analysis-review-2024


Submission: Group 15: Wildfire Predictor #15

Open ttimbers opened 6 months ago

ttimbers commented 6 months ago

Submitting authors: Rahul Brar, Fiona Chang, Lillian Milroy, Darwin Zhang

Repository: https://github.com/DSCI-310-2024/dsci310-group-wildfire-predictor/releases/tag/Milestone-3

Abstract/executive summary:

In this analysis, we train a linear regression model capable of predicting wildfire intensity, measured by the geographic area affected by fires. The trained model performed well when making predictions on unseen data, exhibiting an RMSE of 72.954 and an R-squared score of 0.948.

We used data about Australian wildfires collected using thermal imaging technology and processed by IBM (Hamann and Schmude, 2021). The data was sourced from GitHub, and the specific CSV we used can be accessed here (Krook 2021). Each row in the dataset represents a day's worth of information about the number, spread, and intensity of fires within one of seven regions in Australia, dating back to 2005.

Editor: @ttimbers

Reviewers: Amar Gill, Riddhi Battu, Lucas Liu, Sid Ahuja

agill59 commented 6 months ago

Data analysis review checklist (WIP)

Reviewer: agill59

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

The environment file contains python.app, which was not available on the channels I had access to; I had to comment it out to create the environment.
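For reference, the build can usually be unblocked by commenting that entry out. A minimal sketch of the relevant environment.yml excerpt (the surrounding dependencies are illustrative, not the project's actual list):

```yaml
dependencies:
  - python=3.11
  # - python.app  # macOS-only conda package; unavailable on some channels
  - pandas
  - scikit-learn
```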

The data folder could be cleaned up (it contains empty directories).

The test suite in test_relevant_features.py could be more robust.
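To make the suggestion concrete, here is a sketch of what a more robust test might look like, assuming the module under test exposes a correlation-based feature selector. The function name `relevant_features`, its signature, and the 0.5 threshold are all hypothetical stand-ins, not the project's actual API:

```python
import pandas as pd

# Stand-in for the project's feature-selection helper; the real function
# in src/ may differ -- this exists only to make the test sketch runnable.
def relevant_features(df, target, threshold=0.5):
    """Return columns whose absolute correlation with `target` meets the threshold."""
    corr = df.corr()[target].drop(target).abs()
    return sorted(corr[corr >= threshold].index)

def test_returns_only_correlated_columns():
    df = pd.DataFrame({
        "x": [1, 2, 3, 4],      # perfectly correlated with target
        "noise": [1, 5, 2, 4],  # weakly correlated
        "y": [2, 4, 6, 8],
    })
    assert relevant_features(df, target="y") == ["x"]

def test_empty_result_when_nothing_correlates():
    df = pd.DataFrame({"noise": [1.0, -1.0, 1.0, -1.0], "y": [2, 4, 6, 8]})
    assert relevant_features(df, target="y") == []
```

Parametrizing over thresholds and adding edge cases (empty frames, all-NaN columns) would extend this further.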

The Docker commands given in the README do not work.


Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

sidahuja1 commented 5 months ago

Data analysis review checklist

Reviewer: sidahuja1

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

While the documentation for the environment and Docker setup was clear, I wasn't able to run make all or make clean when using docker-compose.

Documentation for the tests folder could also be added to the root README file.

test_relevant_features.py could be more comprehensive.

Some functions in src don't have documentation describing what they do

I tried rendering the qmd notebook via RStudio and could not see the images. It might be a good idea to include a rendered HTML or PDF in the reports folder.

The table in the report requires a subtitle.


SugarLucas commented 5 months ago

Data analysis review checklist (wildfire-predictor)

Reviewer: SugarLucas

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

  1. References should be included in the report.qmd file, even though they were provided in the ipynb file under the src folder.
  2. Pytest was not included in the environment.yml file or the Dockerfile, which causes trouble when running the tests.
  3. Some functions inside the src folder do not have proper documentation describing their parameters, return values, and examples.
  4. It would be nice to add usage examples for the functions in the src directory to the prepocessing.py and download_data.py scripts.
  5. In the README, the docker compose run make all command shows an error on my local machine, though the make clean command runs.
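To illustrate the docstring style being requested in items 3 and 4, here is a NumPy-style example on a made-up preprocessing helper (the function and its behaviour are hypothetical, not taken from src):

```python
import pandas as pd

def drop_missing_rows(df, columns):
    """Drop rows with missing values in the given columns.

    Parameters
    ----------
    df : pandas.DataFrame
        Raw wildfire data.
    columns : list of str
        Columns that must be non-missing for a row to be kept.

    Returns
    -------
    pandas.DataFrame
        A copy of ``df`` with the offending rows removed.

    Examples
    --------
    >>> df = pd.DataFrame({"area": [1.0, None], "region": ["NSW", "VIC"]})
    >>> drop_missing_rows(df, ["area"]).shape[0]
    1
    """
    return df.dropna(subset=columns).copy()
```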


riddhibattu commented 5 months ago

Data analysis review checklist

Reviewer: riddhibattu

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

General Observations
Your project on wildfire prediction presents a significant and timely analysis. The choice of dataset and the methodology applied demonstrate a thoughtful approach to an important environmental issue. Below, I offer constructive feedback aimed at enhancing the clarity, reproducibility, and overall impact of your work.

Technical and Documentation Improvements

  1. Jupyter Notebook Execution: The README.md file links to the wildfire-prediction.ipynb notebook, which displays an error related to the 'os' module and exhibits non-sequential execution (e.g., jumping from In [1] to In [28]). It is critical to restart and run all cells sequentially to ensure reproducibility and coherence for readers. Despite this, the analysis proceeds as expected when run manually via Jupyter Lab.

  2. Reference Documentation: I noticed several references (6 in total) lacking DOIs. Where DOIs are unavailable, including direct links to the references could enhance the report's credibility and utility.

  3. Report Accessibility: Providing the final report in PDF or HTML format, in addition to the Jupyter notebook, would greatly improve accessibility and readability.

  4. Build Commands: The make clean and make all commands did not execute successfully as per the instructions in the README.md. This issue might hinder the reproducibility of the analysis environment.

  5. Code Optimization: There are instances of unused package imports within the code. Streamlining these imports to include only necessary packages would enhance the code's efficiency and readability.

  6. Data Presentation: For the correlation matrix, consider using more descriptive names rather than abbreviations with underscores to improve readability and interpretation.

  7. Visualization Clarity: The correlation_matrix.png is partially cut off. Adjusting the image's dimensions could ensure the entire matrix is visible and interpretable.

  8. Quarto Document Rendering: Manual rendering of the QMD to PDF revealed issues with image display. Ensuring images render correctly in all document formats would greatly enhance the presentation quality.

  9. Navigability: Adding hyperlinks to tables within the Quarto document would improve navigability and reader understanding, especially when referencing specific data.
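On item 6, one low-effort option is to rename the abbreviated columns just before computing the matrix, so the plot shows descriptive labels while the underlying data is untouched. The abbreviations and labels below are hypothetical, not the dataset's actual headers:

```python
import pandas as pd

# Hypothetical abbreviation -> descriptive label mapping
readable = {
    "est_fire_area": "Estimated fire area",
    "mean_frp": "Mean fire radiative power",
    "count": "Fire count",
}

df = pd.DataFrame({
    "est_fire_area": [1.0, 2.0, 3.5],
    "mean_frp": [10.0, 22.0, 31.0],
    "count": [5, 9, 14],
})

# Rename only for display; both axes of the matrix pick up the new labels
corr = df.rename(columns=readable).corr()
```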

Errors and Solutions

Encountered errors related to file not found (404: Not Found) for several figures and the report PDF. These errors suggest issues with file paths or rendering processes. Ensuring accurate path references and successful rendering in both HTML and PDF formats would resolve these visibility issues.

  1. Test Data Clarification: An ambiguously named empty.zip was found within the tests directory. Renaming this to something more descriptive, such as test_data.zip, would clarify its purpose.

  2. Testing Documentation: Including specific instructions on how to run the tests would aid in validating the project's reliability and functionality.

Technical Specifications

  1. Dockerfile Versioning: The Dockerfile lacks specific versioning for 'make'. Specifying version numbers could prevent compatibility issues and ensure consistent environment replication.

  2. Environment Management: Similar to the Dockerfile, the environment.yml file would benefit from including specific package versions to ensure consistent, reproducible analysis environments across different setups.
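For instance, the conda entries could be pinned to exact versions (the packages and version numbers below are illustrative only, not the project's actual pins):

```yaml
dependencies:
  - python=3.11.8
  - pandas=2.2.1
  - scikit-learn=1.4.2
  - pytest=8.0.2
```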

Closing Thoughts

Overall, your project demonstrates a commendable effort in addressing a critical environmental concern. The analysis is well-conceived, and with the suggested improvements, its impact and accessibility could be significantly enhanced. I look forward to seeing the continued development of this important work.
