DSCI-310 / data-analysis-review-2021


Submission: 11: Predicting Burned Area of Forest Fires Using k-NN Regression #11

ttimbers opened this issue 2 years ago

ttimbers commented 2 years ago

Submitting authors: @samzzzzzh @a-kong @Jaskaran1116

Repository: https://github.com/DSCI-310/DSCI-310-Group-11

Abstract/executive summary: A wildfire is an uncontrolled fire that starts in wildland vegetation and spreads quickly through the landscape. A natural occurrence, such as a lightning strike, or a human-made spark can easily ignite a wildfire and wipe out millions of properties. However, the extent to which a wildfire spreads is frequently determined by weather conditions. Wind, heat, and a lack of rain may dry out trees, bushes, fallen leaves, and limbs, making them excellent fuel for a fire. In this project, we wish to predict the burned area of forests from several environmental factors with a k-NN regression model. By establishing a transparent link between these factors and the burned area, it is possible to identify potential risk factors and take appropriate safeguards to prevent the emergence of forest fires and the disasters they generate.

Editor: @ttimbers

Reviewers: @gzzen @alexkhadr @mcloses @snowwang99

gzzen commented 2 years ago

Data analysis review checklist

Reviewer: @gzzen

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

In short, good work! The entire project fits the framework stated in the outline, and following the instructions on the README page allowed me to successfully reproduce the analysis. The report is well structured and supported by sufficient empirical evidence and references.

I had a couple of thoughts and suggestions while reading the report that might be worth considering.

  1. Although I’m able to view the HTML version of the final report within the notebook directory, duplicating a copy of the HTML file in the results directory could make it friendlier for someone else to find what they need. I achieved this by adding a line of cp [path_of_html] results/[report_name] to the Makefile; I hope it helps (see the sketch after this list).
  2. In the report, the pairs from Figure 1 with large correlation values are extracted into Figure 2. However, there seems to be no explanation of what counts as "quite large". Is it beyond a specific threshold (and if so, what is the threshold's value)? Or was some sort of hypothesis test with a certain significance level used? Some further clarification would make the report more trustworthy on this point.
  3. Also, in the correlation matrix in Figure 2, some of the plots might be excessive and unnecessary. If I understand the report correctly, only 5 pairs of variables have a "strong correlation", but Figure 2 displays 21 pairs. For example, ISI and RH with a correlation of -0.150, ISI and DC with 0.216, or even the density plots of the variables could be redundant given the purpose of Figure 2. In this case, a correlation matrix may not be the best choice. Instead, I would recommend drawing a scatterplot for each of the 5 pairs of "strongly correlated" variables, and if possible, adding a regression line can make the plot nicer (but it is not necessary); a ggplot2 sketch follows this list.
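Regarding point 1, a minimal sketch of the Makefile rule I have in mind; the file names and paths here are assumptions, not the project's actual ones:

```make
# Hypothetical rule: copy the rendered report into results/ so readers
# find it next to the other outputs. Substitute the project's real paths.
# (Note: the recipe line must start with a tab.)
results/final_report.html : notebook/final_report.html
	cp notebook/final_report.html results/final_report.html
```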
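And regarding point 3, a rough ggplot2 sketch of one such scatterplot. The data frame here is simulated only so the snippet runs on its own; in the project it would be the cleaned forest fires data, and ISI/DC stand in for one of the correlated pairs mentioned above:

```r
library(ggplot2)

# `fires` is a simulated placeholder for the project's cleaned data frame.
set.seed(310)
fires <- data.frame(DC = runif(100, 0, 800))
fires$ISI <- 0.01 * fires$DC + rnorm(100, sd = 2)

# One scatterplot per strongly correlated pair, e.g. ISI vs. DC:
ggplot(fires, aes(x = DC, y = ISI)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +  # optional regression line
  labs(title = "ISI vs. DC", x = "DC", y = "ISI")
```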

Attribution

This was derived from the [JOSE review checklist](https://openjournals.readthedocs.io/en/jose/review_checklist.html) and the ROpenSci review checklist.

mcloses commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 45min

Review Comments:

Great job! The repository was clear to explore, and the instructions on the initial page easily allowed me to reproduce the analysis using `make` and to explore the code in the Docker environment. Some personal changes I might suggest:

  1. I usually like my .R scripts to be callable from other code, so it might help in the future if each script had a main function that executes the actual worker function [ function main( ) calls function data_splitting( ) ]. That way a script can be executed from the command line, but also be imported into other code that wants to call that second function directly [ another script being able to call data_splitting(url, out_dir) ]; see the sketch after this list.

  2. In the report analysis I feel there is a sudden jump from the EDA to the model evaluation. Reading through it, going from exploring the variables to finding the best k can be confusing, since which model is used, and why, is not introduced at that point. You discuss it further in the Discussion section, but it might be worth mentioning in between, as I feel it helps the "narration" of the analysis flow better.

  3. Some of the references in the analysis report were hard to tie to the actual analysis. Mentioning them during the analysis, at the point where they are used, might make it easier to relate them to the project when reading through them after finishing the report.
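On point 1, here is a minimal sketch of that pattern; the function and argument names are assumptions based on the comment above, not the project's actual code:

```r
# data_splitting.R -- hypothetical worker function; the body is a
# placeholder for the script's real logic.
data_splitting <- function(url, out_dir) {
  message("Splitting data from ", url, " into ", out_dir)
}

main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  data_splitting(url = args[1], out_dir = args[2])
}

# Runs only when executed via `Rscript data_splitting.R <url> <out_dir>`;
# when the file is source()d from another script, only the functions are
# defined, and data_splitting() can then be called directly.
if (sys.nframe() == 0) {
  main()
}
```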

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

alexkhadr commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour 30 mins

Review Comments:

  1. When trying to reproduce the project, I ran into an issue that states "the input device is not a TTY. If you are using mintty, try prefixing the command with 'winpty'" when running the docker run command. It worked when I added "winpty" in front of the "docker run -it --rm -p 8888:8888 -v /$(pwd):/opt/notebooks a0kay/dsci-310-group-11 make -C /opt/notebooks" command, but it would be useful to add a statement in the README explaining that users should prefix the command with winpty if they run into the same issue; see the snippet after this list.

  2. The test scripts do not have any documentation. All the scripts that contain functions do, but I think adding documentation to the test scripts would help readers understand what the tests do and the purpose of each one; a sketch follows this list.

  3. The overall writing was very interesting, and I enjoyed reading through the analysis. The only thing that was a little hard to read was the table in the Methods and Results section: the numbers and column names are very close to each other.

  4. Overall I thought the project was very well done. I only ran into the small issue above when reproducing it, and it was a very easy fix. The research question is very interesting, and very good background is provided on the research topic. Very well done on the project, and great job by the team.
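For point 1, the README note could be as short as the following; the docker command is the one already given above, and only the winpty prefix is new:

```sh
# On Windows under mintty (e.g. Git Bash), prefix the command with winpty:
winpty docker run -it --rm -p 8888:8888 -v /$(pwd):/opt/notebooks a0kay/dsci-310-group-11 make -C /opt/notebooks
```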
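And for point 2, a sketch of what a documented test could look like. The function, file, and expectation here are hypothetical, purely to illustrate the commenting style:

```r
library(testthat)

# Hypothetical stand-in for the project's data_splitting() function.
data_splitting <- function(df, prop = 0.75) {
  n_train <- floor(prop * nrow(df))
  idx <- sample(seq_len(nrow(df)), n_train)
  list(train = df[idx, , drop = FALSE], test = df[-idx, , drop = FALSE])
}

# Purpose: guard against accidental row loss or duplication when the
# cleaned data is split into train and test sets.
test_that("data_splitting keeps every row exactly once", {
  df <- data.frame(x = 1:100)
  split <- data_splitting(df, prop = 0.75)
  expect_equal(nrow(split$train) + nrow(split$test), nrow(df))
})
```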

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

snowwang99 commented 2 years ago

Data analysis review checklist

Reviewer: snowwang99

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

Overall, this data analysis is really impressive. The README file is clear, and the summary nicely sums up what you have done in the analysis, which I like very much. Here are several suggestions after running the files:

  1. I tried to run the analysis by following the README instructions through Docker, but on my end it doesn't seem to work. Maybe you can double-check the Docker instructions so that reviewers can run the analysis that way.

  2. For the Dataset Information, I don't think you need to list all of the variables. Instead, you can point out which variables you think are useful and will be used in the analysis; when all the variables are listed, it is not easy for the reader to read and remember them all.

  3. For the dataset folder, all the datasets can be viewed directly and easily. Each data file looks clean and has a clear name that distinguishes it from the others. The datasets are easy to archive and access, which is a really good point for viewers who have questions about the analysis.

  4. For the data analysis part, it is really nice to have a ggplot graph. But I suggest the collaborators first indicate why they chose those variables from the ggplot graph, and only then wrangle the data, so that viewers and readers have a comprehensive understanding.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.