UBC-MDS / data-analysis-review-2021


Submission: Group 2 Forest Fire Prediction #30

Open voremargot opened 2 years ago

voremargot commented 2 years ago

Submitting Authors: @voremargot, Hatef Rahmani, @gauthampughaz, @Anahita97

Repo Link: https://github.com/UBC-MDS/forest-fire-area-prediction-group-2 Report Link: https://github.com/UBC-MDS/forest-fire-area-prediction-group-2/blob/dev/reports/Final_report.md

Summary: We have created a simple model to predict the size of forest fires using weather and soil moisture properties. We explore a data set from northeastern Portugal that contains spatial features, temporal features, soil moisture indices, and weather features to predict the size of wildfires within the Montesinho natural park. We fit a Support Vector Regression (SVR) model using the soil moisture variables, temperature, relative humidity, wind, spatial coordinates, and season. After removing outliers using Cook's Distance, we optimize our model using mean absolute error (MAE) and root mean square error (RMSE). Our optimized model, with C = 1.88 and γ = 0.48, produces an MAE of 8.686 and an RMSE of 28.46 on the unseen test data set, which is reasonable given that our area-burned values range from 0 to 1,090 ha.
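The workflow described above (an RBF-kernel SVR scored with MAE and RMSE) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the hyperparameters C = 1.88 and γ = 0.48 come from the summary, but the features and data below are synthetic stand-ins, and the Cook's Distance outlier-removal step is omitted.

```python
# Sketch of the modeling approach: RBF-kernel Support Vector Regression
# evaluated with MAE and RMSE. Hyperparameters C and gamma are the tuned
# values reported in the summary; the data here is synthetic, not the
# Montesinho data set.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
n = 200
# Hypothetical stand-ins for features like temperature, humidity, wind
X = rng.normal(size=(n, 3))
y = np.abs(2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for SVR, so wrap the estimator in a pipeline
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.88, gamma=0.48))
model.fit(X_train, y_train)

pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}")
```

As in the report, RMSE will be at least as large as MAE, since squaring weights large errors more heavily.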

Editor: @flor14 Reviewers: @Luming-ubc, @mahsasarafrazi, Aldo de Almeida Saltao Barros, Daniel King

Luming-ubc commented 2 years ago

Data analysis review checklist

Reviewer: @Luming-ubc

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:


Points done well:

Points that could be improved:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

mahsasarafrazi commented 2 years ago

Data analysis review checklist

Reviewer: @mahsasarafrazi

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. Nice and tidy repo; all the files are accessible and each part is organized well.

  2. The scripts, EDA, and coding are well designed and fully described.

  3. In the README, instead of having one section titled "Forest Fire Area Prediction", you could split it into separate, shorter sections such as "About the project", "Background on forest fires", and "Predictive question and sub-questions".

  4. Since you have an environment.yaml file, in the "Dependencies" section it would be better not to list all the packages; just explain how to set up the environment, since the rest is covered by the environment file. This would also make the summary of your report shorter.

  5. In your analysis, it would be helpful to have a brief section on how the results could be improved, with your suggestions and findings. There may be other models that work better which you could not use because of constraints, so anyone reproducing your results could try another model.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

aldojasb commented 2 years ago

Data analysis review checklist

Reviewer: @aldojasb

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Congrats to the team; the project overall is very tidy and easy to understand. I have just a couple of minor suggestions for this version:

  1. You could use a more powerful introduction to engage the audience with the importance of your work. Something like: "this project has the potential to protect XXX lives once we have a good predictor."
  2. I would use fewer charts in Figure 3. It's a bit confusing; you could select just a few features for this chart.
  3. The README file could be a little more succinct. But that's just personal taste; I prefer small, to-the-point README files.
  4. If possible, try to explain the features you are using in a very easy (and not so academic) way. I think that will be interesting for people who don't have an earth science background.
  5. Have you thought about using other models to compare with SVR? That would make your analysis more reliable and show that your model is really powerful.

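One way to act on this last suggestion would be a cross-validated comparison of SVR against a couple of other regressors. The sketch below uses synthetic data and arbitrarily chosen baseline models (Ridge and a random forest) purely for illustration; in the real project the training features and targets would be used instead.

```python
# Hypothetical model-comparison sketch: score several regressors with
# 5-fold cross-validated MAE on the same (here, synthetic) data.
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=150)

models = {
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "Ridge": make_pipeline(StandardScaler(), Ridge()),
    "RandomForest": RandomForestRegressor(random_state=0),
}
results = {}
for name, model in models.items():
    # scoring is negated so that higher is better; flip the sign back
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()
    print(f"{name}: mean CV MAE = {results[name]:.3f}")
```

Reporting all models' cross-validation scores side by side would support the claim that SVR is the strongest choice.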
danfke commented 2 years ago

Data analysis review checklist

Reviewer: @danfke

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:


Overall, the project is very interesting and the analysis clearly had a lot of thought put into it and was well executed!

Particularly well:

- The README and final report are very well written. There is a clear and admirable motive behind answering your research question, and a good explanation of the steps undertaken in the project and the results obtained.
- The use of statistical techniques to remove outliers is impressive.
- The charts are beautiful. I just have a slight concern with Figure 2, raised below.
- The final paragraph of the final report shows well-thought-out reflections on potential improvements and current shortcomings.

Could be improved:

- Very minor detail: the last sentence of the opening section of the README says "has almost zero correlation" when it should be "have almost zero correlation".
- The link to the final report in the README does not work.
- LaTeX doesn't render properly in the final report's analysis.
- The figure captions don't appear for any of the figures in the Results and Discussion section of the final report. I have the same problem in my report; Tiffany recommends either using PDF or HTML and creating a GitHub Pages site.
- It is hard to understand what is going on in Figure 2, specifically the overlapping box plots in the bottom chart. What do the colors represent? Would this be better as a stacked bar chart? Are the color separations necessary, or could each season just be a single color?
- It might be too strong a statement to say that hyperparameter tuning improved the models without mentioning that the training and validation errors are all within one standard deviation of each other before and after tuning. If the standard deviations are ignored, the training-validation gap for the MAE-optimized model actually increases slightly after tuning.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Anahita97 commented 2 years ago

Thanks, everyone, for your suggestions; we have tried our best to incorporate your feedback into our project.