DSCI-310 / data-analysis-review-2021


Submission: 11: Predicting Burned Area of Forest Fires Using k-NN Regression #11

ttimbers opened this issue 2 years ago

ttimbers commented 2 years ago

Submitting authors: @samzzzzzh @a-kong @Jaskaran1116

Repository: https://github.com/DSCI-310/DSCI-310-Group-11

Abstract/executive summary: A wildfire is an uncontrolled fire that starts in wildland vegetation and spreads quickly through the landscape. A natural occurrence, such as a lightning strike, or a human-made spark can easily ignite a wildfire and wipe out millions of properties. However, the extent to which a wildfire spreads is frequently determined by weather conditions. Wind, heat, and a lack of rain may dry out trees, bushes, fallen leaves, and limbs, making them excellent fuel for a fire. In this project, we wish to predict the burned area of forests from several environmental factors with a k-NN regression model. By establishing a transparent link between these factors and the burned area, it is possible to identify potential risk factors and take appropriate safeguards to prevent the emergence of forest fires and the disasters they generate.

Editor: @ttimbers

Reviewers: @gzzen @alexkhadr @mcloses @snowwang99

gzzen commented 2 years ago

Data analysis review checklist

Reviewer: @gzzen

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

In short, good work! The entire project fits the framework stated in the outline, and following the instructions on the README page allowed me to successfully reproduce the analysis. The report is well structured and supported by sufficient empirical evidence and references.

I had a couple of thoughts and suggestions while reading the report that might be worth considering.

  1. Although I’m able to view the HTML version of the final report within the notebook directory, duplicating a copy of the HTML file in the results directory could make it friendlier for someone else to find what they need. I achieved this by adding a line of cp [path_of_html] results/[report_name] to the Makefile; I hope it helps (see the sketch after this list).
  2. In the report, the pairs from Figure 1 with large correlation values are extracted into Figure 2. However, there seems to be no explanation of what counts as "quite large". Is it beyond a specific threshold (and if so, what is the threshold's value)? Or was some sort of hypothesis test with a certain significance level used? Some further clarification would make the report more trustworthy on this point.
  3. Also, in the correlation matrix in Figure 2, some of the plots might be excessive and unnecessary. If I understand the report correctly, only 5 pairs of variables have a "strong correlation", but Figure 2 displays 21 pairs. For example, ISI and RH with a correlation of -0.150, ISI and DC with 0.216, or even the density plots of the variables could be redundant given the purpose of Figure 2. In this case, a correlation matrix may not be the best choice. Instead, I would recommend drawing a scatterplot for each of the 5 pairs of "strongly correlated" variables, and if possible, adding a regression line can make the plot nicer (but it is not necessary); a ggplot2 sketch follows this list.
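Regarding point 1, a minimal sketch of the Makefile rule I have in mind; the file names and paths here are assumptions, not the project's actual ones:

```make
# Hypothetical rule: copy the rendered report into results/ so readers
# find it next to the other outputs. Substitute the project's real paths.
# (Note: the recipe line must start with a tab.)
results/final_report.html : notebook/final_report.html
	cp notebook/final_report.html results/final_report.html
```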
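And regarding point 3, a rough ggplot2 sketch of one such scatterplot. The data frame here is simulated only so the snippet runs on its own; in the project it would be the cleaned forest fires data, and ISI/DC stand in for one of the correlated pairs mentioned above:

```r
library(ggplot2)

# `fires` is a simulated placeholder for the project's cleaned data frame.
set.seed(310)
fires <- data.frame(DC = runif(100, 0, 800))
fires$ISI <- 0.01 * fires$DC + rnorm(100, sd = 2)

# One scatterplot per strongly correlated pair, e.g. ISI vs. DC:
ggplot(fires, aes(x = DC, y = ISI)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +  # optional regression line
  labs(title = "ISI vs. DC", x = "DC", y = "ISI")
```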

Attribution

This was derived from the [JOSE review checklist](https://openjournals.readthedocs.io/en/jose/review_checklist.html) and the ROpenSci review checklist.

mcloses commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 45min

Review Comments:

Great job! The repository was clear to explore, and the instructions on the initial page easily allowed me to reproduce the analysis using `make` and to explore the code in the Docker environment. Some personal changes I might suggest:

  1. I usually like my .R scripts to be callable from other code, so it might help in the future if each script had a main function that executes the actual worker function [ function main( ) calls function data_splitting( ) ]. That way a script can be executed from the command line, but also be imported into other code that wants to call that second function directly [ another script being able to call data_splitting(url, out_dir) ]; see the sketch after this list.

  2. In the report analysis I feel there is a sudden jump from the EDA to the model evaluation. Reading through it, going from exploring the variables to finding the best k can be confusing, since which model is used, and why, is not introduced at that point. You discuss it further in the Discussion section, but it might be worth mentioning in between, as I feel it helps the "narration" of the analysis flow better.

  3. Some of the references in the analysis report were hard to tie to the actual analysis. Mentioning them during the analysis, at the point where they are used, might make it easier to relate them to the project when reading through them after finishing the report.
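On point 1, here is a minimal sketch of that pattern; the function and argument names are assumptions based on the comment above, not the project's actual code:

```r
# data_splitting.R -- hypothetical worker function; the body is a
# placeholder for the script's real logic.
data_splitting <- function(url, out_dir) {
  message("Splitting data from ", url, " into ", out_dir)
}

main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  data_splitting(url = args[1], out_dir = args[2])
}

# Runs only when executed via `Rscript data_splitting.R <url> <out_dir>`;
# when the file is source()d from another script, only the functions are
# defined, and data_splitting() can then be called directly.
if (sys.nframe() == 0) {
  main()
}
```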

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

alexkhadr commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour 30 mins

Review Comments:

  1. When trying to reproduce the project, I ran into an issue that states "the input device is not a TTY. If you are using mintty, try prefixing the command with 'winpty'" when running the docker run command. It worked when I added "winpty" in front of the "docker run -it --rm -p 8888:8888 -v /$(pwd):/opt/notebooks a0kay/dsci-310-group-11 make -C /opt/notebooks" command, but it would be useful to add a statement in the README explaining that users should prefix the command with winpty if they run into the same issue; see the snippet after this list.

  2. The test scripts do not have any documentation. All the scripts that contain functions do, but I think adding documentation to the test scripts would help readers understand what the tests do and the purpose of each one; a sketch follows this list.

  3. The overall writing was very interesting, and I enjoyed reading through the analysis. The only thing that was a little hard to read was the table in the Methods and Results section: the numbers and column names are very close to each other.

  4. Overall I thought the project was very well done. I only ran into the small issue above when reproducing it, and it was a very easy fix. The research question is very interesting, and very good background is provided on the research topic. Very well done on the project, and great job by the team.
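For point 1, the README note could be as short as the following; the docker command is the one already given above, and only the winpty prefix is new:

```sh
# On Windows under mintty (e.g. Git Bash), prefix the command with winpty:
winpty docker run -it --rm -p 8888:8888 -v /$(pwd):/opt/notebooks a0kay/dsci-310-group-11 make -C /opt/notebooks
```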
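And for point 2, a sketch of what a documented test could look like. The function, file, and expectation here are hypothetical, purely to illustrate the commenting style:

```r
library(testthat)

# Hypothetical stand-in for the project's data_splitting() function.
data_splitting <- function(df, prop = 0.75) {
  n_train <- floor(prop * nrow(df))
  idx <- sample(seq_len(nrow(df)), n_train)
  list(train = df[idx, , drop = FALSE], test = df[-idx, , drop = FALSE])
}

# Purpose: guard against accidental row loss or duplication when the
# cleaned data is split into train and test sets.
test_that("data_splitting keeps every row exactly once", {
  df <- data.frame(x = 1:100)
  split <- data_splitting(df, prop = 0.75)
  expect_equal(nrow(split$train) + nrow(split$test), nrow(df))
})
```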

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

snowwang99 commented 2 years ago

Data analysis review checklist

Reviewer: snowwang99

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

Overall, this data analysis is really impressive. The README file is clear, and the summary nicely sums up what you have done in the analysis, which I like very much. Here are several suggestions after running the files:

  1. I tried to run the analysis by following the README instructions through Docker, but on my end it doesn't seem to work. Maybe you can double-check the Docker instructions so that reviewers can run the analysis that way.

  2. For the Dataset Information, I don't think you need to list all of the variables. Instead, you can point out which variables you think are useful and will be used in the analysis; when all the variables are listed, it is not easy for the reader to read and remember them all.

  3. For the dataset folder, all the datasets can be viewed directly and easily. Each data file looks clean and has a clear name that distinguishes it from the others. The datasets are easy to archive and access, which is a really good point for viewers who have questions about the analysis.

  4. For the data analysis part, it is really nice to have a ggplot graph. But I suggest the collaborators first indicate why they chose those variables from the ggplot graph, and only then wrangle the data, so that viewers and readers have a comprehensive understanding.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.