Open dorisyycai opened 7 months ago
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
1) Looking very good overall. There are just a few avoidable typos in the report, e.g. "cliamte" (Summary), "curren" (Summary), and "cylical" (Section 2.3).
2) Consider explaining why you chose the F1 score as your metric, or what the F1 score represents. In Fig. 5 you included values for recall, precision, and support, but never mention them in the report except for the F1 score, which makes those columns redundant in my opinion.
3) You included an environment file in the root of your repo, but there are no instructions for how to use it, so I would suggest moving it to an archive folder (?), as the dependencies in the environment file seem to exclude click, jupyter-book, and so on. Well done!
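On point 2, the F1 score is the harmonic mean of precision and recall, which is also why reporting all three side by side can feel redundant. A minimal sketch with made-up confusion-matrix counts (illustrative numbers, not the project's actual results):

```python
# F1 balances false positives and false negatives, which matters when
# "rain" days are the minority class. Hypothetical counts only:
tp, fp, fn = 80, 10, 15

precision = tp / (tp + fp)  # of predicted-rain days, how many were rain
recall = tp / (tp + fn)     # of actual rain days, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
# -> 0.889 0.842 0.865
```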
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Comment marks a general comment; FIX marks an issue which you should (if you are so inclined) fix in the Milestone 4 submission.
[x] Repository: Is the source code for this data analysis available?
Comment: The source code is available – scripts are in the scripts directory and the .ipynb file is in the reports directory.
FIX: Is the notebooks directory required? The report is rendered by the Jupyter Book in reports, so perhaps this folder can be removed?
Is the repository well organized and easy to navigate?
Comment: The repo is well organized in a logical way, i.e. data, docs, notebooks, reports, results, src, test. The expected standalone files are present, i.e. .gitignore, CONTRIBUTING, Dockerfile, LICENSE, README.md, code of conduct, docker-compose, and the environment.yaml file.
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
FIX: This project uses the MIT license which protects code only. A Creative Commons license should be used as well to cover the report / writing section of the repo.
[x] Installation instructions: Is there a clearly stated list of dependencies?
Comment: Looks good! All dependencies are listed in the Dockerfile.
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
Comment: Yes, the authors list how to use the software in the Usage section. The set-up and analysis instructions are clear.
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
Comment: Yes, the functionality is documented and clear.
[x] Community guidelines: Are there clear guidelines for third parties wishing to:
1) Contribute to the software
2) Report issues or problems with the software
3) Seek support
FIX: It is not clear who to reach out to in the case that users have questions or need help. It doesn’t seem like Tiffany has this listed in her example repo either, but perhaps you could list a contact within the README or CONTRIBUTING file.
FIX: Your code_of_conduct.md references Tiffany’s email for issues. This should be changed to one of the project team members.
[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
Comment: Code is well documented with comments.
FIX: A full docstring (examples, input data types, etc.) might help users to understand how to use the scripts more clearly. This is relevant for the classification.py, drop_split_preprocess.py, and eda.py scripts.
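To illustrate the kind of docstring meant here, a numpydoc-style sketch for a hypothetical helper (the function name, parameters, and behavior are invented for illustration, not taken from the team's scripts):

```python
def drop_and_split(rows, drop_cols, test_frac=0.2):
    """Drop unused columns and split records into train/test sets.

    Parameters
    ----------
    rows : list of dict
        Weather observations, one dict per day.
    drop_cols : list of str
        Keys to remove from every record before splitting.
    test_frac : float, optional
        Fraction of records held out for testing (default 0.2).

    Returns
    -------
    tuple of list of dict
        (train, test) splits with drop_cols removed.

    Examples
    --------
    >>> train, test = drop_and_split(days, ["station_id"], test_frac=0.25)
    """
    # Remove the unwanted keys, then split at the train/test boundary.
    kept = [{k: v for k, v in r.items() if k not in drop_cols} for r in rows]
    cut = int(len(kept) * (1 - test_frac))
    return kept[:cut], kept[cut:]
```

The Parameters/Returns/Examples sections are what make a script self-documenting for a new user.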
[x] Style guidelines: Does the code adhere to well known language style guides?
Comment: As mentioned above, I recommend adding full docstrings.
[x] Modularity: Is the code suitably abstracted into scripts and functions?
Comment: Yes. No code left in the report.
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
Comment: Tests can be run from the terminal and are of excellent quality! Great work.
[x] Data: Is the raw data archived somewhere? Is it accessible?
Comment: Yes, in the data folder.
[x] Computational methods: Is all the source code required for the data analysis available?
Comment: Yes, all code is available and reproducible.
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
Comment: Required software is listed in the Dockerfile, which is referenced in the README. All required software can be loaded with docker compose up.
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?
Comment: I was able to enter the Docker container and reproduce the results from the JupyterLab terminal. Great work! Instructions are easy to follow; tests, scripts, and the Jupyter Book are working. Do we also need instructions to reproduce the analysis in a virtual environment? Perhaps something to consider adding.
[x] Authors: Does the report include a list of authors with their affiliations?
Comment: Yes, authors are listed, but their affiliations (UBC) are not.
[x] What is the question: Do the authors clearly state the research question being asked?
Comment: Yes: "Our project investigates the prediction of daily precipitation in Vancouver using machine learning methods. Using a dataset spanning from 1990 to 2023, we explored the predictive power of some key environmental and climate features such as temperature, wind speed, and evapotranspiration."
[x] Importance: Do the authors clearly state the importance for this research question?
Comment: Yes, agriculture, water management, etc. Great topic!
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
Comment: Yes. This is a topic relatable to everyday people and the report is understandable.
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
Comment: Yes, e.g. transforming the cyclical weather data into sine/cosine features.
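For readers unfamiliar with this trick, a minimal sketch of a sine/cosine month encoding (illustrative only, not the authors' exact code): mapping months onto the unit circle makes December and January neighbors in feature space, unlike the raw 1–12 integers.

```python
import math

def encode_month(month):
    """Map month 1-12 onto the unit circle so adjacent months,
    including December -> January, stay close in feature space."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)

# December (330 degrees) lands right next to January (0 degrees):
jan, dec, jul = encode_month(1), encode_month(12), encode_month(7)
```

With the raw integer encoding, December (12) and January (1) are maximally far apart even though they are climatically adjacent; the circular encoding removes that artificial seam.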
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
Comment: Yes, in tabular form.
[x] Conclusions: Are the conclusions presented by the authors correct?
Comment: Yes, the conclusions seem to be correct.
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
FIX: All references should have a DOI:
[NTHJ01] https://doi.org/10.1002/joc.680.
[OGarciaSSCM14] https://doi.org/10.1016/j.atmosres.2014.01.012.
[x] Writing quality: Is the writing of good quality, concise, engaging?
FIX: There are a few grammatical errors in the report which could be fixed, e.g. “Hyperparameter optimization did not make improvement to our curren model, indicating the potential need for feature engineering or incoportating more features.” and “The performace of each model is plotted below”.
FIX: Your Fig. 3 has its x-axis labels a bit cut off; I suggest re-sizing the image slightly. Your Fig. 5 has quite a lot of white-space padding; I suggest resizing that image as well.
Hi team, your analysis was engaging and a pleasure to review. The report is comprehensive and clear, offering sufficient information to grasp the analysis, supported by clear justifications for your methodologies.
Here are a few suggestions you might want to consider:
Why was the f1 score used as your accuracy metric? Have you considered using any other scores?
Submitting authors: @wqxxzd @dorisyycai @yhan178 @sivakornchong
Repository: https://github.com/UBC-MDS/RaincouverPrediction
Report link: https://ubc-mds.github.io/RaincouverPrediction/raincouver_prediction_report3.html
Abstract/executive summary: Our project investigates the prediction of daily precipitation in Vancouver using machine learning methods. Using a dataset spanning from 1990 to 2023, we explored the predictive power of some key environmental and cliamte features such as temperature, wind speed, and evapotranspiration. Our results suggest the best classification model is Support Vector Machine with Radial Basis Function (SVM RBF) model with the hyperparameter C=10.0. The model achieved a notable F1 score of 0.87 on the positive class (precipitation is present) when generalized to the unseen data, suggesting a high accuracy in precipitation prediction. We also explored feature importance, showing ET₀ reference evapotranspiration and the cosine transformation of months as robust predictors. Hyperparameter optimization did not make improvement to our curren model, indicating the potential need for feature engineering or incoportating more features. Our preject presents a reliable model for predicting precipitation with potential practical applications in various fields.
Editor: @ttimbers Reviewers: Sharon Voon, Anu Banga, Jenny Lee, Alysen Townsley