Submission: Group 14: Predicting the renewable electricity output of different countries

Submitting authors: Caden Chan, Neha Menon, Peter Chen & Tak Sripratak

Repository: https://github.com/DSCI-310-2024/DSCI310-Group14/tree/v3.0.0

Abstract/executive summary:

As a complex issue, climate change doesn't have a single cause, though the burning of fossil fuels is a large source of greenhouse gases and has had detrimental effects. Our analysis explores whether a subset of renewable-energy-related World Development Indicators, together with a simple linear regression model, can be used to predict the renewable electricity output of countries throughout the world. Our analysis produced a model with a Root Mean Squared Error (RMSE) of 23.74. The model predicted most cases accurately, though some predictions were far from the actual values. It also predicted a negative renewable electricity output for some countries, which demonstrates the need for a more complex analysis using advanced machine learning methods. With a more advanced machine learning model, the capacity of countries to produce more renewable electricity, given their other World Development Indicators, could be estimated and used to inform country-specific and global goals and targets.

Editor: @ttimbers

Reviewers: Hanyu Dai, Sana Shams, Daniel Lima, Stephanie Ta

[ ] I agree to abide by DSCI 310's Code of Conduct during the review process.

Data analysis review checklist

Reviewer: Stephanie-Ta

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[ ] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[ ] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[ ] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality (review in progress)

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[ ] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

Overall, I believe your project is well done! I'm just nit-picking with my criticisms since there aren't any apparent major problems!

At first, it is unclear why there is an src folder inside the scripts folder. Perhaps you could have a single src folder in the root containing an analysis-scripts folder and a functions folder, which would make the project organization clearer.
Please include a list of dependencies in the README.md, including Docker Desktop!
At first, I was unable to run your analysis when I ran the command $ docker-compose run --rm final-analysis-env make clean. It turns out that I forgot to start up Docker Desktop before running the command, which is why I ran into the issue! You may want to include that as a step in the Usage section of your README.md.
It is quite hard to read the axis labels of Figure 1. It would be nice if the image was bigger or if the graphs used a bigger font!
I like how you guys have documentation for each test in test_cleaning_data.py, test_eda.py, test_impute_split.py, and test_linear_regression.py. It would be really nice to do the same for each test in test_datareading.py!
I also like your NumPy style docstrings for create_scatter_plots() and impute_split(). They provide valuable information about those functions. It may be extra helpful if some examples were added in the docstrings!
The functions clean_data(), reading_data(), split_xy_columns(), and plot_rmse() would benefit from NumPy style docstrings too!
I noticed that there is some commented-out code in scripts/eda.py and scripts/readingdata.py. If it's not needed, you could remove it to improve the 'cleanliness' of the code!
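
To illustrate the docstring suggestions above, here is a minimal sketch of a NumPy style docstring with an Examples section, followed by a documented test. The signature and body of split_xy_columns() shown here are my own assumptions for illustration and are not copied from your repository; your real function most likely works on pandas objects rather than plain lists.

```python
def split_xy_columns(rows, target):
    """
    Split tabular records into explanatory values and a target column.

    NOTE: hypothetical signature for illustration only; the real
    split_xy_columns() in the repository may take different arguments.

    Parameters
    ----------
    rows : list of dict
        One dictionary per observation, mapping column name to value.
    target : str
        Name of the column to predict.

    Returns
    -------
    tuple of (list of dict, list)
        The records without the target column, and the target values.

    Examples
    --------
    >>> rows = [{"gdp": 1.0, "output": 10.0}, {"gdp": 2.0, "output": 20.0}]
    >>> X, y = split_xy_columns(rows, "output")
    >>> X
    [{'gdp': 1.0}, {'gdp': 2.0}]
    >>> y
    [10.0, 20.0]
    """
    # Keep every column except the target in X
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    # Collect the target values in the same order as the rows
    y = [row[target] for row in rows]
    return X, y


def test_split_xy_columns_removes_target():
    """The target column should be absent from X and returned, in order, as y."""
    X, y = split_xy_columns([{"a": 1, "b": 2}], "b")
    assert X == [{"a": 1}]
    assert y == [2]
```

A side benefit of the Examples section is that it can be checked automatically, for instance with python -m doctest on the module, so the documentation is less likely to drift from the code.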

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Daniel Lima

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[ ] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[ ] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1

Review Comments:

First of all, good job on your analysis!

Some things I would pay attention to:

Prune unnecessary material, such as your environment files and the comments within certain portions of your code or your Dockerfile.
Create a separate directory for your test data; the test directory seems cluttered with everything placed directly under it.
Rename some of your files so that they describe their contents more accurately (e.g., WDICSV), or so that they follow a conventional naming style (e.g., functionread.py -> functionRead.py, readingdata.py -> readingData.py).
As the previous reviewer mentioned, place the src folder in your root directory, not under scripts.
This is minor, but make your CONTRIBUTING.MD much more detailed. Take the template contributing file as an example: it is far more extensive and lists specific contact methods for addressing issues.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Sana Shams

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[ ] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[ ] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:

Great work on your analysis, especially the clarity with which the final report explains all components and the motivation of your analysis! Here are some issues and areas of improvement I was able to identify, with my suggestions for each:

In the PDF rendering of your report, Figure 1 is cut off horizontally from the fourth column onward. When you generate the visualization, consider manually setting the number of columns to 2 or 3 so that it fits the standard 8.5 x 11 inch page size in the PDF rendering. Figure 1 is also missing a title. While the axis labels are descriptive, a title such as “Renewable Electricity Output vs Explanatory Variables of Interest” would be a good idea.
The text in Table 1 does not render correctly in the PDF rendering and is cut off horizontally like Figure 1. Numbers and letters overlap, making the table difficult to read. Table 1 also takes up 2.5 pages, which I believe is not intended, given that the HTML rendering has a horizontal scroll feature and the PDF does not. Since the analysis chooses 2015 as the most recent year (mentioned in Methods and Results, Step 3), it might be a good idea to remove the later years from the table to help it fit in the PDF rendering.
You can cut down on repetition in the Dockerfile. What you have right now works perfectly fine, but doing the following might help with redundancy and neatness! I've included an example of a side-by-side comparison so you can see the difference.
- Instead of running RUN conda install [insert package] for each package, you can write:
  - RUN conda install --yes \ [insert package] \ [insert package] \ [insert package]
- Instead of running RUN apt-get update [insert package] several times, you can write:
  - RUN apt-get update && apt-get install -y [insert package] \ [insert package] \ [insert package]
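
Returning to the Figure 1 layout comment above, here is a rough sketch of how the number of columns could be capped and a title added. The function names grid_shape() and plot_grid(), their arguments, and the 7.5 inch figure width are all hypothetical choices of mine, not taken from your create_scatter_plots() function, so please treat this as one possible approach under those assumptions.

```python
import math


def grid_shape(n_panels, max_cols=3):
    """Return (n_rows, n_cols) for n_panels subplots, at most max_cols wide."""
    if n_panels < 1:
        raise ValueError("n_panels must be at least 1")
    n_cols = min(n_panels, max_cols)
    n_rows = math.ceil(n_panels / n_cols)
    return n_rows, n_cols


def plot_grid(data, columns, target, title):
    """Scatter each column in `columns` against `target` on a capped grid.

    Hypothetical helper: `data` is assumed to map column names to
    equal-length sequences (a dict of lists or a pandas DataFrame).
    """
    # Imported here so grid_shape() stays usable without matplotlib installed
    import matplotlib.pyplot as plt

    n_rows, n_cols = grid_shape(len(columns))
    # 7.5 inches wide is an assumed width that leaves margins on a letter page
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(7.5, 2.5 * n_rows), squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, columns):
        ax.scatter(data[col], data[target])
        ax.set_xlabel(col, fontsize=12)
        ax.set_ylabel(target, fontsize=12)
    # Hide any unused panels in the last row
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)
    fig.suptitle(title, fontsize=14)
    fig.tight_layout()
    return fig
```

With seven explanatory variables, for example, grid_shape(7) gives a 3 x 3 grid, so no row is wider than three panels.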

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.
