Submission: GROUP 4: energy_efficiency_analysis

Submitting authors: @Suraporn @YHuUBC @MNBhat

Repository: https://github.com/UBC-MDS/energy_efficiency_analysis

Report link: https://github.com/UBC-MDS/energy_efficiency_analysis/blob/main/doc/energy_report_rmd.Rmd

Abstract/executive summary: Building towers or other structures is no longer difficult for those who can afford it, but making a building both memorable and efficient is another story. When planning new towers or skyscrapers, it would help to know exactly which building parameters relate to energy efficiency. With that knowledge, we could design buildings that are not only magnificent but also renowned for their energy efficiency.

In this project, we aim to answer the following questions:

Given building-related features such as 'Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height', 'Orientation', 'Glazing Area', and 'Glazing Area Distribution', how accurately can we predict the 'Heating Load' of the building?
How much does each feature contribute to the 'Heating Load' of the building?

Editor: @flor14 Reviewers: Ziyi Chen, Caroline Tang, Shirley Zhang

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: Caroline Tang (@carolinetang77)

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[ ] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[ ] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[ ] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Designing energy-efficient buildings is an interesting topic, and it's fun to think about how these different factors might affect a building's energy usage. However, I don't have a background in engineering or physics, so it would help if the report explained what the features (e.g. 'Glazing Area' and 'Glazing Area Distribution') and the target mean (e.g. What is 'Heating Load'? Is a high heating load good or bad?).
The directory organization looks great! I especially like the models folder with the subfolders for each model used.
The overall analysis looks good, but unfortunately I wasn't able to recreate the analysis in an automated way. Specifically, I wasn't able to replicate the environment using the yaml file, likely due to differences in operating systems. When exporting your conda environment yaml file, try using conda env export --from-history to avoid this issue. The other lines of code in the usage section seem to work fine though.
The scripts are very well documented and easy to read/follow! Great work on that!
There are some grammatical errors in the final report, particularly in the steps of the EDA.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Lennon Au-Yeung @lennonay

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[ ] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[ ] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[ ] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 3 hours

Review Comments:

For the Download data section in the report markdown, it would be better to wrap the command in backticks so that users can copy and paste the line more easily.
For the EDA sections, the authors could add some comments on the distribution of the different features so that readers can follow the rationale behind data transformations such as scaling.
It was nice that all the models used are saved in different folders, and the file organization overall was very clear.
It took me some time to understand Figure 5. It might be clearer to sort the data by heating load value, or to show the absolute error for each observation.
For the markdown file of the report, it would be better to make sure that all bullet points start with a capital letter so that they are standardized.
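
To make the Figure 5 suggestion concrete, here is a minimal sketch of how the observations could be sorted by heating load with an absolute error column added. The numbers and the helper name `rows_sorted_by_load` are invented for illustration and are not taken from the repository.

```python
# Hypothetical values for illustration only; these are not results from the project.
actual = [15.5, 32.1, 24.8, 11.2]
predicted = [16.0, 30.9, 25.1, 11.0]


def rows_sorted_by_load(actual, predicted):
    """Return (actual, predicted, absolute error) tuples sorted by actual load."""
    rows = [(a, p, abs(a - p)) for a, p in zip(actual, predicted)]
    # Sorting by the actual value puts similar buildings next to each other.
    return sorted(rows, key=lambda row: row[0])


for a, p, err in rows_sorted_by_load(actual, predicted):
    print(f"actual={a:5.1f}  predicted={p:5.1f}  abs_error={err:4.2f}")
```

Plotting the third column against the first would show at a glance where the model's errors are largest.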

Overall, I congratulate the authors on successfully building a model from start to finish. Well done!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Ziyi Chen (zchen156)

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[ ] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[ ] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

The proposal section is nicely written, and the question you wish to answer through your analysis is clearly stated in a separate section, giving readers a solid basis to build on as they progress. The usage section is fairly simple to understand, and all of the code runs. Good job!
1. It would be nice if further information were supplied for the eight features linked to the building construction. Because readers may not be familiar with architecture vocabulary, adding short descriptions of Relative Compactness, Glazing Area, and Glazing Area Distribution would help with understanding. Providing the units of these features would also give readers a better general sense of them.
2. It's great that the final report is available in all formats. It's also a good idea to include instructions on how to use scripts for each section of the report.
3. I discovered differences between the Jupyter notebook version and the Markdown version in the EDA section of the final report. In the Jupyter notebook, the graphic "Correlations among the variables in this study" is missing. The Markdown file does not cover the steps that check the data types and display the descriptions of all 8 features. Furthermore, the pairwise scatter plots for the 8 features do not show the associations between features very clearly. The plot of "Correlations among Variables in this Study" performs better. Perhaps just keep one of them.
4. The variety of models chosen for the analysis is outstanding. The modeling and analysis section states that the Random Forest, Decision Tree, and XGB models all perform equally well. Perhaps it would be better to elaborate on why XGB gets the greatest train and test scores.
5. One of the milestone 2 requirements is to "cite at least four external sources." In addition to the NumPy package, Python, the pandas package, the Altair package, and the original dataset referenced in the final report, three other external sources could be included as a future enhancement.
6. Looking at the scripts in the src directory, I noticed that inline comments are sparse, which makes the code a bit harder to follow for those who want to contribute to the project. It would be great if more inline comments were added to the scripts, as well as docstrings for functions.
7. Check grammar and spelling (for example, in the final report, # pairwsie scatter plots).
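
As an illustration of point 6, a function with a docstring and a short inline comment could look like the sketch below. The function `split_features_target` and its arguments are hypothetical and do not come from the src scripts; the docstring layout follows the common NumPy style, which is only one possible choice.

```python
def split_features_target(rows, target="Heating Load"):
    """Separate the target column from the feature columns.

    Parameters
    ----------
    rows : list of dict
        One dictionary per observation, keyed by column name.
    target : str
        Name of the column to predict.

    Returns
    -------
    tuple of (list of dict, list)
        Feature dictionaries and the matching target values.
    """
    # Build new dictionaries without the target so the input rows are not mutated.
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return features, targets
```

A contributor can then read `help(split_features_target)` instead of tracing the body to learn what goes in and what comes out.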

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Shirley Zhang @shlrley

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[ ] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[X] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[X] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[ ] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 3 hours

Review Comments:

This is such a fascinating topic and dataset, and I'm really impressed by what you guys have done to explore your research question! I'm especially impressed that you went outside the scope of our MDS courses and used the "XG Boost" model. I had never heard of it before, but it looks super interesting.

1) The 'Usage' section of the README.md is very clear and easy to follow. However, perhaps you could include a line indicating how to specifically navigate to the cloned repository. For example, instead of "Navigate to your local repository", you could write:

cd energy_efficiency_analysis

This makes it much clearer what 'local repository' refers to, and ensures the user will start in the right directory.

2) All of the scripts are overall very well commented and easy to follow. However, I noticed that there is some repetition and redundancy of the script descriptions. For example, first there are comments (with #) which describe what the script does, then more descriptions inside """. Perhaps the first few comments could be deleted to be more concise.

3) I liked how modularized each script is, and I think all of the names are very concise and easy to understand. There is however one script that I think could be made more clear. The name of download.py is a bit confusing as there are already data_preprocess.py and download_data.py. If my interpretation is correct, it looks like this script converts an excel file to a csv file. Perhaps the script could be renamed to reflect this?

Furthermore, the description for the purpose of this script (inside of the """) does not seem to match up with what it is doing.

4) Inside of the eda_script_plots_update.py and model_predict.py scripts, it may be better to separate various operations inside of the main function into individual functions outside of the main function (increasing modularization). For example, creating a separate function for creating the plots.

Furthermore, it would be best not to define a function inside of the main function (i.e the save_chart function from Joel Ostblom is defined inside of the main function but could be defined outside).
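
To sketch what I mean in point 4, the layout below keeps the helpers at module level and leaves the main function to orchestrate them. The function bodies are simplified stand-ins that I made up (they write a small JSON file instead of a real chart), so the names `make_plot_spec` and this version of `save_chart` are assumptions rather than the code in your scripts.

```python
import json
import tempfile
from pathlib import Path


def make_plot_spec(title):
    """Build a plot description (hypothetical stand-in for the real plotting code)."""
    return {"title": title, "mark": "point"}


def save_chart(spec, path):
    """Write a plot description to disk (stand-in for the actual save_chart helper)."""
    path = Path(path)
    path.write_text(json.dumps(spec))
    return path


def main(out_dir):
    """Orchestrate the steps by calling module-level helpers only."""
    spec = make_plot_spec("Correlations among variables")
    return save_chart(spec, Path(out_dir) / "correlation.json")


if __name__ == "__main__":
    # Use a temporary directory so the example leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        print(main(tmp).name)
```

With this layout, each helper can be imported and tested on its own, which is not possible while it is defined inside the main function.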

5) The report is super interesting and I loved that you documented very clearly the steps to follow to recreate the analyses! The plots were also very nice additions. I would love to see a bit more discussion and interpretation of the figures you created in the EDA stage, and how that fits into answering your research question (i.e why did you choose to include these specific plots?).

6) It's great that you guys included a variety of different models. Although the XGBoost model performs very well, perhaps it might still be interesting to look at some hyperparameter optimization for some of your models.

7) The table headings are correctly placed above each table, but I believe you should move the figure titles under each figure.

8) There are a lot of limitations listed, which shows that you guys have thought a lot about your analysis and where your model would not generalize well.

9) I would love to see more justification for why certain methodology was chosen, for example giving more context on the XGBoost model, how you found it, and its relevance. The same goes for why you did not use hyperparameter optimization.

Overall, congratulations on your project so far, I'm excited to see the final product!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Dear reviewers,

We appreciate your constructive feedback. Based on your feedback, we made the following changes.

We created tests for the scripts: https://github.com/UBC-MDS/energy_efficiency_analysis/commit/504119d20e558f64d59f44e663a493081d93a5f2
We moved sub-functions out from the main function in scripts: https://github.com/UBC-MDS/energy_efficiency_analysis/commit/4583620c17e492d41e06117fa3157b7fac95788c
We revised the CONTRIBUTING file: https://github.com/UBC-MDS/energy_efficiency_analysis/commit/3a557d87a0aca1bd1da4f13b48becb1091a6abc3
We recreated a reproducible environment: https://github.com/UBC-MDS/energy_efficiency_analysis/commit/2ec5a79068d6a8c7dd2c3c67e15817aa25a506b3
We added figure and table captions: https://github.com/UBC-MDS/energy_efficiency_analysis/commit/9331f4d5aa876f867bf265ce09525de1310267a8
We broke long scripts into shorter functions: https://github.com/UBC-MDS/energy_efficiency_analysis/commit/4583620c17e492d41e06117fa3157b7fac95788c

Thank you and we greatly appreciate your feedback.

UBC-MDS / data-analysis-review-2022