UBC-MDS / data-analysis-review-2022


Submission: Group 05: chocolate_rating #16

Open eyrexh opened 1 year ago

eyrexh commented 1 year ago

Submitting authors: @markusnam @robindhillon1 @eyrexh @LishaGG

Repository: https://github.com/UBC-MDS/chocolate_rating

Report link: https://github.com/UBC-MDS/chocolate_rating/blob/main/doc/chocolate_rating.html

Abstract/executive summary: Here we attempt to build a numeric chocolate rating prediction model by evaluating Support Vector Regression (SVR) and Ridge models on chocolate-related data such as manufacturer, country of bean origin, cocoa percentage and most memorable characteristics. Our best model (SVR) performs fairly well on an unseen test data set: the mean absolute percentage error (MAPE) of SVR is 7.99%, compared with 8.22% for the Ridge model. From examining the coefficients generated by the Ridge model, we found that the “raspberry” flavour characteristic and the “Fruition” chocolate manufacturer have the highest positive coefficients, while the “medicinal” and “chemical” flavour characteristics have the lowest negative coefficients. The data set used in this project was compiled by Brady Brelinski of the Manhattan Chocolate Society, and can be sourced here. Each row in the data set represents an observation of a chocolate product with information such as manufacturer, company location, review date, country of bean origin, specific bean origin or bar name, cocoa percent, ingredients, most memorable characteristics and rating.
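The SVR-vs-Ridge comparison described in the abstract can be sketched with scikit-learn. This is a minimal illustration on synthetic data only (the feature matrix and target below are stand-ins, not the chocolate data set or the authors' actual pipeline):

```python
# Minimal sketch of comparing SVR and Ridge by MAPE on held-out data.
# Synthetic stand-in data; not the real chocolate features or targets.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # stand-in for encoded features
y = 3.0 + X @ rng.normal(size=5) + rng.normal(scale=0.2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("SVR", SVR()), ("Ridge", Ridge())]:
    model.fit(X_train, y_train)
    mape = mean_absolute_percentage_error(y_test, model.predict(X_test))
    print(f"{name}: MAPE = {mape:.2%}")
```

With the real data, the same loop structure lets both models be scored on identical train/test splits, which is what makes the 7.99% vs 8.22% comparison meaningful.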

Editor: @flor14 Reviewers: Spencer Gerlach, Austin Shih, Dhruvi Nishar, Alexander Taciuk

spencergerlach commented 1 year ago

Data analysis review checklist

Reviewer: spencergerlach

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

Things I liked a lot:

Suggested Improvements:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

austin-shih commented 1 year ago

Data analysis review checklist

Reviewer: austin-shih

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

  1. Good, concise introduction to the project and what is to be predicted. The use of a visual flowchart helps to further convey the thought process and analysis steps. I would avoid using the term 'Golden Rule' unless it is explicitly defined; this term is only used in the context of this program's ML course and may be confusing to other readers.
  2. Could go into a little more detail on why the MAPE score is used as the prediction metric. Compare and contrast with other metrics and explain how they would affect the results.
  3. It might be a good idea to include a small sample of the data set in the report; it would make the 'DATA' section more intuitive and easier to follow. The preprocessing steps should also be mentioned somewhere in the report to give the reader a better understanding of which features the project deems important.
  4. Very good use of visuals in presenting the results and comparing predictions from different models. One thing to note on using the MAPE score: the prediction scores may not be very intuitive for people without an ML background, since it reports an error rather than an accuracy. The 'Future Improvements' section gives very good suggestions for further development of this project.
  5. Overall a very good project. The question statement is clear and gives an adequate explanation of how to get to a result. There appear to be many more figures in the results folder, which means a lot more insight could be added to the report.
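On point 2 above, the trade-offs between metrics are easy to show side by side. A quick illustrative sketch (the ratings below are hypothetical, not taken from the report):

```python
# Illustrative comparison of MAPE against MAE and RMSE on the same
# hypothetical predictions. MAPE is scale-free (a percentage), MAE is
# in rating units, and RMSE penalizes large errors more heavily.
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

y_true = np.array([3.25, 3.50, 2.75, 4.00, 3.00])  # hypothetical ratings
y_pred = np.array([3.00, 3.75, 3.00, 3.50, 3.10])  # hypothetical predictions

mape = mean_absolute_percentage_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5
print(f"MAPE={mape:.2%}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```

Reporting MAE alongside MAPE would also let readers interpret the error directly in rating units.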

Good job!

ataciuk commented 1 year ago

Data analysis review checklist

Reviewer: @ataciuk

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

Things that were good about the project:

  1. Big fan of the pipeline flow chart.
  2. The repo is well organized.
  3. The EDA was informative and useful at feeding into the model.

Suggestions for improvement:

  1. Your research question should be more prominent. It is in the middle of the intro paragraph – it should be either the top or the bottom of that paragraph and ideally bolded or otherwise highlighted. The reader shouldn't have to work to find it.
  2. Your writing flips between first person and passive voice; I recommend using first person. "We did hyperparameter optimization via..." is more effective than the passive "The hyperparameter optimization is done via...".
  3. The scripts could use fewer options. For example, the summary script could use just two options: the output directory and the input directory where files are stored. The script could then refer to the specific file names in the main function. This would reduce the possibility of a mistake when running the script.
  4. I would add the report as a Markdown file, not just the .html, as .html files don't render properly on GitHub.
  5. In the dependencies section of the README, I would note that there is a conda environment in /src/ for quick reference.

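Suggestion 3 above might look something like the following. This is only a sketch; the script and file names are hypothetical, not the repo's actual ones:

```python
# Sketch of a summary script that exposes only two CLI options and
# resolves specific file names itself inside main(). File names below
# are assumptions for illustration.
import argparse
import os

def main(input_dir, output_dir):
    # The script owns its file names, so callers can't mistype them.
    train_path = os.path.join(input_dir, "train.csv")        # assumed name
    summary_path = os.path.join(output_dir, "summary.csv")   # assumed name
    print(f"reading {train_path}, writing {summary_path}")

parser = argparse.ArgumentParser(description="Summarize chocolate ratings")
parser.add_argument("--input-dir", required=True, help="directory with input files")
parser.add_argument("--output-dir", required=True, help="directory for outputs")

# Example invocation with an explicit argv list:
args = parser.parse_args(["--input-dir", "data", "--output-dir", "results"])
main(args.input_dir, args.output_dir)
```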
dhruvinishar commented 1 year ago

Reviewer: dhruvinishar

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

  1. I really like the project so far. The project workflow is very well designed, and it is easy to recreate and reproduce the results of the analysis thanks to very clear instructions on how to install dependencies and set up the analysis. The EDA is also very well designed and conveys the results in an informative way that is easy to follow.
  2. The research question, however, is not clearly defined or emphasized in the report introduction or in the findings of the data analysis.
  3. Adding tests for your code would also help improve its reproducibility.
  4. I would also have liked to know more about why you chose to report the MAPE and what the MAPE scores indicate about your analysis and results. Interpreting the results with MAPE scores and explaining what they mean would help convey your results to a more general audience as well.
  5. Some scripts do not have code abstracted into functions. For instance, the code in the main functions of model_svr.py and rating_eda.py could be broken into smaller functions instead of living entirely in main(). This would give the scripts a more structured style.
  6. The project analysis report is otherwise very well documented: it covers all the required rubrics, and the visuals are very helpful. The data set you chose is also very interesting, and the results were presented in an extremely clean and concise format.
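To illustrate point 5 above, a monolithic main() can be split into small, independently testable steps. The function names here are hypothetical, not taken from model_svr.py or rating_eda.py:

```python
# Sketch of pulling steps out of a monolithic main() into small
# functions, each of which can be unit-tested on its own.
import csv
import io

def load_ratings(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_ratings(rows):
    """Keep only rows that have a non-empty rating."""
    return [r for r in rows if r.get("rating")]

def mean_rating(rows):
    """Average the rating column."""
    return sum(float(r["rating"]) for r in rows) / len(rows)

def main(text):
    rows = clean_ratings(load_ratings(text))
    print(f"{len(rows)} rows, mean rating {mean_rating(rows):.2f}")

sample = "company,rating\nFruition,4.0\nAcme,\nOther,3.0\n"
main(sample)
```

This structure also pairs naturally with point 3: each helper can get its own pytest case without invoking the whole script.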

eyrexh commented 1 year ago