UBC-MDS / data-analysis-review-2021


Submission: Group 03: Coffee Quality Predictor #19

Open michelle-wms opened 2 years ago

michelle-wms commented 2 years ago

Submitting authors: @berkaybulut @khbunyan @arlincherian @michelle-wms

Repository: https://github.com/UBC-MDS/DSCI_522_GROUP3_COFFEERATINGS
Report link: https://rpubs.com/acherian/840439

Abstract/executive summary: In this analysis, we attempt to find a supervised machine learning model that uses the features of the Coffee Quality Dataset, collected by the Coffee Quality Institute in January 2018, to predict the quality of a cup of arabica coffee, answering the research question: given a set of characteristics, what is the quality of a cup of arabica coffee?

We begin our analysis by exploring the natural inferential sub-question of which features correlate strongly with coffee quality, which will help to inform our secondary inferential sub-question: which features are most influential in determining coffee quality? We then begin to build our models for testing.

After initially exploring regression-based models (Ridge Regression and Random Forest Regressor), our analysis pivoted to re-processing the data and exploring classification models. As you will see in our analysis below, predicting a continuous target variable proved quite difficult with many nonlinear features, and the results were not very interpretable in terms of what we were trying to predict. Broadening the target variable by transforming it into two classes, “Good” and “Poor”, based on a threshold at the median, helped with these issues.
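A minimal sketch of the median-threshold binarization described above, assuming the scores live in a pandas column (the column name and sample values here are illustrative, not taken from the actual dataset):

```python
import pandas as pd

# Illustrative scores; the real analysis uses the Coffee Quality Dataset
scores = pd.DataFrame({"total_cup_points": [78.5, 82.0, 85.3, 80.1, 83.7]})

# Binarize the continuous quality score at the median:
# "Good" at or above the median, "Poor" below it
threshold = scores["total_cup_points"].median()
scores["quality_class"] = scores["total_cup_points"].apply(
    lambda x: "Good" if x >= threshold else "Poor"
)
print(scores["quality_class"].tolist())
```

Whether the boundary value (a score exactly at the median) counts as “Good” or “Poor” is a design choice the analysis would need to state explicitly.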

Our final model, using Random Forest Classification, performed moderately on an unseen test data set, with an ROC AUC score of 0.67. We recommend continuing to study and improve this prediction model before it is put to any use, as incorrectly classifying the quality of coffee could have a large economic impact on a producer's income. We describe how one might do that at the end of our analysis.
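A hedged sketch of how a classifier like this might be scored on held-out data with scikit-learn (the data here are synthetic stand-ins; the actual pipeline is in the linked repository):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the coffee features and Good-vs-Poor labels
X, y = make_classification(n_samples=500, n_features=8, random_state=522)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=522)

model = RandomForestClassifier(random_state=522).fit(X_train, y_train)

# ROC AUC is computed from the predicted probability of the positive class,
# not the hard class labels
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(round(auc, 2))
```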

Editor: @flor14 Reviewer:

imtvwy commented 2 years ago

Data analysis review checklist

Reviewer: @imtvwy (Vanessa Yuen)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. It would be handy if your team could prepare an environment.yaml file so that others can set up the environment with the dependent packages listed in the README.
  2. There is a sub-folder /images in the /results folder, but image files are found in both folders, which is a bit confusing for readers navigating the folder structure.
  3. The code is well-documented with comments and modularized functions such that readers can follow the logic easily.
  4. The wording in the final report's 'Results and Discussion' section does not seem to align with the results in the figure shown above. A similar problem appears in the results in the README.
  5. The final report is very well-written. I particularly like the Conclusion section with the shortcomings of your current model as well as the ideas for future improvement.
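As a sketch of the environment.yaml suggestion in point 1, a conda environment file might look like this (the environment name, package names, and versions below are illustrative, not taken from the project's README):

```yaml
# environment.yaml -- illustrative only; pin the packages the project actually uses
name: coffee-ratings
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas
  - scikit-learn
  - altair
```

Others could then recreate the environment with `conda env create -f environment.yaml`.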

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

kphaterp commented 2 years ago

Data analysis review checklist

Reviewer: @kphaterp (Kiran Phaterpekar)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 Hours

Review Comments:


  1. After reviewing your scripts, it appears that this group may be missing automated tests that verify whether a function works as intended. Specifically, including tests with the assert statement in Python will make your functions more robust and easier to interpret. This note refers to why I did not check the box under Code Quality: Tests.
  2. Although the dependencies are clearly listed in the README file, be sure to also include an environment.yaml file that is accessible in the root of the repository. This will make your project more reproducible. This comment refers to why I did not check-mark the box under Reproducibility: Conditions.
  3. For Figure 1 in the Analysis section of the report, I personally think that the x-axis title and the overall title should be human readable (without underscores). Even though this is EDA, I believe that it is important to make the overall title and x-axis title as readable as possible when including plots in reports that are intended to be read by the general public.
  4. For Figure 2 in the Analysis section of the report, I personally believe that the title could be updated. It's nice to have a takeaway point from a figure in the title. At the very least, the title should be slightly more descriptive. Again this is EDA, so it's definitely not a dealbreaker but it will make the analysis section of the report easier to read and understand.
  5. Although this is not as important as the other points, I think that renaming your GitHub repository might be worth considering. In particular, I would avoid including DSCI_522_GROUP3 in the repository's name, as it is not really relevant to the project. Again, this is a minor point and just a personal preference of mine, so take this advice with a grain of salt.
  6. The report, in my opinion, is outstanding. It is beautifully written, easy to follow along, and tells an engaging story about coffee. Something that really stood out to me is the introduction and how it sets the stage for your analysis. The introduction immediately ropes me in and tells me exactly why predicting coffee quality is important, and which population this impacts the most. This is something that is missing in my own project, so this has truly inspired me to emulate this in the project I am involved in.
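To illustrate the assert-style tests suggested in point 1, a sketch might look like the following (the function name and behaviour are hypothetical, not taken from the project's scripts):

```python
def label_quality(score, threshold):
    """Label a coffee score as 'Good' or 'Poor' relative to a threshold."""
    return "Good" if score >= threshold else "Poor"

# Simple assert-based tests verifying the function works as intended
assert label_quality(85.0, 82.0) == "Good"
assert label_quality(78.0, 82.0) == "Poor"
assert label_quality(82.0, 82.0) == "Good"  # boundary case labelled Good here
```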


lynnwbl commented 2 years ago

Data analysis review checklist

Reviewer: @lynnwbl (Lynn Wu)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:



shyan0903 commented 2 years ago

Data analysis review checklist

Reviewer: @shyan0903 (Irene Yan)

Conflict of interest

Code of Conduct

General checks

Code quality


khbunyan commented 2 years ago

Hello Everyone,

Thank you @shyan0903 @lynnwbl @imtvwy and @kphaterp for your feedback, we really appreciate you taking the time to give us your thoughts. We have integrated the following changes into our project: