UBC-MDS / data-analysis-review-2021


Submission: Group 03: Coffee Quality Predictor #19

Open michelle-wms opened 2 years ago

michelle-wms commented 2 years ago

Submitting authors: @berkaybulut @khbunyan @arlincherian @michelle-wms

Repository: https://github.com/UBC-MDS/DSCI_522_GROUP3_COFFEERATINGS
Report link: https://rpubs.com/acherian/840439

Abstract/executive summary: In this analysis, we attempt to find a supervised machine learning model that uses the features of the Coffee Quality Dataset, collected by the Coffee Quality Institute in January 2018, to predict the quality of a cup of arabica coffee, answering the research question: given a set of characteristics, what is the quality of a cup of arabica coffee?

We begin our analysis by exploring the natural inferential sub-question of which features correlate strongly with coffee quality, which will help to inform our secondary inferential sub-question: which features are most influential in determining coffee quality? We then begin to build our models for testing.

After initially exploring regression-based models (Ridge Regression and Random Forest Regressor), our analysis pivoted to re-processing the data and exploring classification models. As you will see in our analysis below, predicting a continuous target variable proved quite difficult with many nonlinear features, and the results were not very interpretable in terms of what we were trying to predict. Broadening the target variable by transforming it into two classes, “Good” and “Poor”, based on a threshold at the median, helped with these issues.
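A minimal sketch of the median-threshold binarization described above, assuming the scores live in a pandas column (the column name and sample values here are illustrative, not taken from the actual dataset):

```python
import pandas as pd

# Illustrative scores; the real analysis uses the Coffee Quality Dataset
scores = pd.DataFrame({"total_cup_points": [78.5, 82.0, 85.3, 80.1, 83.7]})

# Binarize the continuous quality score at the median:
# "Good" at or above the median, "Poor" below it
threshold = scores["total_cup_points"].median()
scores["quality_class"] = scores["total_cup_points"].apply(
    lambda x: "Good" if x >= threshold else "Poor"
)
print(scores["quality_class"].tolist())
```

Whether the boundary value (a score exactly at the median) counts as “Good” or “Poor” is a design choice the analysis would need to state explicitly.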

Our final model, using Random Forest Classification, performed moderately on an unseen test data set, with an ROC AUC score of 0.67. We recommend continuing to study and improve this prediction model before it is put to any use, as incorrectly classifying the quality of coffee could have a large economic impact on a producer's income. We describe how one might do that at the end of our analysis.
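A hedged sketch of how a classifier like this might be scored on held-out data with scikit-learn (the data here are synthetic stand-ins; the actual pipeline is in the linked repository):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the coffee features and Good-vs-Poor labels
X, y = make_classification(n_samples=500, n_features=8, random_state=522)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=522)

model = RandomForestClassifier(random_state=522).fit(X_train, y_train)

# ROC AUC is computed from the predicted probability of the positive class,
# not the hard class labels
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(round(auc, 2))
```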

Editor: @flor14 Reviewer:

imtvwy commented 2 years ago

Data analysis review checklist

Reviewer: @imtvwy (Vanessa Yuen)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. It would be handy if your team could prepare an environment.yaml file so that others can set up the environment with the dependent packages listed in the README.
  2. There is a sub-folder /images in the /results folder, but image files are found in both folders, which is a bit confusing for readers navigating the folder structure.
  3. The code is well-documented with comments and modularized functions such that readers can follow the logic easily.
  4. The wording in the final report's 'Results and Discussion' section does not seem to align with the results in the figure shown above. A similar problem appears in the results in the README.
  5. The final report is very well-written. I particularly like the Conclusion section with the shortcomings of your current model as well as the ideas for future improvement.
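As a sketch of the environment.yaml suggestion in point 1, a conda environment file might look like this (the environment name, package names, and versions below are illustrative, not taken from the project's README):

```yaml
# environment.yaml -- illustrative only; pin the packages the project actually uses
name: coffee-ratings
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas
  - scikit-learn
  - altair
```

Others could then recreate the environment with `conda env create -f environment.yaml`.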

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

kphaterp commented 2 years ago

Data analysis review checklist

Reviewer: @kphaterp (Kiran Phaterpekar)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 Hours

Review Comments:


  1. After reviewing your scripts, it appears that this group may be missing automated tests that verify whether a function works as intended. Specifically, including tests with the assert statement in Python will make your functions more robust and easier to interpret. This note refers to why I did not check the box under Code Quality: Tests.
  2. Although the dependencies are clearly listed in the README file, be sure to also include an environment.yaml file that is accessible in the root of the repository. This will make your project more reproducible. This comment refers to why I did not check-mark the box under Reproducibility: Conditions.
  3. For Figure 1 in the Analysis section of the report, I personally think that the x-axis title and the overall title should be human readable (without underscores). Even though this is EDA, I believe that it is important to make the overall title and x-axis title as readable as possible when including plots in reports that are intended to be read by the general public.
  4. For Figure 2 in the Analysis section of the report, I personally believe that the title could be updated. It's nice to have a takeaway point from a figure in the title. At the very least, the title should be slightly more descriptive. Again this is EDA, so it's definitely not a dealbreaker but it will make the analysis section of the report easier to read and understand.
  5. Although this is not as important as the other points, I think that renaming your GitHub repository might be worth considering. In particular, I would avoid including DSCI_522_GROUP3 in the repository's name, as it is not really relevant to the project. Again, this is a minor point and just a personal preference of mine, so take this advice with a grain of salt.
  6. The report, in my opinion, is outstanding. It is beautifully written, easy to follow along, and tells an engaging story about coffee. Something that really stood out to me is the introduction and how it sets the stage for your analysis. The introduction immediately ropes me in and tells me exactly why predicting coffee quality is important, and which population this impacts the most. This is something that is missing in my own project, so this has truly inspired me to emulate this in the project I am involved in.
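To illustrate the assert-style tests suggested in point 1, a sketch might look like the following (the function name and behaviour are hypothetical, not taken from the project's scripts):

```python
def label_quality(score, threshold):
    """Label a coffee score as 'Good' or 'Poor' relative to a threshold."""
    return "Good" if score >= threshold else "Poor"

# Simple assert-based tests verifying the function works as intended
assert label_quality(85.0, 82.0) == "Good"
assert label_quality(78.0, 82.0) == "Poor"
assert label_quality(82.0, 82.0) == "Good"  # boundary case labelled Good here
```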


lynnwbl commented 2 years ago

Data analysis review checklist

Reviewer: @lynnwbl (Lynn Wu)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:



shyan0903 commented 2 years ago

Data analysis review checklist

Reviewer: @shyan0903 (Irene Yan)

Conflict of interest

Code of Conduct

General checks

Code quality


khbunyan commented 2 years ago

Hello Everyone,

Thank you @shyan0903 @lynnwbl @imtvwy and @kphaterp for your feedback, we really appreciate you taking the time to give us your thoughts. We have integrated the following changes into our project: