michelle-wms opened 2 years ago
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Using the `assert` statement in Python will make your functions more robust and easier to interpret. This note refers to why I did not check the box under Code Quality: Tests.
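For instance, here is a minimal sketch of assert-based testing; the function `rank_coffee` and its threshold are hypothetical examples, not code from this project:

```python
def rank_coffee(score):
    """Classify a coffee score as 'Good' or 'Poor' at a (hypothetical) threshold."""
    return "Good" if score >= 82.0 else "Poor"


def test_rank_coffee():
    # assert statements make the expected behavior explicit and
    # fail loudly when the function regresses
    assert rank_coffee(90.0) == "Good"
    assert rank_coffee(75.0) == "Poor"
    assert rank_coffee(82.0) == "Good"  # boundary case


test_rank_coffee()
```

Small test functions like this also document the intended behavior for future readers of the code.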
In my opinion, the project has done well on many points. The introduction of the report is very well written: I like how the authors provided context on the economic background of the coffee industry and the potential beneficial impact of this analysis. The authors also discussed trying different models and gave reasons for their final model selection. In addition, directions and possibilities for further analysis are laid out for readers. Overall, the report is well written, and the project is reproducible and inspiring.
Here are some of my suggestions on what could be improved:
It would be helpful to explain what the features `category_one_defects`, `category_two_defects`, and `quakers` mean.
- [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
- [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
You can improve the reference section, since not all software is cited in the references, such as the libraries knitr, kableExtra, and tidyverse.
I noticed that the figure captions are not consistent in style. For example, figure 1's caption is "Distibution of Target variable" while figure 2's is "Correlation heatmap of numeric features against target". The capitalization is inconsistent; keeping the captions in a consistent style would improve the report.
I find this sentence in the EDA section a little confusing: "The predictive model will learn the target data in this range". Right before this sentence, two ranges are mentioned, so it would be better to spell out for the audience which range you are referring to.
I think you did a great job introducing the topic, the dataset, and the methodology. It was easy to follow what you were doing.
In your results section, you explained the model performance and the important features. From my point of view, I would also want to know the threshold for each feature in the decision tree. You could visualize a decision tree to show more detailed information on each feature.
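One way to surface those split thresholds is to export the rules of a single tree from the fitted forest. This is only a sketch on synthetic data; the feature names are hypothetical, not the actual columns of the coffee dataset:

```python
# Sketch: printing the split thresholds of one tree from a random forest,
# using synthetic data and made-up feature names for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

feature_names = ["aroma", "flavor", "acidity", "body"]  # hypothetical names
# export_text renders the split rules (feature <= threshold) of one tree
rules = export_text(forest.estimators_[0], feature_names=feature_names)
print(rules)
```

Alternatively, `sklearn.tree.plot_tree` produces a graphical version of the same information, which could be embedded as a figure in the report.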
Hello Everyone,
Thank you @shyan0903 @lynnwbl @imtvwy and @kphaterp for your feedback, we really appreciate you taking the time to give us your thoughts. We have integrated the following changes into our project:
Regarding comment 1 in this issue:
Regarding comment 2 in this issue:
Regarding comments 3 & 4 in this issue:
Regarding comment 5 in this issue:
Regarding comment 1 in this issue:
Regarding comments 2 and 4 in this issue:
Regarding comment 3 in this issue:
Regarding comment 1 in this issue:
Regarding comment 3 in this issue:
Submitting authors: @berkaybulut @khbunyan @arlincherian @michelle-wms
Repository: https://github.com/UBC-MDS/DSCI_522_GROUP3_COFFEERATINGS
Report link: https://rpubs.com/acherian/840439
Abstract/executive summary: In this analysis, we attempt to find a supervised machine learning model which uses the features of the Coffee Quality Dataset, collected by the Coffee Quality Institute in January 2018, to predict the quality of a cup of arabica coffee, answering the research question: given a set of characteristics, what is the quality of a cup of arabica coffee?
We begin our analysis by exploring the natural inferential sub-question of which features correlate strongly with coffee quality, which will help to inform our secondary inferential sub-question: which features are most influential in determining coffee quality? We then begin to build our models for testing.
After initially exploring regression-based models (Ridge Regression and Random Forest Regressor), our analysis pivoted to re-processing our data and exploring classification models. As you will see in our analysis below, predicting a continuous target variable proved quite difficult with many nonlinear features, and the results were not very interpretable in a real sense of what we were trying to predict. Broadening the target variable and transforming it into two classes, "Good" and "Poor", based on a threshold at the median, helped with these issues.
Our final model, using Random Forest Classification, performed moderately on an unseen test data set, with an ROC score of 0.67. We recommend further study to improve this prediction model before it is put to any use, as incorrectly classifying the quality of coffee could have a large economic impact on a producer's income. We have described how one might do that at the end of our analysis.
Editor: @flor14
Reviewer: