UBC-MDS / data-analysis-review-2021


Submission: Group 09: Wine Quality Predictor #2

Open gfairbro opened 2 years ago

gfairbro commented 2 years ago

Submitting authors: gfairbro, paradise1260, Luming-ubc, GWYY

Repository: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor

Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/blob/main/reports/wine_quality_predictor_report/_build/html/report_summary.html

Abstract/executive summary: Wine is both an extremely popular, highly consumed product and one that can be very expensive to buy and lucrative to sell. It is also sold in much greater variety than almost any other consumer product; some supermarkets stock well over 1000 different wines (Lockshin, 2003).

At the same time, wine is one of the hardest products to judge for quality ahead of purchase, since you must consume it to decide. The level of quality a consumer requires can even vary widely depending on the consumption occasion (P. G. Quester and others).

The quality of wine, however, is difficult to evaluate objectively and relies on some very subjective sensory elements. We believe that this question can be addressed by evaluating which physicochemical features are important in determining the quality score of a wine. With that knowledge, wine manufacturers can refine certain wine-making procedures that may yield wines with "promising" properties.

We also believe that we can better capture the inherent subjectivity of the task by using a quality score that is a human taste output (i.e. each quality score is a median taken over a minimum of 3 sensory assessors) instead of following an objective and rigid standard, which makes wine certification a complicated task. Therefore, attempting to unravel the relationship between physicochemical properties and human taste sensations may also be a useful direction in the wine certification field (Cortez and others).

The data sets were sampled from the red and white vinho verde wines from the North of Portugal, created by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis (2009). The data sets were sourced from the UC Irvine Machine Learning Repository and can be found here. One data set is for the red wine, and the other is for the white wine, and both data sets have the same features and target columns. Each row represents a wine sample with its physicochemical properties such as fixed acidity, volatile acidity, etc. The target is a score (integer) ranging from 0 (very bad) to 10 (excellent) that represents the quality of the wine.

Editor: @flor14

Reviewer: Maj_Kyle, Neervaram Abhinandana Kumar_Manju, Nguyen_Jiang, Francis_Victor

gfairbro commented 2 years ago

Couldn't find Victor's or Linh Giang's GitHub handle!

gfairbro commented 2 years ago

@gn385x is Linh Giang, but I cannot assign her.

manju-abhinandana commented 2 years ago

Data analysis review checklist

Reviewer: @manju-abhinandana

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hrs

Review Comments:

Overall, the project is executed well and the report flows well. It is concise and summarizes the project well. The report clearly states the objective, analysis, methodology used for modelling, results, and limitations. Awesome job on building the Jupyter Book for the final report.

A few suggestions:

  1. The environment.yml file could be placed in the root of the project. There are a few scripts under src which are not being used. Can they be moved to an archive folder?
  2. The reports folder is not present, so I was not able to build the Jupyter Book with the command given under Usage.
  3. EDA: I think it would be good to mention how large the dataset is. Including this will help someone looking at the report assess how good the results are. Figure 1, which shows the class imbalance of the wine quality score, could go under the analysis section.
    1. Also, as part of EDA, it would be good to show the correlation between each feature and the quality score. This can be useful for feature selection.
    2. I am not sure about the solution adopted to handle class imbalance. Were other approaches like oversampling or adding class weights explored? It would be good to include that as well.
    3. It would be good if the code that produces the table results could be hidden in the final report.
    4. The conclusion and limitations could be under a separate subheading.
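
To illustrate the suggestion about feature-to-quality correlations, here is a minimal sketch in plain Python. The column names and values are made up for illustration and are not taken from the wine data set; `pearson` and `rank_features_by_correlation` are hypothetical helper names, not functions from the project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_features_by_correlation(columns, target):
    """Return (feature, r) pairs sorted by |r| with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up toy values, NOT real measurements from the wine data.
toy_columns = {
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "volatile_acidity": [0.8, 0.6, 0.5, 0.3],
    "ph": [3.2, 3.4, 3.1, 3.3],
}
toy_quality = [4, 5, 6, 7]

ranking = rank_features_by_correlation(toy_columns, toy_quality)
for feature, r in ranking:
    print(f"{feature}: r = {r:+.3f}")
```

Since the quality score is ordinal, a rank-based measure such as Spearman correlation may be a better fit than Pearson; the ranking logic would stay the same.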

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

gn385x commented 2 years ago

Data analysis review checklist

Reviewer: gn385x

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

What was done well:

  1. I found the research question well formulated and very interesting (speaking from my own experience, I struggle every time I go to a liquor store with a huge selection of wines to choose from). Given the question, the data chosen is a great fit.
  2. The analysis was quite comprehensive and clearly justified. For example, they addressed class imbalance by re-categorizing the target classes into groups, and they tried three classification models and applied hyperparameter optimization to arrive at the best model.
  3. The code was well written and easy to follow.
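
As a rough illustration of what re-categorizing the target classes into groups can look like, here is a small sketch. The cut-offs (`low_max`, `high_min`), the label names, and the scores are hypothetical and chosen only for the example; they are not the thresholds or data used in the project.

```python
from collections import Counter


def group_quality(score, low_max=5, high_min=7):
    """Map an integer quality score to a coarse label.

    The cut-offs are hypothetical defaults for illustration only.
    """
    if score <= low_max:
        return "low"
    if score >= high_min:
        return "high"
    return "medium"


# Made-up scores, NOT drawn from the wine data.
toy_scores = [3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 9]

before = Counter(toy_scores)
after = Counter(group_quality(s) for s in toy_scores)

print("classes before grouping:", len(before))
print("class counts after grouping:", dict(after))
```

Grouping reduces the number of very small classes, but it also changes the question the model answers, so the chosen cut-offs are worth stating in the report.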

What could be improved:


  1. EDA could be more concise with only main visualizations for users to quickly understand the data and identify key patterns.
  2. Regarding the final report in Jupyter Book, the navigation panel on the left side only showed “Summary”, which did not reflect the report's structure correctly and caused confusion.
  3. To deal with class imbalance, further solutions could be attempted such as changing class weights or under-/over-sampling.
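
The two alternatives mentioned in the last point can be sketched in plain Python as follows. This is only an illustration under assumptions: the labels are made up, and `balanced_class_weights` and `random_oversample` are hypothetical helpers. The weight formula, n_samples / (n_classes * class_count), is the heuristic scikit-learn documents for `class_weight="balanced"`.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Made-up labels, NOT the real class distribution of the wine data.
toy_labels = ["medium"] * 6 + ["low"] * 3 + ["high"] * 1
toy_rows = list(range(len(toy_labels)))

weights = balanced_class_weights(toy_labels)
rows_os, labels_os = random_oversample(toy_rows, toy_labels)

print(weights)
print(Counter(labels_os))
```

If oversampling is used, it should be applied to the training split only, so that duplicated rows do not leak into validation or test data.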

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Vikiano commented 2 years ago

Data analysis review checklist

Reviewer: vikiano

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Overall, I think this is great work - kudos team!

  1. I particularly like the detailed, yet concise manner in which you presented your analysis and findings.
  2. The justifications and reasons for almost every decision were clearly stated. This is my favourite part of your report.
  3. It was excellent that you tried out different models and selected the best performer in the end.
  4. However, it would be great if you could mention the size of the dataset used for model development and, more importantly, the sizes of the training and test splits. This would allow a reader of your report to make an independent judgment on the predictive quality of your model.
  5. It would be great if you could include the authors (the names of the group members) in the report summary.
  6. Also, I noticed some typos in the report. Kindly fix them.

Overall, this is an outstanding project. Well done!
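
One way to make the dataset and split sizes easy to quote in a report is to compute them in one place. The sketch below is only an illustration: the labels are made up, the 80/20 split and the seed are assumptions, and `split_indices` and `describe_split` are hypothetical helpers rather than anything from the project.

```python
import random
from collections import Counter


def split_indices(n_rows, test_fraction=0.2, seed=123):
    """Shuffle row indices and return (train_indices, test_indices)."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def describe_split(labels, train_idx, test_idx):
    """Summarise sizes and per-class counts so they can be quoted in a report."""
    return {
        "n_total": len(labels),
        "n_train": len(train_idx),
        "n_test": len(test_idx),
        "train_class_counts": dict(Counter(labels[i] for i in train_idx)),
        "test_class_counts": dict(Counter(labels[i] for i in test_idx)),
    }


# Made-up quality scores, NOT the real data.
toy_labels = [5, 6, 5, 7, 6, 5, 6, 4, 6, 5]

train_idx, test_idx = split_indices(len(toy_labels))
summary = describe_split(toy_labels, train_idx, test_idx)
print(summary)
```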

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Kylemaj commented 2 years ago

Data analysis review checklist

Reviewer: Kylemaj

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 Hours

Review Comments:

Very nicely done! Your group really went above and beyond the minimum requirements and weren’t afraid to use more complex tools. I found that your ideas were communicated effectively in the written sections and code was well documented and easy to read. I can see that you have already incorporated much of the feedback received from other reviewers and it was not easy to find areas for improvement.

Stood out

  1. As a Windows user, I found it particularly helpful that your README listed Windows-specific dependencies.
  2. I'm not sure how you got your links to open inside Jupyter instead of another browser tab, but it's awesome!
  3. The fact that you took the time to test multiple models and include hyperparameter tuning in your process says a lot about how much effort went into this.

Areas for improvement

  1. You may want to consider adding contact information to your CONTRIBUTING.md file. While it is clear who to contact about a code of conduct violation, it was less apparent who to contact regarding contributions and support. As an external contributor, my first inclination was to open an issue when I couldn't find contact info in the README or CONTRIBUTING files.
  2. The CONTRIBUTING file instructs contributors to make minor edits directly to the main branch through GitHub. This makes sense for the core team but may be a bit confusing in an outward-facing document.
  3. It was a bit confusing for me that your final report was named report summary given that there is another file in the wine_quality_predictor_report with the exact same name. You may want to change the name of your report to something that clearly marks it as the full and final version.
  4. There is a bit of overlap between the Analysis and Results sections of your report. You talk about class imbalance in each section and seem to have a different solution for it in each section. EDA findings also feel a bit out of place in the results section, you may consider moving this part under analysis (though this is purely subjective)
  5. The large gap between your train and test scores is mentioned several times though there did not seem to be discussion about the possibility of overfitting.
  6. Minor grammatical correction in the README. Second line in about section should read "Moreover which of these attributes contributes" rather than "Moreover which of these attribute contributes"
  7. Authors are listed in the readme though I could not find them in the final report.
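
As a small illustration of the overfitting point, the gap between train and validation scores can be checked explicitly. The scores, the model names, and the `max_gap` threshold below are all made up for the example and are not results from the project.

```python
def flag_overfitting(results, max_gap=0.10):
    """Return names of models whose train score exceeds the validation score by more than max_gap.

    `results` maps a model name to a (train_score, validation_score) pair.
    The default threshold is an arbitrary illustrative choice.
    """
    return [
        name
        for name, (train_score, valid_score) in results.items()
        if train_score - valid_score > max_gap
    ]


# Made-up scores, NOT the project's actual results.
toy_results = {
    "model_a": (0.99, 0.72),
    "model_b": (0.78, 0.74),
}

flagged = flag_overfitting(toy_results)
print("possible overfitting:", flagged)
```

A flagged model is only a prompt for discussion: the report could then describe what was tried in response, such as stronger regularization or a simpler model.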

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

flor14 commented 2 years ago

Hello group9! Yesterday I spent some time with all the groups present in lab1, providing some suggestions on how to improve the report. I think you were online, so I am leaving some minimal comments here:

Luming-ubc commented 2 years ago

1. Comments from Kylemaj on CONTRIBUTING.md file

Commit addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/9dc54f947b000ef3cc924db09d0b415a9d7396a6

2. Comments from Kylemaj, gn385x and Vikiano on Jupyter notebook structure

Commits addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/c79a016dd9e0cc0a0f5210b4c4b0810d04d328c4 https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/7405529ead6b01df4c60a4cc17b817c245e0cb12

3. Comments from manju-abhinandana on file organization

Commits addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/698e791add9e15a47b180ec345971a26d6e9b667 https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/13767b692661ab5085bc27ea20accfd3a19e423b https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/fc20795a0f8b854a7d7522cc618553a4b4704e33

4. Comments from manju-abhinandana, Vikiano, and Kylemaj on contents of report

Commits addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/57d23b2b8799b07cbf043ad71c7d012946cf05da https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/7405529ead6b01df4c60a4cc17b817c245e0cb12 https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/1b6b91ad6cbe26bf3c35b3a790c708439ac68fbb https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/bd50e1aacf21e4fe547866ac7081b953ae83d245 https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/d68aff8b25098345aaef4d6fa0039e67bdc04be7

5. Addressing TA's feedback on Milestone 2 release:

Commits addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/2dee29a5c4941d463f0664748b65ac28a579ea51