Submission: Group 09: Wine Quality Predictor

UBC-MDS / data-analysis-review-2021


Submission: Group 09: Wine Quality Predictor #2

Open gfairbro opened 2 years ago

gfairbro commented 2 years ago

Submitting authors: gfairbro, paradise1260, Luming-ubc, GWYY

Repository: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor

Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/blob/main/reports/wine_quality_predictor_report/_build/html/report_summary.html

Abstract/executive summary: Wine is both extremely popular and widely consumed, and it can be very expensive to buy and lucrative to sell. It is also sold in far greater variety than almost any other consumer product: some supermarkets stock well over 1000 different wines (Lockshin, 2003).

At the same time, wine is one of the hardest products to judge for quality ahead of purchase, since you must consume it to decide. The level of quality a consumer requires can even vary widely depending on the consumption occasion (P. G. Quester and others).

The quality of wine is, however, difficult to evaluate objectively and relies on some very subjective sensory elements. Nevertheless, we believe this question can be answered by evaluating which physicochemical features are important in determining the quality score of a wine. With that knowledge, wine manufacturers can refine certain wine-making procedures that may yield wines with "promising" properties.

We also believe that using a quality score that is a human taste output (i.e. each quality score is a median taken over a minimum of 3 sensory assessors), instead of following an objective and rigid standard, which makes wine certification a complicated task, lets us better capture the inherent subjectivity of the task. Therefore, attempting to unravel the relationship between physicochemical properties and human taste sensations may also be a direction in the wine certification field (Cortez and others).

The data sets were sampled from the red and white vinho verde wines from the North of Portugal, created by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis (2009). The data sets were sourced from the UC Irvine Machine Learning Repository and can be found here. One data set is for the red wine, and the other is for the white wine, and both data sets have the same features and target columns. Each row represents a wine sample with its physicochemical properties such as fixed acidity, volatile acidity, etc. The target is a score (integer) ranging from 0 (very bad) to 10 (excellent) that represents the quality of the wine.

Editor: @flor14

Reviewers: Maj_Kyle, Neervaram Abhinandana Kumar_Manju, Nguyen_Jiang, Francis_Victor

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

gfairbro commented 2 years ago

Couldn't find Victor or Linh Giang's github handle!

gfairbro commented 2 years ago

@gn385x is Linh Giang but I cannot assign her.

manju-abhinandana commented 2 years ago

Data analysis review checklist

Reviewer: @manju-abhinandana

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[ ] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hrs

Review Comments:

Overall the project is executed well and the report flows well. It is concise and summarizes the project well. The report clearly states the objective, analysis, methodology used for modelling, results, and limitations. Awesome job on building the Jupyter Book for the final report.

A few suggestions:

The environment.yml file could be placed in the root of the project. There are a few scripts under src which are not being used. Can they be moved to an archive folder?
The reports folder is not present, so I was not able to build the Jupyter Book with the command given under Usage.
EDA: I think it would be good to mention how large the dataset is. Including this will help someone reading the report assess how good the results are. Figure 1, showing the class imbalance of the wine quality score, could go under the analysis section.
Also, as part of the EDA it would be good to show the correlation between each feature and the quality score. This can be useful for feature selection.
I am not sure about the solution adopted to handle class imbalance. Were other approaches like oversampling or adding class weights explored? It would be good to include that as well.
It would be good if the code that produces the table results could be hidden in the final report.
The conclusion and limitations could be under a separate subheading.
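
To illustrate the correlation suggestion, here is a minimal sketch that ranks features by their Pearson correlation with the quality score. The feature names and values are made up for illustration, and the helper names `pearson` and `rank_features_by_correlation` are hypothetical; with the real data loaded in pandas, `DataFrame.corr` would likely do the same job in one call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would divide by zero here.
    return cov / math.sqrt(var_x * var_y)


def rank_features_by_correlation(features, target):
    """Return (name, r) pairs sorted by absolute correlation with the target."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy values, made up for illustration only (not taken from the wine data).
features = {
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "volatile_acidity": [0.70, 0.88, 0.28, 0.66, 0.30],
}
quality = [5, 5, 6, 6, 7]

for name, r in rank_features_by_correlation(features, quality):
    print(f"{name}: r = {r:+.2f}")
```

Since the quality score is an ordinal integer, a rank-based measure such as Spearman correlation may be a better fit than Pearson; the ranking logic would stay the same.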

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

gn385x commented 2 years ago

Data analysis review checklist

Reviewer: gn385x

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

What was done well:

I found the research question well formulated and very interesting (speaking from my own experience, I struggle every time I go to a liquor store with a huge selection of wines to choose from). Given the question, the chosen data set is a great fit.
The analysis was quite comprehensive and clearly justified. For example, they handled the class imbalance problem by re-categorizing the target classes into groups, and they tried three classification models and applied hyper-parameter optimization to arrive at the best model.
The code was well written and easy to follow.
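
For readers unfamiliar with the re-categorization idea, a minimal sketch follows. The cut-offs (`low_max=4`, `high_min=7`), the class names, and the function name `group_quality` are assumptions chosen for illustration; they are not necessarily the groups the authors used.

```python
def group_quality(score, low_max=4, high_min=7):
    """Map a 0-10 quality score to a coarse class (cut-offs are hypothetical)."""
    if not 0 <= score <= 10:
        raise ValueError(f"quality score out of range: {score}")
    if score <= low_max:
        return "low"
    if score >= high_min:
        return "high"
    return "medium"


# Made-up scores for illustration only.
scores = [3, 5, 6, 6, 7, 8, 5, 4]
grouped = [group_quality(s) for s in scores]
print(grouped)
```

Merging rare scores into broader groups gives each class more examples, at the cost of a coarser prediction target.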

What could be improved:

EDA could be more concise with only main visualizations for users to quickly understand the data and identify key patterns.
Regarding the final report in Jupyter Book, the section for searching the book (on the left side) only showed “Summary”, which did not reflect the report's structure correctly and caused confusion.
To deal with class imbalance, further solutions could be attempted such as changing class weights or under-/over-sampling.
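
As a rough sketch of the class-weight idea, the snippet below computes one weight per class as `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies when an estimator is given `class_weight="balanced"`. The label counts are made up and the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up label counts for illustration; not the project's real distribution.
labels = ["poor"] * 10 + ["average"] * 80 + ["good"] * 10
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Rare classes receive larger weights, so misclassifying them costs more during training. Unlike over- or under-sampling, this leaves the data itself unchanged.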

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Vikiano commented 2 years ago

Data analysis review checklist

Reviewer: vikiano

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[ ] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well-known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance of this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

Overall, I think this is great work - kudos team!

I particularly like the detailed, yet concise manner in which you presented your analysis and findings.
The justifications and reasons for almost every decision were clearly stated. This is my favourite part of your report.
It was an excellent decision to try out different models and select the best performer at the end of the day.
However, it would be great if you could mention the size of the dataset used for model development and, in particular, the sizes of the training and test splits. This would allow a reader of your report to make an independent judgment on the predictive quality of your model.
It would be great if you could include the authors (the names of the group members) in the report summary.
Also, I noticed some typos in the report while glancing over it. Kindly fix them.

Generally, an outstanding project you got here! Well-done.
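
One possible way to produce the split sizes for the report is sketched below. It assumes a simple shuffled split with a 20% test fraction and an arbitrary seed; the labels are made up, and the helper names `split_indices` and `summarize_split` are hypothetical rather than taken from the project.

```python
import random
from collections import Counter


def split_indices(n_rows, test_fraction=0.2, seed=2021):
    """Shuffle row indices reproducibly, then split them into train and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def summarize_split(labels, train_idx, test_idx):
    """Return row counts and per-class counts for each split."""
    return {
        "train": {
            "rows": len(train_idx),
            "classes": Counter(labels[i] for i in train_idx),
        },
        "test": {
            "rows": len(test_idx),
            "classes": Counter(labels[i] for i in test_idx),
        },
    }


# Made-up labels for illustration; the real counts come from the wine data.
labels = ["low"] * 5 + ["medium"] * 35 + ["high"] * 10
train_idx, test_idx = split_indices(len(labels))
summary = summarize_split(labels, train_idx, test_idx)

for name, info in summary.items():
    print(name, info["rows"], dict(info["classes"]))
```

Reporting the per-class counts alongside the row counts also shows readers how severe the class imbalance is in each split.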

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Kylemaj commented 2 years ago

Data analysis review checklist

Reviewer: Kylemaj

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[X] Modularity: Is the code suitably abstracted into scripts and functions?
[X] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[X] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[X] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 Hours

Review Comments:

Very nicely done! Your group really went above and beyond the minimum requirements and weren’t afraid to use more complex tools. I found that your ideas were communicated effectively in the written sections and code was well documented and easy to read. I can see that you have already incorporated much of the feedback received from other reviewers and it was not easy to find areas for improvement.

Stood out

As a Windows user, it was particularly helpful that your README contained Windows-specific dependencies.
I'm not sure how you got your links to open inside Jupyter instead of another browser tab, but it's awesome!
The fact that you took the time to test multiple models and include hyperparameter tuning in your process says a lot about how much effort went into this.

Areas for improvement

You may want to consider adding contact information to your CONTRIBUTING.md file. While it is clear who to contact about a code of conduct violation it was less apparent who to contact regarding contributions and support. As an external contributor my first inclination was to open an issue when I couldn't find contact info in the README or CONTRIBUTING files.
The CONTRIBUTING file instructs contributors to make minor edits directly to the main branch through GitHub. This makes sense for the core team but may be a bit confusing in an outward-facing document.
It was a bit confusing for me that your final report was named report summary, given that there is another file in the wine_quality_predictor_report with the exact same name. You may want to change the name of your report to something that clearly marks it as the full and final version.
There is a bit of overlap between the Analysis and Results sections of your report. You talk about class imbalance in each section and seem to have a different solution for it in each section. EDA findings also feel a bit out of place in the results section; you may consider moving this part under analysis (though this is purely subjective).
The large gap between your train and test scores is mentioned several times though there did not seem to be discussion about the possibility of overfitting.
Minor grammatical correction in the README. Second line in about section should read "Moreover which of these attributes contributes" rather than "Moreover which of these attribute contributes"
Authors are listed in the readme though I could not find them in the final report.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

flor14 commented 2 years ago

Hello group9! Yesterday I spent some time with all the groups present in lab1 providing some suggestions on how to improve the report. I think you were online, so I am leaving some minimal comments here:

I really like the report! Not much to say.
The columns of Table 1 have different widths. This affects legibility.
Table 3 is not centered like the rest of the figures.
You can ask any questions you have about Docker or the assignments in the Slack channel.
Congratulations on all your hard work 🥇
Note: These are suggestions only

Luming-ubc commented 2 years ago

1. Comments from Kylemaj on CONTRIBUTING.md file

You may want to consider adding contact information to your CONTRIBUTING.md file. While it is clear who to contact about a code of conduct violation it was less apparent who to contact regarding contributions and support. As an external contributor my first inclination was to open an issue when I couldn't find contact info in the README or CONTRIBUTING files.
The CONTRIBUTING file instructs contributors to make minor edits directly to the main branch through GitHub. This makes sense for the core team but may be a bit confusing in an outward-facing document.

Commit addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/9dc54f947b000ef3cc924db09d0b415a9d7396a6

Contact information (email) has been added. We also restructured CONTRIBUTING.md a bit so that it works as a public-facing document rather than one aimed only at the maintainer team.

2. Comments from Kylemaj, gn385x and Vikiano on Jupyter notebook structure

It was a bit confusing for me that your final report was named report summary given that there is another file in the wine_quality_predictor_report with the exact same name. You may want to change the name of your report to something that clearly marks it as the full and final version.
Regarding the final report in Jupyter Book, the section for searching the book (on the left side) only showed “Summary”, which did not reflect the report's structure correctly and caused confusion.

Commits addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/c79a016dd9e0cc0a0f5210b4c4b0810d04d328c4 https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/7405529ead6b01df4c60a4cc17b817c245e0cb12

We were able to figure out why Jupyter Book was behaving that way and added an index page that includes our title, author and date information. This had the added bonus of including Summary in the RHS table of contents. The revised report can be found here.

3. Comments from manju-abhinandana on file organization

The environment.yml file could be placed in the root of the project. There are a few scripts under src which are not being used. Can they be moved to an archive folder?

Commits addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/698e791add9e15a47b180ec345971a26d6e9b667 https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/13767b692661ab5085bc27ea20accfd3a19e423b https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/fc20795a0f8b854a7d7522cc618553a4b4704e33

We moved the environment.yml file to the root of the project. We also removed unnecessary files from the src directory.

4. Comments from manju-abhinandana, Vikiano, and Kylemaj on contents of report

EDA: I think it would be good to mention how large the dataset is. Including this will help someone reading the report assess how good the results are. Figure 1, showing the class imbalance of the wine quality score, could go under the analysis section.
It will be great if you can include the authors - names of the group members - in the report summary.
Authors are listed in the readme though I could not find them in the final report.
The conclusion and limitations could be under a separate subheading.
There is a bit of overlap between the Analysis and Results sections of your report. You talk about class imbalance in each section and seem to have a different solution for it in each section.

Commits addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/57d23b2b8799b07cbf043ad71c7d012946cf05da https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/7405529ead6b01df4c60a4cc17b817c245e0cb12 https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/1b6b91ad6cbe26bf3c35b3a790c708439ac68fbb https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/bd50e1aacf21e4fe547866ac7081b953ae83d245 https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/d68aff8b25098345aaef4d6fa0039e67bdc04be7

We added a breakdown of the examples in the train and test data. We moved the class imbalance figure to the analysis section. We added an index page with our project title, author and date information when rebuilding the Jupyter Book. We added two separate subheadings for conclusion and limitations. Each team member reviewed and edited the final report to remove overlapping content and fix minor issues. The revised report can be found here.

5. Addressing TA's feedback on Milestone 2 release:

You're using try, except everywhere instead of checking if a file exists when outputting (and maybe reading) the files. You could instead check if the file exists and create it if it doesn't. Using try except in production except for when dealing with 3rd party stuff is not a good practice.

Commits addressing the comments: https://github.com/UBC-MDS/DSCI_522_group09_Wine_Quality_Predictor/commit/2dee29a5c4941d463f0664748b65ac28a579ea51

Thank you for your suggestions. We decided to use an if statement to check whether a file exists before exporting it, and we adopted for loops to keep our code DRY.
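
For reference, the pattern described in this response might look roughly like the sketch below. It is not the code from the linked commit: the function names `export_text` and `export_all` and the output file names are hypothetical, and the demo writes into a temporary directory.

```python
import os
import tempfile


def export_text(path, text):
    """Write text to path, creating the parent directory first if it is missing."""
    parent = os.path.dirname(path)
    # Explicit existence check instead of wrapping the write in try/except.
    if parent and not os.path.exists(parent):
        os.makedirs(parent)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def export_all(outputs):
    """Export every (path, text) pair in one loop instead of repeated blocks."""
    for path, text in outputs.items():
        export_text(path, text)


# Demo in a temporary directory; the file names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    outputs = {
        os.path.join(tmp, "results", "scores.csv"): "model,score\n",
        os.path.join(tmp, "results", "summary.txt"): "done\n",
    }
    export_all(outputs)
    print(sorted(os.listdir(os.path.join(tmp, "results"))))
```

Looping over a mapping of paths to contents keeps the export logic in one place, so a change to how files are written only has to be made once.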