Submission: GROUP_19: Wine_Quality_Score_Predictor

Submitting authors: Kingslin0810, manju-abhinandana, zackt113, PavelLevchenko

Repository: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor

Report link: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/blob/main/doc/Wine_Quality_Score_Predictor_report.md

Abstract/executive summary: The aim of this project is to predict the quality of wine on a scale of 0 to 10, as rated by wine test reviewers, given a set of physicochemical features as inputs. This model is useful to support wine tasting evaluations. The data set for this project is related to red and white vinho verde wine samples, from Portugal, created by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. It is sourced from the UCI Machine Learning Repository and can be found here. Each row in the data set represents a wine sample with its label (red or white) and its physicochemical properties, which include fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, and sulphates.

We built regression models using Ridge, One-Vs-Rest Logistic Regression, SVC, and Random Forest Regressor. During cross-validation, we found that the Random Forest Regressor delivered a much higher training score, but it was clearly overfitting. We then ran hyperparameter optimization in an attempt to improve the model. Unfortunately, the test score with the best hyperparameters was only around 0.53. By analyzing feature coefficients, we reduced the model to 10 features. Some features had low coefficients, as expected from our initial EDA. In the coming weeks, we intend to refine our model further and achieve a higher test score if possible.

Editor: @flor14

Reviewers: Zandian_Artan, Maidana_Melisa, Siu_Thomas

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @thomassiu

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

Overall, the project was well executed and the final report was well organized. It clearly stated the objectives, the data used, the prediction methodology, and the results and limitations. There were also many references to prior research, which made the case solid. Well done team!

Suggest asking contributors to fork the repository first rather than asking them to clone directly. Refer to README.MD. It may be a good idea to modify some instructions in the CONTRIBUTING file too.
The script clean_split.py could not be executed; it failed in function split_for_train_test with the error KeyError: "['quality'] not found in axis". Steps to reproduce:
- Create a new conda environment with the yaml file provided
- Execute script download_data.py
- Execute script clean_split.py

After I changed the separator used to read the CSV files, it worked:

red_df = pd.read_csv(input_red, sep=",")
white_df = pd.read_csv(input_white, sep=",")
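
One way to make this step less sensitive to the separator is to detect it from the header and fail early with a clear message when the target column is missing. The following is only a minimal sketch using the standard library: the helper names detect_delimiter and read_wine_rows are hypothetical and do not exist in the repository, and the candidate separators (";" and ",") are an assumption.

```python
import csv
import io


def detect_delimiter(header_line, candidates=(";", ",")):
    """Return the candidate that splits the header into the most fields."""
    return max(candidates, key=lambda d: len(header_line.split(d)))


def read_wine_rows(text, required_column="quality"):
    """Parse delimited text and fail early if the target column is missing."""
    if not text.strip():
        raise ValueError("input is empty")
    header_line = text.splitlines()[0]
    delimiter = detect_delimiter(header_line)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = reader.fieldnames or []
    if required_column not in columns:
        # A clearer message than the KeyError raised later by a drop call.
        raise ValueError(
            f"column {required_column!r} not found with delimiter "
            f"{delimiter!r}; columns seen: {columns}"
        )
    return list(reader)


# Illustrative two-line samples, not real rows from the data set.
semicolon_rows = read_wine_rows("pH;quality\n3.5;5\n")
comma_rows = read_wine_rows("pH,quality\n3.5,6\n")
```

With a check like this, a mismatch between the separator written by download_data.py and the one expected by clean_split.py would surface as a readable error at load time.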

Suggest including an end-to-end test of the process that either creates a new environment or removes the existing working files before each execution.

Suggest adding a new parameter for script Wine_Score_EDA.py, that allows contributors to select where to save the output files. This was one of the requirements in milestone 2.
It would be good to align the tense throughout the final report. For example, the paragraph below the correlation matrix uses both past tense and simple future tense.
It would be great if captions were added to the graphs shown in the Analysis section.
Add a flow chart of how the scripts are executed in the README (as per requirement in milestone 2).
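
For the suggestion above about letting contributors choose where Wine_Score_EDA.py saves its output, a minimal sketch might look like the following. It uses argparse from the standard library as a stand-in for the script's actual command-line parsing, and the option name --out_dir, its default "results", and both function names are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse a hypothetical --out_dir option; the default is an assumption."""
    parser = argparse.ArgumentParser(description="EDA output location (illustrative)")
    parser.add_argument(
        "--out_dir",
        default="results",
        help="directory where figures and tables are written",
    )
    return parser.parse_args(argv)


def prepare_output_dir(out_dir):
    """Create the output directory (and any parents) if it does not exist."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

A script could call prepare_output_dir(parse_args().out_dir) once at start-up and write every figure under the returned path, so a missing folder no longer causes a crash.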

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @mmaidana24318

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

Review Comments:

1) Thorough EDA, but I did find the correlation matrix a bit overwhelming. Have you thought about switching it with the heatmap you have in the EDA document? I personally found that one easier to digest.
2) Adding a yaml file is so thoughtful. Very much appreciated! One minor comment about it: I tried to use it directly from my local repo (instead of downloading and saving it) but got an error, since the file name is env-wine-prediction.yaml (not wine.yaml as per the instructions). It's a minimal fix, but renaming the file (or changing the instructions) so that they match would make things smoother.
3) Some flexibility in the Wine_Score_EDA.py script would be convenient. Ideally I would like to specify the output directory. Alternatively, the script could create the directory results if it doesn't exist. At a minimum, I would state in the docopt usage and the docstrings that a folder named results is required, since right now the script just crashes when that folder is missing.
4) Is it possible to add DummyRegressor to the list of models? I believe that would help to understand the baseline, and hopefully the complexity of the data set.
5) Based on the low scores obtained by SVC and Ridge, it seems like you are dealing with features that are non-linearly related. Have you considered adding PolynomialFeatures, for instance?
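
To illustrate what the baseline in the DummyRegressor comment would show: as far as I know, scikit-learn's DummyRegressor predicts the training-set mean by default. The sketch below reproduces that idea in plain Python rather than calling scikit-learn, using made-up quality scores, so it should be read as an illustration and not as the project's code. The function names are hypothetical.

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_mean_baseline(y_train):
    """'Fit' the baseline: it only remembers the mean of the training targets."""
    return mean(y_train)


def predict_constant(constant, n_samples):
    """Predict the same constant for every sample, ignoring all features."""
    return [constant] * n_samples


# Made-up quality scores for illustration only.
y_train = [5, 6, 5, 7, 6]
baseline = fit_mean_baseline(y_train)
train_r2 = r2_score(y_train, predict_constant(baseline, len(y_train)))
```

A mean-only baseline scores an R² of 0 on the data it was fit to, so any model score can be read as an improvement over that reference point.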

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @artanzand

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

Overall, this project was put together well and I enjoyed reading a report with good narration.

General comment: figures do not have a number which makes referencing them hard. I recommend adding them. The report references figure numbers, but I was not able to see any labels for the figures. Also, adding a title for all figures would make it easier for the reader to scan the report quicker.
In the EDA section for the faceted plot, it might be better to set the stack for the histogram to false, and use opacity to show the relative frequencies of the two wine types. Currently it is hard for example to see the distribution for red wine.
It would be necessary to mention the size of the data to know how significant your results are.
In the Results & Discussion section, this is described as a regression problem, but the report then suggests using OneVsRestClassifier and SVC. This is confusing to the reader, as these two are classification models. Maybe use SVR and mention that you are using OneVsRestClassifier with LogisticRegression.
Can the MISC folder be deleted if the ipynb files are not used anymore?
In model_fitting.py you might do some cleanup to remove the redundant comments and the functions that are not used anywhere. For example, the mape function and mape_scorer are never used and can be taken out. tune_hyperparameters_for_best_model() is a very long name for a function, and a name like tune_hyperparameters() conveys the same meaning.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Comment from @andytai7: "What about missing data? How will you handle the missing data if there is some? (doesn't matter if there isn't any you should propose a method before EDA)": Additional function created in EDA script: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/39187d471ba830443aaeeab413284a77f96bd5bf Remarks in final report regarding missing data: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/43d5c286728345c9307702b6973986528e4ea442
Comment from @mmaidana24318: "Some flexibility in the Wine_Score_EDA.py script would be convenient. Ideally I would like to specify the output directory. Alternatively, the script could create the directory results if it doesn't exist" Added "try/except" logic that creates the folder if it does not exist: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/78e0e99a4b01d61870471dcfd82fd20924bb1ea2
Comment from @andytai7: "There could be an imbalance in the classes, in which you would have to under-sample or oversample. Which one will you utilize? What packages will you use will you create synthetic?" Imbalance is handled by modifying this script: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/55a4c5afde84b3186c96dedd5f6978279fca9926 Updated the yaml file with a library for treating the imbalance issue: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/3c3158c543efbfa65b354d14507bc2ab900ee50c
Comment from @artanzand: "In model_fitting.py you might do some clean up to remove the redundant comments and functions that are not used anywhere. For example, the mape function and mape_scorer is never used and can be taken out. tune_hyperparameters_for_best_model() is very long for a function and a name like tune_hyperparameters() describes the same." Made modifications as suggested: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/5f8525d3881c5a61628d9e4060b4654f40da00d3
Comment from @mmaidana24318: "Is possible to add DummyRegressor to the list of models? I believe that would help to understand the baseline, and hopefully the complexity of the data set." Added DummyRegressor as suggested: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/ab93a470cd26f8073a449d58b6866d56f78b0be6
Comment from @artanzand: "In the Results & Discussion section, there is mention of this being a regression problem, but then report suggests using a OneVsRestClassifier and SVC. This is confusing to the reader as these two are classification models. Maybe use SVR and mention you are using OneVsRestClassifier with LogisticRegression" LogisticRegression and OneVsRestClassifier were removed and SVR was added: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/dd64d94a234a1c75c6d0e296d80509ed45539f18

UBC-MDS / data-analysis-review-2021