Submission: Group 20: Red Wine Quality Prediction

Submitting authors: @alexzhang0825 @sungg888 @AlysenTownsley @nicolebid

Repository: https://github.com/UBC-MDS/Red-Wine-Quality-Prediction Report link: https://ubc-mds.github.io/Red-Wine-Quality-Prediction/ Abstract/executive summary: In this project our group seeks to use machine learning algorithms to predict wine quality (scale of 0 to 10) using physiochemical properties of the liquid. We use a train-test split and cross-validation to simulate the model encountering unseen data. We use and tune the parameters of several classification models: logistic regression, decision tree, kNN, and SVM RBF to see which one has the highest accuracy, and then deploy the winner onto the test set. The final test set accuracy is around 62 percent. Depending on the standard, this can be decent or poor. However, a more important note is that for the really extreme quality ones (below 5 or above 6), the model was unable to identify quite a few of them correctly, suggesting that it is not very robust to outliers. We include a final discussion section on some of the potential causes for this performance as well as proposed solutions for any future analysis.

Editor: @ttimbers Reviewer: Shizhe Zhang, Tony Shum, Jake Barnabe, Yiwei Zhang

[ ] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: zywkloo

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[ ] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

As shown in the image below, the src directory only contains 1 file. In that case, all test cases trying to call the helpers from src/ would not be executed, as they will fail in the dependency import stage (most likely).
The test files could have more consistent naming standard, like test-word1_word2_word3.py
May consider referencing the environment.yaml in the docker file, to enhance the code reusability. For example:
```
FROM quay.io/jupyter/minimal-notebook:2023-11-22

WORKDIR /home/jovyan

COPY environment.yaml .

RUN conda env update --file environment.yaml
```
Despite some minor naming and importing issues, this project stands out for comprehensive documentation, code quality, reproducibility, and thorough analysis reporting.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer:

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[ ] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

Review Comments:

The report is very thorough and explains the analysis in great detail. However, it features many technical details that may be too complex to add to the understanding of the average reader. For instance, the section at the end of the introduction outlining which software packages were used in the analysis may not be meaningful to many readers who aren't technically trained in data science. It may be worth considering removing that section to keep the report concise and understandable to all readers.
In an effort to improve reproducibility, it may be worth adding edge cases to the function modules to make sure the functions are, for example, passed inputs of the correct data type, or are non-null, etc. One example of a useful edge case could be raising an error if the dataframe read in in the data_split.py script does not have the correct number of columns.
The repository is structured in a largely clear and easy-to-navigate manner. Though, I noticed that all but one of the scripts are stored in the scripts directory. It may be a good idea to save all the scripts in the scripts folder as it may reduce the chances of readers missing the read_view.py script (as I almost did).

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: zhang-shizhe

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[ ] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1 hour

Review Comments:

The README.md file is well-detailed and organized. Great work on this!

It seems like there's a typo in the test_set_deployment.py script that caused an error during its execution.

Traceback (most recent call last):
File "/home/jovyan/work/scripts/test_set_deployment.py", line 72, in <module>
test_set_deployment()
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
     ^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/work/scripts/test_set_deployment.py", line 58, in test_set_deployment
X_test = pd.read_csv((x_test_folder + 'x_test.csv'))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
return _read(filepath_or_buffer, kwds)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 611, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
self._engine = self._make_engine(f, self.engine)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
self.handles = get_handle(
               ^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/common.py", line 863, in get_handle
handle = open(
         ^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '../results/tables/x_test.csv'

Given the models that being used, it might be worth trying to apply a log transformation to some of the features, which seem to have a pronounced skew. Features with high variance or long tails can dominate the distance calculations in the feature space, so transforming these features to approximate a normal distribution might help improve the performance.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @tonyshumlh

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

CONTRIBUTING.md is mentioned in README but the file does not exist
Tests are available for some of the scripts but not all of them
For running the scripts, some scripts take positional arguments but some take options
The instruction in README is very clear!!
They have clearly stated the limitation and the possible improvement actions to be taken in the future.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

UBC-MDS / data-analysis-review-2023