Submission: Group 6: Wine Quality Analysis

Submitting authors: Felix Li, Gurman Gill, Dia Zavery, Steve He

Repository: https://github.com/DSCI-310-2024/DSCI_310_Group_6/releases/tag/v2.0.0

Abstract/executive summary:

This analysis project attempted to explore the predictive relationships between the physicochemical properties of wine and its quality, utilizing regression analysis and a forward selection algorithm to identify key predictors. Our investigation was motivated by the wine industry's increasing reliance on data analysis and machine learning to enhance wine quality assessments, aiming to decode the complex interplay between a wine's chemical makeup and its sensory appeal. Despite the sophisticated methodology and the comprehensive dataset from the UCI Machine Learning Repository, our findings revealed the model's limited predictive capability, with a low R-squared value highlighting a significant portion of unexplained variability in wine quality. This outcome, while not entirely unexpected given the nuanced nature of wine quality determination, shows the limitations of linear regression models in capturing the intricate factors that influence wine quality. The analysis points to potential areas for improvement, such as incorporating more or better-quality data, considering additional variables, and employing more complex modeling techniques. Our study thus not only contributes to the academic discourse on predictive modeling in the wine industry but also sets the stage for future research that could leverage advanced analytics to unravel the complexities of wine quality assessment, supporting the industry's pursuit of excellence and innovation in wine production and evaluation.

To emphasize our dedication to reproducibility and trustworthiness, our project leverages renv for capturing our R computational environment, ensuring that our analysis can be precisely replicated. Our GitHub repository, structured for clarity and ease of use, combines literate programming within our analysis to integrate code and narrative seamlessly. By documenting our environment and adopting transparent development practices, including issue tracking and contributing guidelines, we not only uphold the integrity of our work but also support the broader data science community in pursuing reproducible research.

Editor: @ttimbers

Reviewers: Prabhjot Singh, Rico Chan, Jackson Siemens, Darwin Zhang

[ ] I agree to abide by DSCI 310's Code of Conduct during the review process.

Data analysis review checklist

Reviewer: ricochn02

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[ ] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1

Review Comments:

The README is quite outdated and is missing information on features that were implemented in milestones 2 and 3. Most notably, instructions on Docker and testthat usage are missing. This made the code review more challenging, as it was unclear how to reproduce your analysis.
The repository organization could be streamlined by referring to the structures shown in the class notes and the example repository (breast cancer prediction). Folders like "winefiles" and "models" could probably be subsumed under other folders.
The repository contains several redundant files, such as renv-related files and a duplicate README. I believe the analysis should no longer require renv if Docker is properly implemented.
The tests folder is missing the testthat subdirectory, i.e., tests/testthat.
Being pedantic, I noticed that the function in clean_data.R is implemented as clean_wine_data() in the code. It might be a good idea to rename one of them for consistency.
The writing of the report is strong, and I found the discussion/conclusion section especially comprehensive and illuminating.
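To illustrate the tests/testthat suggestion above, here is a minimal sketch of what a test file in that subdirectory could look like. The file name (tests/testthat/test-clean_wine_data.R) and the body of clean_wine_data() are assumptions made for illustration only. The function in your repository may behave differently, and in practice the test file would source it rather than define it.

```r
# Hypothetical file: tests/testthat/test-clean_wine_data.R
library(testthat)

# Stand-in for the project's cleaning function (an assumption, not the
# repository's actual code): drops rows with missing values and converts
# column names to lower snake_case.
clean_wine_data <- function(data) {
  stopifnot(is.data.frame(data))
  data <- data[stats::complete.cases(data), , drop = FALSE]
  names(data) <- gsub("[ .]+", "_", tolower(names(data)))
  data
}

test_that("clean_wine_data drops incomplete rows and tidies column names", {
  raw <- data.frame(
    `Fixed Acidity` = c(7.4, NA, 7.8),
    Quality = c(5, 6, 5),
    check.names = FALSE
  )
  cleaned <- clean_wine_data(raw)
  expect_equal(nrow(cleaned), 2)
  expect_named(cleaned, c("fixed_acidity", "quality"))
})

test_that("clean_wine_data rejects input that is not a data frame", {
  expect_error(clean_wine_data("not a data frame"))
})
```

With the files in tests/testthat, the whole suite can be run with testthat::test_dir("tests/testthat"). Naming the script after the function it defines (or the reverse) would also resolve the clean_data.R versus clean_wine_data() mismatch.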

Overall, fantastic work on the project! We're almost at the end of the term :)

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: dwinzg

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[ ] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1

Review Comments:

For the report, it would be beneficial to provide additional description for the figures beyond simply labeling them and stating their title. For instance, in Figure 1, titled "Distribution of Wine Quality," while the overall concept is clear, it would be more informative to include specific details such as the value for 'average quality' and its significance. This would enhance the reader's understanding and interpretation of the figure, contributing to a more comprehensive analysis of the data presented.
There are a few minor formatting issues that could be addressed for better clarity and readability. Firstly, in the Table of Contents, the Introduction is listed as starting on page 1, but it actually begins on page 2. Secondly, the column names in Table 1 appear bunched together, which makes them a bit challenging to read.
Updating the README.md file in folders such as 'tests' could be beneficial to improve understanding of the files and their purposes. Currently, not all of the test code in this folder is documented.
I noticed a small inconsistency in the naming conventions across the project. For instance, in the 'tests' directory, 'test_scatter_plot.R' uses only underscores, while the other tests use a combination of hyphens and underscores. Similarly, the data files, 'winequality-*.csv', could be renamed to be more readable and consistent. Otherwise, great job with consistency and descriptiveness throughout, such as within the 'scripts' directory.
The report is very well-written from start to finish, providing a clear and concise summary. It effectively explains the research question, describes the findings, and discusses their implications.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: jacksiemens

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[ ] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:

Data Representation and Clarity: The report does an excellent job detailing the methodology and findings, but it could benefit from more nuanced descriptions and interpretations of the data visualizations presented. For instance, the scatter plots and correlation heatmaps are pivotal in understanding the relationships between variables, yet a deeper narrative explaining what these relationships imply for wine quality predictions could enhance comprehension.
Discussion of Limitations and Assumptions: While the report touches on model limitations and potential improvements, a more structured discussion around these points could provide clearer directions for future research. Specifically, outlining the assumptions made by the linear regression model and how they might not fully capture the complexity of wine quality would be insightful.
Deepening Analysis Interpretation for Practical Use: While the report provides a solid foundation on linking wine’s physicochemical properties with its quality, adding more context on what this analysis means in real-world terms could significantly enhance its value.
Clearer Documentation for Maintenance and Clarity: The functionality and intent behind your custom functions and tests are well-designed. Adding detailed comments explaining the purpose, inputs, and outputs of these functions, as well as the reasoning behind specific tests, could greatly enhance the project’s readability and ease future maintenance efforts.
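As a sketch of the kind of function documentation suggested above, roxygen2-style comments can state the purpose, inputs, and output right next to the code. The function fit_quality_model(), its arguments, and the toy data below are hypothetical and only serve as an illustration. They are not taken from your repository.

```r
#' Fit a linear model of wine quality (hypothetical helper for illustration)
#'
#' Purpose: regress `quality` on a chosen set of physicochemical predictors.
#'
#' @param data A data frame with a numeric `quality` column and one column
#'   per predictor.
#' @param predictors A character vector of column names used as predictors.
#' @return An `lm` object for the model `quality ~ <predictors>`.
fit_quality_model <- function(data, predictors) {
  stopifnot(is.data.frame(data), "quality" %in% names(data))
  missing_cols <- setdiff(predictors, names(data))
  if (length(missing_cols) > 0) {
    stop("Columns not found in data: ", paste(missing_cols, collapse = ", "))
  }
  # Build the formula from the predictor names, e.g. quality ~ alcohol + ph
  model_formula <- stats::reformulate(predictors, response = "quality")
  stats::lm(model_formula, data = data)
}

# Usage on simulated toy data (not the UCI wine dataset)
set.seed(310)
toy <- data.frame(
  alcohol = rnorm(50, mean = 10, sd = 1),
  ph = rnorm(50, mean = 3.3, sd = 0.15)
)
toy$quality <- 3 + 0.3 * toy$alcohol + rnorm(50, sd = 0.7)

model <- fit_quality_model(toy, c("alcohol", "ph"))
summary(model)$r.squared
```

The same comment style could also carry a short note on the linear regression assumptions (linearity, constant residual variance, roughly normal residuals), which would tie in with the limitations point above.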

Overall, very well done! Great work!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

DSCI-310-2024 / data-analysis-review-2024