Submission: Group 18: Wine Quality Predictor

ttimbers commented 7 months ago

Submitting authors: Sid Ahuja, Zackarya Hamza, Alexander Dawson

Repository: https://github.com/DSCI-310-2024/DSCI-310-Group-18_wine-quality-predictor/releases/tag/version2.0.0

Abstract/executive summary:

In this project, we build a prediction model using the k-nearest neighbours algorithm which attempts to categorize the quality of a wine based on its physicochemical properties. We classify wine quality into a binary category: whether it is good or bad. Our classifier performed moderately well on the test set, but further research must be done to improve the model before it is put into production.

The dataset that we used for this project covers white variants of the Portuguese "Vinho Verde" wine and was assembled by Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. The dataset was sourced from the UCI Machine Learning Repository (Dua and Graff 2017). Each row in this dataset is an observation of one white wine, recording its physicochemical and sensory attributes.

Editor: @ttimbers

Reviewer: Zhibek Dzhunusova, Prithvi Sureka, Peter Chen

[ ] I agree to abide by DSCI 310's Code of Conduct during the review process.

petercmh01 commented 7 months ago

Data analysis review checklist

Reviewer:

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:

My suggestions would be:

Find a way to specify input type constraints for the script. In Python we were able to do this with, for example: @click.option('--test_data_path', help='path of test set data (csv) to read', type=str). I am not sure whether you can do the same in R, but it is a good way to help reproducibility.
If there is no way to specify the input type, I would recommend adding a test that checks the input type.
Consider letting users change the hyperparameter tuning settings. For example, let users choose the number of folds for cross-validation as an input argument of the script.
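To illustrate the suggestions above, here is a minimal sketch of a script entry point that constrains input types and exposes the number of cross-validation folds as an argument. It uses Python's standard-library argparse rather than click so that it runs without extra packages; the same idea applies to click's type= parameter. The --n_folds option, its default of 5, and the helper names fold_count and build_parser are hypothetical and not taken from the authors' repository.

```python
import argparse


def fold_count(value: str) -> int:
    """Convert a command-line string to a fold count, rejecting bad input early."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer") from None
    if number < 2:
        # Cross-validation needs at least two folds to have a held-out part.
        raise argparse.ArgumentTypeError("number of folds must be at least 2")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options carry explicit type constraints."""
    parser = argparse.ArgumentParser(description="Evaluate a classifier (sketch).")
    parser.add_argument(
        "--test_data_path",
        type=str,
        required=True,
        help="path of test set data (csv) to read",
    )
    parser.add_argument(
        "--n_folds",  # hypothetical option name
        type=fold_count,
        default=5,  # assumed default, chosen only for illustration
        help="number of folds for cross-validation (integer >= 2)",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--test_data_path", "data/test.csv", "--n_folds", "10"])
print(args.test_data_path, args.n_folds)
```

Passing a non-integer or a value below 2 for --n_folds makes argparse print a usage error and exit before any analysis code runs, which is also the behaviour an input type test would check.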

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

zhibekD commented 7 months ago