DSCI-310-2024 / data-analysis-review-2024


Submission: Group 5: Predicting Wine Cultivars #5

Open ttimbers opened 3 months ago

ttimbers commented 3 months ago

Submitting authors: Zhibek Dzhunusova, Andrea Jackman, Kaylan Wallace, Chuxuan Zhou

Repository: https://github.com/DSCI-310-2024/dsci-310_group-5_predicting-wine-cultivars/releases/tag/milestone_3

Abstract/executive summary:

In this project, we asked whether we could predict what cultivar a wine was derived from based on its chemical properties.

The data was sourced from the UCI Machine Learning Repository (Aeberhard et al. 1991). It contains data about 178 wines from Italy derived from three different cultivars. Each row represents the chemical and physical properties of a different wine, such as the alcohol concentration, magnesium level, and hue.

Using a k-nearest neighbors algorithm, we attempted to predict the cultivar of an unknown wine based on physicochemical properties such as alcohol content and magnesium levels. Our model achieved 98% accuracy on our test data, but struggled to accurately predict the origin of wines from cultivar 2. This demonstrates that most cultivars are strongly distinguished by their physicochemical properties, with some overlap between similar cultivars. Our analysis effectively identified chemical properties that strongly distinguish cultivars. This provides valuable information for winemakers about which grapes to cultivate to achieve their desired wine characteristics and informs future innovation in the wine industry.

Editor: @ttimbers

Reviewers: Edwin Yeung, Steve He, Alex Lin, Lillian Milroy

edsters78 commented 3 months ago

Data analysis review checklist

Reviewer: edsters78

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

  1. The overall content of the analysis report was very well done! However, there is one glaring issue with the reproducibility of the report: the repository is missing a docker-compose.yml file. Because of this, I was unable to reproduce the analysis following the instructions in the README.md file. Recall that the docker compose commands specified in the README.md require a docker-compose.yml file; a Dockerfile alone is not sufficient (see the docker-compose.yml sketch after this list).
  2. At the bottom of your LICENSE.md file, in the Workflow for publishing the dockerfile section, the copyright still reads "Copyright (c) 2024 Tiffany Timbers"; you forgot to update it to your own names.
  3. You may want to spend some time checking for small incorrect details scattered throughout the project. For example, the instructions to reproduce the analysis in the README.md file say to run docker compose run --rm analysis-env make clean-all to remove previous versions of the analysis, but your Makefile only contains a clean target, not clean-all, so that command would fail (see the Makefile sketch after this list). Another example: inside your first script, 01_download_data.R, the documentation comments on lines 4 and 5 say "This script calculates the mean for a specified column from wine.data", which is obviously not what the script does.
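
A minimal docker-compose.yml sketch for point 1, assuming the analysis-env service name from your README and the ajackman48/dsci-310-group-5 image mentioned later in this thread (the mount point is a guess and should match your Dockerfile's WORKDIR):

    # docker-compose.yml (sketch only; adjust the image, mount path, and working_dir to your setup)
    services:
      analysis-env:
        image: ajackman48/dsci-310-group-5:latest
        volumes:
          - .:/home/project    # assumption: mount the repo at the image's working directory
        working_dir: /home/project

With this file in the repository root, and assuming make is available in the image, docker compose run --rm analysis-env make clean should behave as the README describes.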
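
For point 3, the simplest fixes are either to change the README to call make clean, or to add a clean-all target to the Makefile. A hypothetical addition that aliases clean-all to the existing clean target:

    # Sketch: make clean-all an alias for the existing clean target
    .PHONY: clean-all
    clean-all: clean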

Overall, great job! You should all be proud of the work that's been completed so far! The content of the analysis in particular is well done, just be sure that someone who has never worked with the analysis can run it entirely from top to bottom following your instructions.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Lilmillyrock commented 2 months ago

Data analysis review checklist

Reviewer: lilmillyrock

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 3

Review Comments:

  1. In the comment above, edsters78 does a good job of explaining the reproducibility issue caused by the docker-compose-reliant instructions in the Usage section of the README, despite the lack of a docker-compose.yml file. I tried to reproduce the analysis directly from the Dockerfile, in the hope of finding something useful for when you adjust your Docker setup.

    I pulled and ran the Docker image with the following commands:

        docker pull ajackman48/dsci-310-group-5:latest
        docker run -it --rm ajackman48/dsci-310-group-5:latest

    However, when I tried to run the function tests in the R terminal, I ran into the following error:

        testthat::test_dir("tests/testthat")
        Error in loadNamespace(x) : there is no package called 'testthat'

    At this point, I wasn't sure whether the error was caused by a local dependency issue or by the Dockerfile, so I decided to call it a day rather than risk giving irrelevant feedback. To verify that the testthat package is being installed correctly in the container, I would recommend running the docker build in a clean environment from top to bottom (see the Dockerfile sketch after this list). As for the reproducibility concern, while not strictly necessary in this context or required for the assignment, I do think that incorporating a docker-compose.yml file into your workflow is the simplest fix (a sketch appears in the first review above); it would standardize the setup across operating systems and manage dependencies better than a lone Dockerfile.

  2. Your Dockerfile has a fair amount of repeated code. Each RUN command creates a new layer in the docker image, making a bigger image that takes more time to build. Consolidating your RUN commands will reduce both the image size and the build time (also covered in the Dockerfile sketch after this list). I also noticed that devtools is installed twice in your Dockerfile (lines 19 and 22).

  3. In the script 01_download_data.R, the header comment states that "This script calculates the mean for a specified column from wine.data." However, the only functionality of this script is downloading the data and saving it locally. This is probably just a copy-paste error.

  4. In the PDF rendering of your report, Figure 2 is placed on the page following its cross-reference in the text. This honestly has little impact on the readability of the report, but I figured out how to fix this a while ago, so I thought I'd include it for your consideration. LaTeX figures are 'floating' by default, so their placement in rendered documents depends on both the figure size and the size of any abutting content. PDFs handle this less gracefully than HTML documents because they are divided into pages, which imposes layout constraints on the content. I fixed it by adding the fig-pos option to the YAML front matter, which lets you specify that figure placement in the rendered document should match placement in the Quarto document:
    format:
      pdf:
        fig-pos: "H"

  5. This comment concerns the data analysis more than the workflow as a whole. In the Methods section of the report, it is mentioned that steps are taken to balance the classes of the classification variable. I think it would benefit readers to provide more information on the specific process used to balance the classes, as well as on any feature engineering that took place. Additionally, you may want to consider evaluating your model with metrics such as F1 score, precision, recall, or AUC, which are generally favoured over accuracy when classes are imbalanced. The Analysis section of the report references a confusion matrix; possibly you generated one in a previous iteration of the project and it did not make it into the final report. In case you are referring to Table 2: Model Evaluation Metrics, that is not actually a confusion matrix: Table 2 lists predictions with their actual values and counts, while a confusion matrix cross-tabulates predicted classes against actual classes. It is possible to generate a confusion matrix for a 3-class classification, and its counts would let you calculate the other metrics I mentioned (see the R sketch after this list).
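
A hypothetical Dockerfile fragment for points 1 and 2 (the base image and package list here are assumptions, not your actual Dockerfile): it installs testthat explicitly so that testthat::test_dir() can run inside the container, and it consolidates the package installs into a single RUN layer:

    # Sketch only: base image and package list are assumed
    FROM rocker/r-ver:4.3.1

    # One RUN command (one layer) instead of a separate RUN per package;
    # installing testthat here ensures testthat::test_dir() works in the container
    RUN Rscript -e "install.packages(c('devtools', 'testthat'), repos = 'https://cloud.r-project.org')"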
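
For point 5, a minimal base-R sketch of a 3-class confusion matrix and the metrics derived from it, assuming a data frame preds with columns actual and predicted (both names are hypothetical; adjust to match your code):

    # Cross-tabulate predictions against true labels (3 x 3 for three cultivars)
    cm <- table(Predicted = preds$predicted, Actual = preds$actual)
    print(cm)

    # Per-class precision, recall, and F1 computed from the matrix
    precision <- diag(cm) / rowSums(cm)  # correct / all predicted as that class
    recall    <- diag(cm) / colSums(cm)  # correct / all truly in that class
    f1        <- 2 * precision * recall / (precision + recall)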

Overall, I thought the entire analysis was really well done! It is tidy and well organized, the report offers thoughtful commentary on meaningful results, and the errors are generally small and self-contained. Please forgive me if any of my comments are flawed - I did my project in Python, so there was a bit of a knowledge gap around some of the R-specific naming conventions and packages, but reviewing your work has proven to be an excellent study tool! Please do let me know if any of my comments are unclear; I'd be happy to provide a more thorough explanation. Well done, group 5!

alexlin519 commented 2 months ago

Data analysis review checklist

Reviewer: alexlin519

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

In general, this is a well-written report that effectively communicates the key findings of the wine classification project using KNN. The report has a clear structure and concise writing, making it easy to follow the analysis process. This is coupled with effective use of exploratory data analysis and a good understanding of the KNN model and feature selection.

Incorporating the suggested improvements would provide a more comprehensive picture of the methodology and testing of the package.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Steve-JJ-He commented 2 months ago

Data analysis review checklist

Reviewer: Steve-JJ-He

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

Overall, this project has a solid foundation with clear research objectives and modelling approach. The report provides detailed explanations of the data and methodology and presents clear visualizations from exploratory analysis. The structure of the project is clear and easy to follow, with necessary documentation provided.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.