ttimbers opened 3 months ago
Your repository is missing a docker-compose.yml file. Because of this, I was unable to reproduce the analysis following the instructions specified in the README.md file. Recall that you need a docker-compose.yml file to use the docker compose commands specified in the README.md file; having just a Dockerfile isn't sufficient.
In the LICENSE.MD file, in the "Workflow for publishing the dockerfile" section, you forgot to change the copyright from "Copyright (c) 2024 Tiffany Timbers" to "Copyright (c) 2024 …".
In the README.md file, it says to run docker compose run --rm analysis-env make clean-all to remove previous versions of the analysis. But your Makefile does not contain a make clean-all target, only make clean, so that wouldn't work.
Another example is that inside your first script, 01_download_data.R, the documentation comments in lines 4 and 5 say "This script calculates the mean for a specified column from wine.data", which is obviously not what the script does.
Overall, great job! You should all be proud of the work that's been completed so far! The content of the analysis in particular is well done; just be sure that someone who has never worked with the analysis can run it entirely from top to bottom following your instructions.
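One low-effort way to resolve the clean-all mismatch, besides editing the README to say make clean, would be to alias clean-all to the existing target. This is only a sketch — the clean recipe below is invented for illustration, since I don't know what your actual Makefile removes:

```makefile
# Hypothetical Makefile addition: make `clean-all` delegate to the existing
# `clean` target so the README's `make clean-all` instruction works.
# The recipe shown for `clean` is illustrative; keep your real one.
.PHONY: clean clean-all

clean:
	rm -rf results/* report/*.pdf

clean-all: clean
```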
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
In the comment above, edsters78 does a good job of explaining the reproducibility issue caused by the docker-compose-reliant instructions in the Usage section of the README, despite the lack of a docker-compose.yml file. I tried to reproduce the analysis directly from the Dockerfile, in the hopes I would chance upon something that would be of use when you make adjustments to your Docker setup.
I pulled and ran the published Docker image with the following:
docker pull ajackman48/dsci-310-group-5:latest
docker run -it --rm ajackman48/dsci-310-group-5:latest
However, when I tried to run the function tests in the R terminal, I ran into the following error:
testthat::test_dir("tests/testthat")
Error in loadNamespace(x) : there is no package called ‘testthat’
At this point, I wasn't sure if the error I was running into was caused by a local dependency issue or the Dockerfile, so I decided to call it a day so as not to provide any irrelevant feedback. Just to verify that the testthat package is being installed correctly in the container, I would recommend running the docker build in a clean environment from top to bottom. As for the reproducibility concern, while not strictly necessary in this context or required for the assignment, I do think that incorporating a docker-compose.yml file into your workflow is the simplest fix; it would ensure standardization across different operating systems and manage dependencies better than a lone Dockerfile.
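For what it's worth, a minimal docker-compose.yml along these lines might be a starting point. The service name analysis-env matches the README's commands, but the mount path and working directory are assumptions on my part:

```yaml
# Hypothetical docker-compose.yml sketch; mount paths are guesses.
services:
  analysis-env:
    image: ajackman48/dsci-310-group-5:latest
    volumes:
      - .:/home/analysis      # mount the repo so make targets see the files
    working_dir: /home/analysis
```

With this in place, docker compose run --rm analysis-env make clean (or whatever target you settle on) should behave the same on every OS.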
Your Dockerfile has a fair amount of repeated code. A new layer is created in a Docker image each time a RUN instruction is added, producing a bigger image that takes longer to build. Consolidating your RUN instructions will reduce both image size and build time. I also noticed that devtools is installed twice in your Dockerfile (lines 19 and 22).
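As an illustration (this is not your actual Dockerfile — the package list here is made up), several single-package RUN instructions can collapse into one:

```dockerfile
# Before (illustrative): each RUN creates a separate image layer.
# RUN Rscript -e "install.packages('devtools')"
# RUN Rscript -e "install.packages('testthat')"
# RUN Rscript -e "install.packages('readr')"

# After: one RUN, one layer, and no duplicate devtools install.
RUN Rscript -e "install.packages(c('devtools', 'testthat', 'readr'))"
```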
In the script 01_download_data.R, it is stated that "This script calculates the mean for a specified column from wine.data." However, it appears that the only functionality of this script is downloading and saving the data locally. This is probably just a copy-paste error.
In the PDF rendering of your report, Figure 2 is placed on the page following its cross-reference in the text. This honestly has little impact on the readability of the report, but I found out how to fix this a while ago, so I thought I'd include it for your consideration. LaTeX figures are 'floating' by default, so their placement in rendered documents depends on both figure size and the size of any abutting content. PDFs don't handle this as well as HTML documents during rendering, because PDFs are divided into pages, imposing layout constraints on the content. I fixed it by adding the fig-pos option to the YAML front matter, which lets you specify that figure placement in the rendered document should match placement in the Quarto document:
format:
  pdf:
    fig-pos: "H"
This comment is more in regards to the data analysis than the workflow as a whole. In the Methods section of the report, it is mentioned that steps are taken to balance the classes of the classification variable. I think it would benefit readers to provide more information on the specific processes employed to balance the classes, as well as any feature engineering that took place. Additionally, you may want to consider evaluating your model with metrics such as F1 score, precision, recall, or AUC, which are generally favoured over accuracy when classes are imbalanced. The Analysis section of the report references a confusion matrix; possibly you generated one in a previous iteration of the project and it did not make it into the final report. In case you are referring to Table 2: Model Evaluation Metrics, this is not actually a confusion matrix. Table 2 lists predictions with their actual values and counts, while a confusion matrix displays counts of True Positive, False Positive, True Negative, and False Negative predictions. It is possible to generate a confusion matrix for a 3-class classification, and these counts could help you calculate the other metrics I mentioned.
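To illustrate (with invented labels, not your actual predictions), base R's table() produces a 3-class confusion matrix, and per-class precision, recall, and F1 fall out of its margins:

```r
# Toy 3-class example; the cultivar labels and predictions are made up.
actual    <- factor(c(1, 1, 2, 2, 2, 3, 3, 1, 2, 3), levels = 1:3)
predicted <- factor(c(1, 1, 2, 3, 2, 3, 3, 1, 2, 3), levels = 1:3)

# Rows = actual class, columns = predicted class.
cm <- table(actual, predicted)
print(cm)

# Per-class metrics from the confusion matrix counts.
precision <- diag(cm) / colSums(cm)   # TP / (TP + FP)
recall    <- diag(cm) / rowSums(cm)   # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
print(round(data.frame(precision, recall, f1), 2))
```

Averaging these per-class values (macro-averaging) gives a single score that, unlike accuracy, is not dominated by the majority class.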
Overall, I thought the entire analysis was really well done! It is tidy and well-organized, the report offers thoughtful commentary on meaningful results, and errors are generally small and self-contained. Please forgive me if any of my comments are flawed - I did my project in Python, so there was a bit of a knowledge gap regarding some of the R-specific naming conventions and packages, but reviewing your work has proven to be an excellent study tool! Please do let me know if any of my comments are not clear; I’d be happy to provide a more robust explanation. Well done group 5!
Missing scatter plot path error catching: The error handling in the R/create_boxplot.R function is commendable, particularly checking for the existence of the specified output directory and creating it if necessary. However, this functionality is currently absent in the R/create_scatter.R file. To ensure consistent error handling, it's recommended to incorporate a similar path check within the create_scatter function. Additionally, the corresponding unit tests in tests/testthat/helper-create_scatter.R should be expanded to include path validation tests as well, mirroring the approach taken in the helper-create_boxplot.R tests.
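A guard along these lines would mirror create_boxplot — note this is a hedged sketch, since I haven't matched the real create_scatter signature; the helper name and path argument are hypothetical:

```r
# Hypothetical directory guard; the real create_scatter signature may differ.
# Creates the output file's parent directory if it does not exist yet.
ensure_output_dir <- function(path) {
  dir <- dirname(path)
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
  }
  invisible(dir)
}

# Example: guarantees the figures directory exists before the plot is saved.
out <- file.path(tempdir(), "results", "figures", "scatter.png")
ensure_output_dir(out)
```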
Duplicate test file: Another small issue is a redundancy in the tests/testthat folder. Two files share nearly the same name: test-train_and_evaluate_knn_model.R and test_train_and_evaluate_knn_model.R. It appears that one of these files is a duplicate, as the code is identical. To ensure clarity and avoid potential confusion, it's recommended to delete the redundant file (test_train_and_evaluate_knn_model.R).
Missing comments in test files: While most functions are well-described with both docstring and inline comments, some files such as tests/testthat/test-calc_stats.R and tests/testthat/helper-create_output_dir.R lack comments. This could make it difficult for readers to understand their functionality quickly. Enhancing the documentation for these files with details about expected behavior and tested errors would provide greater clarity and make future modifications easier.
Redundant test data creation: The tests/testthat/helper-create_scatter_plot.R file currently includes code for creating the test data wine_test_data. Since a helper already exists for this purpose in tests/testthat/helper-create_output_dir.R, it's recommended to consolidate these helpers to improve code organization and reduce redundancy.
Visual regression testing with vdiffr could be used to further test the visualization plots.
In general, it is a well-written report that effectively communicates the key findings of the wine classification project using KNN. The report has a clear structure and concise writing, making the analysis process easy to follow. This is coupled with effective exploratory data analysis and a good understanding of the KNN model and feature selection.
Incorporating the suggested improvements would provide a more comprehensive picture of the methodology and testing of the package.
The README.md instructs users to run docker compose run --rm analysis-env make clean-all to remove previous versions of the analysis. However, the Makefile only contains the clean target, which might cause confusion or errors if users attempt to use clean-all.
Some preprocessing steps (e.g. read_csv, converting cultivar to a factor) are common across several scripts. Consider extracting these into a separate function that can be called consistently.
Overall, this project has a solid foundation with clear research objectives and modelling approach. The report provides detailed explanations of the data and methodology and presents clear visualizations from exploratory analysis. The structure of the project is clear and easy to follow, with necessary documentation provided.
Submitting authors: Zhibek Dzhunusova, Andrea Jackman, Kaylan Wallace, Chuxuan Zhou
Repository: https://github.com/DSCI-310-2024/dsci-310_group-5_predicting-wine-cultivars/releases/tag/milestone_3
Abstract/executive summary:
In this project, we asked whether we could predict what cultivar a wine was derived from based on its chemical properties.
The data was sourced from the UCI Machine Learning Repository (Aberhard et al. 1991). It contains data about 178 wines from Italy derived from three different cultivars. Each row represents the chemical and physical properties of a different wine, such as the alcohol concentration, magnesium level, and hue.
Using a k-nearest neighbors algorithm, we attempted to predict the cultivar of an unknown wine based on physiochemical properties such as alcohol content and magnesium levels. Our model achieved 98% accuracy on our test data, but struggled to accurately predict the origin of wines from cultivar 2. This demonstrates that most cultivars are strongly distinguished by their physiochemical properties, with some overlap between similar cultivars. Our analysis effectively identified chemical properties that strongly distinguish cultivars. This provides valuable information for winemakers about which grapes to cultivate to achieve their desired wine characteristics and informs future innovation in the wine industry.
Editor: @ttimbers
Reviewers: Edwin Yeung, Steve He, Alex Lin, Lillian Milroy