NikitaShymberg opened this issue 2 years ago
(Note: These comments are time-based)
- `split.py` doesn't have a check to verify that the expected input file is a `.csv` file. What happens when it is not a `.csv` file? I would suggest you write a check (a try/except) to catch unexpected cases and inform the user to use a `.csv` file.
- There is also no check that the `.csv` file has the expected column names as used in your analysis. I would suggest writing a check to verify that the input `.csv` has the expected data columns.
- `eda.ipynb`
- Analysis report
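The input-validation suggestion above could look something like this minimal sketch. The function name, expected column list, and error messages here are hypothetical, not taken from the project:

```python
import csv

# Columns the downstream analysis expects (hypothetical list, for illustration).
EXPECTED_COLUMNS = {"fixed acidity", "volatile acidity", "quality"}

def validate_input(path):
    """Raise ValueError unless `path` is a readable .csv with the expected columns."""
    if not path.lower().endswith(".csv"):
        raise ValueError(f"Expected a .csv file, got: {path}")
    try:
        with open(path, newline="") as f:
            header = next(csv.reader(f), [])
    except OSError as e:
        raise ValueError(f"Could not read {path}: {e}") from e
    missing = EXPECTED_COLUMNS - set(header)
    if missing:
        raise ValueError(f"Input csv is missing columns: {sorted(missing)}")
```

A call like this at the top of the script's `main()` would let it fail with a clear message instead of an obscure pandas error further down.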
Kind remarks: There may be typos here 😄.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
OVERALL COMMENTS:
Overall, I really like your project. You did a great job at describing how the problem you were solving was important and how your findings could be utilized. The quality of wine did not seem important to me, but you made it very clear how your work would be impactful to the industry.
The final report would be stronger if it included more specific details. When reading, it felt like the modeling results were overlooked, which are one of the most interesting parts of your project. There were a lot of general statements about the models and methods being used, but what I would have found more interesting were details about which tuned parameters were the best, how all the model scores compare, and what specifics you decided on for data cleaning. I was not convinced your final model was the best because I had no data to tell me it was better than the other models you tried. Remember that your reader likely has a background in data analysis methods, and they are going to your report to look for specifics on what was done and what you found.
It would be good to check over your rendered final report. Several of the figures looked like they were cut off, and it would be helpful if you added captions. Also, you worked so hard on this, so make sure you put your names in the report!
I really liked how the ReadMe was set up with clear sections. It made the document clear and easy to follow. In the usage section, adding some style formatting might make it clearer as to what is code and what are comments.
You clearly did a lot of work on the model. The model script is very easy to follow and shows all the hyperparameter tuning that went into finding the best model for the problem. It would be really nice to see a summary of this work in the final report. I think it would make it clearer that the model you chose was the best. You put a ton of work into your model so it would be nice to showcase it more in the report!
Adding more specifics to your summary would be beneficial. As the reader, I want to immediately know some specifics about your EDA, hyperparameter tuning, and most importantly, your results. At least in the scientific community, it is expected that reading the summary/abstract gives you a condensed version of what was done, what the findings were, and what conclusions you came to.
SMALL ERRORS I NOTICED:
Overall, I enjoyed reading your interesting data analysis. I am impressed by how you challenged yourself to work with different algorithms to perform the task. However, in my opinion, there is still room for improvement, as follows:
- There is an `environment.yaml` file in your root folder; it would also be nice if you could explain how to install the environment in the `README.md` file.
- Some script names in the `README.md` file do not match the files in the repository. For example, in the `src` folder the script's name is `analyze.py`, while in the `README.md` file it is `analysis.py`. The same goes for the `eda.py` script.
- In the Methods part of the final report, you indicate that your task focuses on which white wine features are important for getting promising results. It would be clearer if you could summarise your conclusions on feature importance in this part: which features were chosen, and why.
- Methods: references can be added in this part, for example for `sklearn`.
- In the Results and Discussion part of the final report, it would be great if the results of the other models were also displayed, to show how much better the KNN model is. The best parameters have not been discussed yet.
Could not complete this part
README and folder organization

- Dependencies are listed in the `environment.yaml` file. It would be easier to see this information in the README under Dependencies.
- The instructions rely on conda commands (`conda env create -f environment.yaml`). People may not know how to run these commands.

Replicating the project

- It would help to include `conda activate somelier` in the instructions. If someone has not used conda before, they may be unsure what to do. The same goes for the deactivation process after the project is done.

The report

Side note regarding Make: I replaced my username with `---` and the folder path with the word `mypath` for privacy reasons. This was not listed in the instructions and I believe it is unnecessary for this review, but maybe it could be helpful for you:
```
$ make all
python src/download_data.py --url=http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip --path=data/raw/
Data downloaded to data/raw/
python src/split.py data/raw/winequality/winequality-white.csv data/processed
python src/ml_models.py data/processed results/raw_results
Starting to train Dummy...
C:\Users\---\miniconda3\envs\somelier\lib\site-packages\sklearn\model_selection\_search.py:292: UserWarning: The total space of parameters 2 is smaller than n_iter=10. Running 2 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Finished training Dummy!
------------
Starting to train Ridge...
C:\Users\---\miniconda3\envs\somelier\lib\site-packages\sklearn\model_selection\_search.py:292: UserWarning: The total space of parameters 6 is smaller than n_iter=10. Running 6 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Finished training Ridge!
------------
Starting to train Random Forest...
Finished training Random Forest!
------------
Starting to train KNN...
Finished training KNN!
------------
Starting to train Bayes...
Finished training Bayes!
------------
Starting to train SVM...
Finished training SVM!
------------
python src/analyze.py --r_path=results
best_model.csv created at location /results/
python src/EDA.py data/processed/X_train.csv data/processed/y_train.csv results
Traceback (most recent call last):
  File "C:\Users\---\mypath\predicting_wine_quality\src\EDA.py", line 93, in <module>
    main()
  File "C:\Users\---\mypath\predicting_wine_quality\src\EDA.py", line 44, in main
    chart1.save(
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair\vegalite\v4\api.py", line 476, in save
    result = save(**kwds)
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair\utils\save.py", line 112, in save
    mimebundle = spec_to_mimebundle(
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair\utils\mimebundle.py", line 60, in spec_to_mimebundle
    return altair_saver.render(spec, format, mode=mode, **kwargs)
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\_core.py", line 257, in render
    mimebundle.update(saver.mimebundle(fmt))
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\savers\_saver.py", line 90, in mimebundle
    bundle[mimetype] = self._serialize(fmt, "mimebundle")
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\savers\_node.py", line 114, in _serialize
    spec = self._vl2vg(spec)
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\savers\_node.py", line 68, in _vl2vg
    return json.loads(vg_json)
  File "C:\Users\---\miniconda3\envs\somelier\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\---\miniconda3\envs\somelier\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\---\miniconda3\envs\somelier\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 2)
make: *** [Makefile:17: relationship_between_individual_features_and_the_quality_3.png] Error 1
```
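As an aside on the `UserWarning` in the log above: scikit-learn emits it whenever `RandomizedSearchCV` is asked for more iterations than the parameter grid contains, in which case it silently falls back to an exhaustive search. A minimal reproduction, using toy data and a hypothetical two-value grid rather than the project's actual search spaces:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(size=100)

# Only 2 parameter combinations but n_iter=10: sklearn warns and runs
# just 2 iterations, exactly as in the Dummy and Ridge searches above.
search = RandomizedSearchCV(Ridge(), {"alpha": [0.1, 1.0]}, n_iter=10, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Lowering `n_iter` to the grid size, or switching those two small searches to `GridSearchCV`, would silence the warning without changing the result.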
Thank you all for the feedback!
Submitting authors: @NikitaShymberg @gutermanyair @aldojasb @SonQBChau
Repository: https://github.com/UBC-MDS/predicting_wine_quality
Report link: https://github.com/UBC-MDS/predicting_wine_quality/blob/main/doc/Quality_white_wine_predictor.pdf
Abstract/executive summary:
This report uses the white wine database from "vinho verde" to predict the quality based on physicochemical properties. Quality is a subjective measure, given by the average grade of three experts.
Before starting the predictions, the report performs an exploratory data analysis (EDA) to look for features that may provide good prediction results, and also gives a short explanation of the metrics used in the models. In data preparation, the database is downloaded and processed in Python. In this phase, the training and testing sets are created; they will be used during model building.
There's a brief explanation of the models used in this report. Other important machine learning concepts, such as ensemble and cross validation, are also discussed.
The results section presents the best model for predicting quality and discusses why it was chosen for this purpose.
Editor: @flor14
Reviewer: