Open Kingslin0810 opened 2 years ago
Overall the project was well executed and the final report was well organized. It clearly stated the objectives, the data used, the prediction methodology, and the results and limitations. There were also many references to prior research, which made the case solid. Well done team!
fork the repository first rather than asking them to clone directly. Refer to README.MD. It may be a good idea to modify some instructions in the CONTRIBUTING file too.
KeyError: "['quality'] not found in axis" in function split_for_train_test. Steps to reproduce:
After I changed the separator used when reading the CSV files, it worked:
red_df = pd.read_csv(input_red, sep=",")
white_df = pd.read_csv(input_white, sep=",")
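The KeyError above is consistent with a CSV being read with the wrong delimiter: the whole header collapses into a single column, so no column named quality exists. As a minimal sketch of a defensive check (the helper names detect_separator and read_header, and the candidate list, are hypothetical and not part of the project), the separator could be chosen by testing which candidate makes the required column appear in the header:

```python
import csv


def detect_separator(header_line, required_column="quality", candidates=(";", ",", "\t")):
    """Return the first candidate delimiter under which the header contains required_column."""
    for sep in candidates:
        # csv.reader copes with quoted column names such as "fixed acidity"
        columns = next(csv.reader([header_line], delimiter=sep))
        if required_column in (name.strip() for name in columns):
            return sep
    raise ValueError(f"No candidate separator yields a column named {required_column!r}")


def read_header(path):
    """Read only the first line of a text file (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return handle.readline().rstrip("\r\n")
```

Under these assumptions the result could be passed on to the reader, for example pd.read_csv(input_red, sep=detect_separator(read_header(input_red))), so that a wrongly delimited file fails early with a clear message rather than with a KeyError further down the pipeline.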
I suggest including an end-to-end test in the process: either create a new environment or remove the existing working files before each execution.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
1) Thorough EDA, but I did find the correlation matrix a bit overwhelming. Have you thought about switching it with the heatmap you have in the EDA document? I personally found that one easier to digest.
2) Adding a yaml file is so thoughtful. Very much appreciated! One minor comment about it: I tried to use it from my local repo directly (instead of downloading and saving it) but got an error, since the file name is env-wine-prediction.yaml (not wine.yaml as per the instructions). It's a minimal fix, but renaming the file (or changing the instructions) so they match would make things smoother.
3) Some flexibility in the Wine_Score_EDA.py script would be convenient. Ideally I would like to specify the output directory. Alternatively, the script could create the directory results if it doesn't exist. But at minimum I would state in the docopt usage and the docstrings that a folder named results is required, since right now the script just crashes without it.
4) Is it possible to add DummyRegressor to the list of models? I believe that would help to understand the baseline, and hopefully the complexity of the data set.
5) Based on the low scores obtained by SVC and Ridge, it seems like you are dealing with features that are non-linearly related to the target. Have you considered adding PolynomialFeatures, for instance?
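To make the PolynomialFeatures suggestion in point 5 concrete, here is a plain-Python sketch of what a degree-2 expansion produces for one row. This is an illustration only, not the scikit-learn implementation; the function name polynomial_features and the two-feature row are assumptions made for the example. For a row [a, b] it yields [1, a, b, a^2, ab, b^2], the same ordering scikit-learn documents for its transformer.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree=2):
    """Expand one row of numeric features into all monomials up to `degree`."""
    features = []
    for d in range(degree + 1):
        # d = 0 gives the bias term (empty product = 1), d = 1 the original
        # features, d = 2 the squares and pairwise interaction terms, and so on.
        for combo in combinations_with_replacement(row, d):
            features.append(prod(combo))
    return features


# Illustrative row with two made-up feature values.
expanded = polynomial_features([2, 3])
```

A linear model such as Ridge fitted on the expanded columns can then capture squared and interaction effects that it cannot express on the raw features, at the cost of a rapidly growing number of columns.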
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Overall, this project was put together well and I enjoyed reading a report with good narration.
In model_fitting.py you might do some cleanup to remove the redundant comments and functions that are not used anywhere. For example, the mape function and mape_scorer are never used and can be taken out. tune_hyperparameters_for_best_model() is a very long name for a function, and a name like tune_hyperparameters() describes the same thing.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Comment from @andytai7: "What about missing data? How will you handle the missing data if there is some? (doesn't matter if there isn't any you should propose a method before EDA)": Additional function created in EDA script: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/39187d471ba830443aaeeab413284a77f96bd5bf Remarks in final report regarding missing data: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/43d5c286728345c9307702b6973986528e4ea442
Comment from @mmaidana24318: "Some flexibility in the Wine_Score_EDA.py script would be convenient. Ideally I would like to specify the output directory. Alternatively, the script could create the directory results if it doesn't exist" Added "try/except" logic with creating new folder if it does not exist: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/78e0e99a4b01d61870471dcfd82fd20924bb1ea2
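The linked commit is not reproduced here. As a rough sketch of the same idea, assuming the script receives the output directory as a string, the folder can be created up front so that later file writes do not crash. The helper name ensure_output_dir is hypothetical, and this version relies on exist_ok=True instead of the try/except logic mentioned above.

```python
from pathlib import Path


def ensure_output_dir(out_dir="results"):
    """Create out_dir (and any missing parents) if needed and return it as a Path."""
    path = Path(out_dir)
    # exist_ok=True makes the call a no-op when the directory already exists
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling the helper once at the start of the script keeps the remaining plotting and saving code free of directory checks, and calling it repeatedly is harmless.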
Comment from @andytai7: "There could be an imbalance in the classes, in which you would have to under-sample or oversample. Which one will you utilize? What packages will you use will you create synthetic?" Imbalance is handled by modifying this script: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/55a4c5afde84b3186c96dedd5f6978279fca9926 Updated yaml with library for treating imbalance issue: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/3c3158c543efbfa65b354d14507bc2ab900ee50c
Comment from @artanzand: "In model_fitting.py you might do some clean up to remove the redundant comments and functions that are not used anywhere. For example, the mape function and mape_scorer is never used and can be taken out. tune_hyperparameters_for_best_model() is very long for a function and a name like tune_hyperparameters() describes the same." Made modifications as suggested: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/5f8525d3881c5a61628d9e4060b4654f40da00d3
Comment from @mmaidana24318: "Is possible to add DummyRegressor to the list of models? I believe that would help to understand the baseline, and hopefully the complexity of the data set." Added DummyRegressor as suggested: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/ab93a470cd26f8073a449d58b6866d56f78b0be6
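For readers unfamiliar with the baseline, the following plain-Python sketch mimics what DummyRegressor does with its default mean strategy: it ignores the features and always predicts the mean of the training targets. It is not the project's code; the function names and the quality scores are made up for illustration.

```python
from statistics import mean


def fit_mean_baseline(y_train):
    """Return a predictor that always outputs the training mean."""
    constant = mean(y_train)
    return lambda n_samples: [constant] * n_samples


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up quality scores, for illustration only.
y_train = [5, 6, 5, 7, 6, 5]
y_test = [6, 5, 7]

predict = fit_mean_baseline(y_train)
train_r2 = r_squared(y_train, predict(len(y_train)))  # approximately 0 by construction
test_r2 = r_squared(y_test, predict(len(y_test)))     # at most 0 for any constant prediction
```

Because a constant prediction can never score above 0 on R^2, any real model's score can be read as its improvement over this baseline.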
Comment from @artanzand: "In the Results & Discussion section, there is mention of this being a regression problem, but then report suggests using a OneVsRestClassifier and SVC. This is confusing to the reader as these two are classification models. Maybe use SVR and mention you are using OneVsRestClassifier with LogisticRegression" LogisticRegression and OneVsRestClassifier were removed and SVR was added: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/dd64d94a234a1c75c6d0e296d80509ed45539f18
Submitting authors: Kingslin0810, manju-abhinandana, , zackt113, PavelLevchenko
Repository: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor
Report link: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/blob/main/doc/Wine_Quality_Score_Predictor_report.md
Abstract/executive summary: The aim of this project is to predict the quality of wine on a scale of 0 to 10 given a set of physicochemical features rated by wine test reviewers as inputs. This model is useful to support wine tasting evaluations. The data set for this project is related to red and white vinho verde wine samples from Portugal, created by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. It is sourced from the UCI Machine Learning Repository and can be found here. Each row in the data set represents a wine, with its label (red or white) and its physicochemical properties, which include fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, and sulphates.
We built a regression model using Ridge, One-Vs-Rest Logistic Regression, SVC, and Random Forest Regressor. Running through cross-validation, we found that the Random Forest Regressor delivers a much higher training score, but there was a clear case of overfitting. We then ran hyperparameter optimization in an attempt to improve the model. Unfortunately, the test score with the best hyperparameters was only around 0.53. By analyzing feature coefficients, we reduced the model to 10 features. Some features have low coefficients, as expected from our initial EDA. In the coming weeks, we intend to refine our model further and achieve a higher test score if possible.
Editor: @flor14
Reviewers: Zandian_Artan, Maidana_Melisa, Siu_Thomas