UBC-MDS / data-analysis-review-2021

Submission: GROUP_19: Wine_Quality_Score_Predictor #4

Open Kingslin0810 opened 2 years ago

Kingslin0810 commented 2 years ago

Submitting authors: Kingslin0810, manju-abhinandana, zackt113, PavelLevchenko

Repository: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor

Report link: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/blob/main/doc/Wine_Quality_Score_Predictor_report.md

Abstract/executive summary: The aim of this project is to predict the quality of wine, on a scale of 0 to 10 as rated by wine tasting reviewers, given a set of physicochemical features as inputs. This model is useful to support wine tasting evaluations. The data set for this project relates to red and white vinho verde wine samples from Portugal and was created by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. It is sourced from the UCI Machine Learning Repository. Each row in the data set represents a wine, labelled red or white, and its physicochemical properties, which include fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, and sulphates.

We built regression models using Ridge, One-Vs-Rest Logistic Regression, SVC, and Random Forest Regressor. In cross-validation, the Random Forest Regressor delivered a much higher training score than the other models, but it was clearly overfitting. We then ran hyperparameter optimization in an attempt to improve the model. Unfortunately, the test score with the best hyperparameters was only around 0.53. By analyzing the feature coefficients, we reduced the model to 10 features. Some features have low coefficients, as expected from our initial EDA. In the coming weeks, we intend to refine our model further and achieve a higher test score if possible.

Editor: @flor14

Reviewers: Zandian_Artan, Maidana_Melisa, Siu_Thomas

thomassiu commented 2 years ago

Data analysis review checklist

Reviewer: @thomassiu

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Overall, the project was well executed and the final report was well organized. It clearly stated the objectives, the data used, the methodology for carrying out the prediction, and the results and limitations. The many references to prior research also made the case solid. Well done, team!

  1. Suggest asking contributors to fork the repository first rather than asking them to clone it directly. Refer to README.MD. It may be a good idea to modify some instructions in the CONTRIBUTING file too.
  2. The script clean_split.py could not be executed; it failed with the error KeyError: "['quality'] not found in axis" in the function split_for_train_test. Steps to reproduce:
    • Create a new conda environment with the yaml file provided
    • Execute script download_data.py
    • Execute script clean_split.py

After I changed the separator of reading the csv it worked:

red_df = pd.read_csv(input_red, sep=",")
white_df = pd.read_csv(input_white, sep=",")

Suggest including end-to-end testing in the process, run either in a newly created environment or after removing the existing working files before each execution.
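One way to guard against the separator problem described above is to try more than one separator and check that the expected column is present before continuing. The sketch below is only an illustration: the helper name read_wine_csv, the candidate separators, and the tiny demo files are assumptions, not code from the repository.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_wine_csv(path, required_columns=("quality",), separators=(";", ",")):
    """Hypothetical helper: try each separator until the required columns appear."""
    for sep in separators:
        df = pd.read_csv(path, sep=sep)
        if set(required_columns).issubset(df.columns):
            return df
    raise ValueError(
        f"None of the separators {separators} produced columns "
        f"{required_columns} in {path}"
    )


# Demo with two made-up files, one per separator.
with tempfile.TemporaryDirectory() as tmp:
    semi_path = Path(tmp) / "red.csv"
    semi_path.write_text("fixed acidity;pH;quality\n7.4;3.51;5\n")
    comma_path = Path(tmp) / "white.csv"
    comma_path.write_text("fixed acidity,pH,quality\n7.0,3.00,6\n")

    red_df = read_wine_csv(semi_path)
    white_df = read_wine_csv(comma_path)
```

Failing early with a descriptive error, rather than a KeyError deep inside the split function, would make this kind of problem easier to diagnose on a fresh clone.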

  3. Suggest adding a new parameter to the script Wine_Score_EDA.py that allows contributors to select where the output files are saved. This was one of the requirements in milestone 2.
  4. It would be good to align the tense throughout the final report. For example, the paragraph below the correlation matrix mixes the past tense with the simple future tense.
  5. It would be great if captions were added to the graphs shown in the Analysis section.
  6. Add a flow chart of how the scripts are executed to the README (as per the requirement in milestone 2).
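The suggestion to let contributors choose where Wine_Score_EDA.py saves its output could look roughly like the sketch below. It uses argparse from the standard library as a stand-in for whatever argument parser the script actually uses; the --output option name, its default, and the placeholder file are assumptions for illustration.

```python
import argparse
import tempfile
from pathlib import Path


def parse_args(argv=None):
    """Parse a hypothetical --output option naming the output directory."""
    parser = argparse.ArgumentParser(description="EDA script sketch")
    parser.add_argument(
        "--output",
        default="results",
        help="directory for output files (created if it does not exist)",
    )
    return parser.parse_args(argv)


def prepare_output_dir(output):
    """Create the output directory, including parents, if it is missing."""
    out_dir = Path(output)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


# Demo: point the script at a directory that does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    args = parse_args(["--output", str(Path(tmp) / "results" / "eda")])
    out_dir = prepare_output_dir(args.output)
    created = out_dir.is_dir()
    (out_dir / "summary.txt").write_text("placeholder output\n")
    written = (out_dir / "summary.txt").exists()
```

Passing exist_ok=True means the call succeeds whether or not the directory is already there, so the script does not need separate handling for the two cases.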

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

mmaidana24318 commented 2 years ago

Data analysis review checklist

Reviewer: @mmaidana24318

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

  1. Thorough EDA, but I did find the correlation matrix a bit overwhelming. Have you thought about switching it with the heatmap you have in the EDA document? I personally found that one easier to digest.
  2. Adding a yaml file is so thoughtful. Very much appreciated! One minor comment about it: I tried to use it from my local repo directly (instead of downloading and saving it) but got an error, since the file name is env-wine-prediction.yaml (not wine.yaml as per the instructions). It's a minimal fix, but renaming the file (or changing the instructions) so that they match would make things smoother.
  3. Some flexibility in the Wine_Score_EDA.py script would be convenient. Ideally I would like to specify the output directory. Alternatively, the script could create the directory results if it doesn't exist. But at minimum I would state in the docopt usage and the docstrings that a folder named results is required, since right now the script just crashes without it.
  4. Is it possible to add DummyRegressor to the list of models? I believe that would help to understand the baseline, and hopefully the complexity of the data set.
  5. Based on the low scores obtained by SVC and Ridge, it seems like you are dealing with features that are non-linearly related to the target. Have you considered adding PolynomialFeatures, for instance?
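The baseline and non-linearity suggestions above can be illustrated with a small scikit-learn sketch. The data here is synthetic and deliberately non-linear, so the numbers say nothing about the wine data set; the model names in the dictionary and the choice of degree 2 are assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: the target depends on a square and an interaction term.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300)

models = {
    # Always predicts the training mean; serves as the baseline.
    "dummy": DummyRegressor(strategy="mean"),
    # Linear model on the raw (scaled) features.
    "ridge": make_pipeline(StandardScaler(), Ridge()),
    # Same linear model after expanding to degree-2 polynomial features.
    "poly_ridge": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), Ridge()
    ),
}

# Mean 5-fold cross-validated R^2 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
```

If a linear model scores close to the dummy baseline while the polynomial pipeline scores much higher, that pattern would support the non-linearity hypothesis; whether it holds for the wine features would need to be checked on the real data.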

artanzand commented 2 years ago

Data analysis review checklist

Reviewer: @artanzand

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Overall, this project was put together well and I enjoyed reading a report with good narration.

  1. General comment: the figures are not numbered, which makes referencing them hard. The report references figure numbers, but I was not able to see any labels on the figures, so I recommend adding them. Also, adding a title to every figure would let the reader scan the report more quickly.
  2. In the EDA section for the faceted plot, it might be better to set the stack for the histogram to false, and use opacity to show the relative frequencies of the two wine types. Currently it is hard for example to see the distribution for red wine.
  3. It would be necessary to mention the size of the data to know how significant your results are.
  4. In the Results & Discussion section, this is described as a regression problem, but then the report suggests using OneVsRestClassifier and SVC. This is confusing to the reader, as these two are classification models. Maybe use SVR and mention that you are using OneVsRestClassifier with LogisticRegression.
  5. Can MISC folder be deleted if the ipynb files are not used anymore?
  6. In model_fitting.py you might do some clean-up to remove the redundant comments and the functions that are not used anywhere. For example, the mape function and mape_scorer are never used and can be taken out. tune_hyperparameters_for_best_model() is a very long name for a function, and a name like tune_hyperparameters() describes the same thing.
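To illustrate the SVR suggestion in item 4, the sketch below fits scikit-learn's SVR inside a scaling pipeline and scores it with R^2, which is what a regression framing implies. The data is synthetic, and the kernel, C value, and variable names are assumptions, not the settings used in model_fitting.py.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic continuous "quality" score driven by two of three features.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 5 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVR is sensitive to feature scale, so standardize inside the pipeline.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
svr.fit(X_train, y_train)

preds = svr.predict(X_test)    # continuous predictions, not class labels
r2 = svr.score(X_test, y_test)  # R^2, the default score for regressors
```

Unlike SVC, which returns one of the discrete classes seen in training, SVR returns real-valued predictions, so its score is directly comparable with those of Ridge and Random Forest Regressor.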

plevchen commented 2 years ago

  1. Comment from @andytai7: "What about missing data? How will you handle the missing data if there is some? (doesn't matter if there isn't any you should propose a method before EDA)"
    • Additional function created in EDA script: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/39187d471ba830443aaeeab413284a77f96bd5bf
    • Remarks in final report regarding missing data: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/43d5c286728345c9307702b6973986528e4ea442

  2. Comment from @mmaidana24318: "Some flexibility in the Wine_Score_EDA.py script would be convenient. Ideally I would like to specify the output directory. Alternatively, the script could create the directory results if it doesn't exist"
    • Added try/except logic that creates the folder if it does not exist: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/78e0e99a4b01d61870471dcfd82fd20924bb1ea2

  3. Comment from @andytai7: "There could be an imbalance in the classes, in which you would have to under-sample or oversample. Which one will you utilize? What packages will you use will you create synthetic?"
    • Imbalance is handled by modifying this script: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/55a4c5afde84b3186c96dedd5f6978279fca9926
    • Updated the yaml file with a library for treating the imbalance issue: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/3c3158c543efbfa65b354d14507bc2ab900ee50c

  4. Comment from @artanzand: "In model_fitting.py you might do some clean up to remove the redundant comments and functions that are not used anywhere. For example, the mape function and mape_scorer is never used and can be taken out. tune_hyperparameters_for_best_model() is very long for a function and a name like tune_hyperparameters() describes the same."
    • Made modifications as suggested: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/5f8525d3881c5a61628d9e4060b4654f40da00d3

  5. Comment from @mmaidana24318: "Is possible to add DummyRegressor to the list of models? I believe that would help to understand the baseline, and hopefully the complexity of the data set."
    • Added DummyRegressor as suggested: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/ab93a470cd26f8073a449d58b6866d56f78b0be6

  6. Comment from @artanzand: "In the Results & Discussion section, there is mention of this being a regression problem, but then report suggests using a OneVsRestClassifier and SVC. This is confusing to the reader as these two are classification models. Maybe use SVR and mention you are using OneVsRestClassifier with LogisticRegression"
    • LogisticRegression and OneVsRestClassifier were removed and SVR was added: https://github.com/UBC-MDS/DSCI_522_Group19_Wine_Quality_Score_Predictor/commit/dd64d94a234a1c75c6d0e296d80509ed45539f18
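For the class-imbalance question in item 3, the general idea of random oversampling can be sketched with the standard library alone. This is not the approach or the library used in the linked commits, which are not described in this thread; the function name random_oversample and the toy labels are assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    rng = random.Random(seed)
    target = max(Counter(labels).values())

    # Group the rows by their label.
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)

    out_rows, out_labels = list(rows), list(labels)
    for label, members in by_label.items():
        extra = target - len(members)
        # Sample with replacement from the existing rows of this class.
        out_rows.extend(rng.choice(members) for _ in range(extra))
        out_labels.extend([label] * extra)
    return out_rows, out_labels


# Toy example: quality scores 5, 6 and 8 with very different frequencies.
labels = [5] * 6 + [6] * 4 + [8] * 1
rows = [[float(i)] for i in range(len(labels))]

new_rows, new_labels = random_oversample(rows, labels)
balanced = Counter(new_labels)
```

Any resampling of this kind should be applied to the training split only, so that the validation and test scores still reflect the original class distribution.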