Closed ZIBOWANGKANGYU closed 3 years ago
Hi @ZIBOWANGKANGYU, thank you so much for your feedback. To follow up:
we have revised our script to produce `models_c_revised` programmatically as well.
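A minimal sketch of what generating such a metrics table programmatically could look like (the labels, predictions, and output file name are illustrative assumptions, not the actual script):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical test-set labels and per-model predictions; in the real
# script these would come from the fitted models.
y_true = [0, 1, 1, 0, 1]
preds = {"mlp": [0, 1, 0, 0, 1], "random_forest": [0, 1, 1, 0, 0]}

rows = []
for name, y_pred in preds.items():
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    })

# Writing the table from the script keeps the results reproducible;
# the report can then read this CSV instead of a hand-edited table.
pd.DataFrame(rows).to_csv("models_c_revised.csv", index=False)
```

The report (e.g. the `Rmd` file) can then load this CSV at render time, so the numbers in the report always match the last run.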
On your question about the wine categories: we understand from the data source that the wine-quality scale allows experts to grade wines from 0 to 10. However, the observed data only contains scores from 3 to 9; we think this may be because the experts did not award extreme scores such as 0 or 10. We therefore respect the original data and its description, although we noticed the same point when we first looked at the data :)
Moreover, the white-wine and red-wine data come as separate datasets, and we decided to combine them to obtain a larger sample for our analysis. In the same spirit of keeping the categories sizable, the only transformation we applied was to assign the quality scores to coarser quality buckets, which is a change within each individual row. Therefore, we do not think we are violating the golden rule here. We applied the same transformation to the test set and will do the same for any potential deployment set. We hope that addresses your concern. Thank you; this is a great discussion point!
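A sketch of the per-row bucketing described above (the cut-offs, labels, and column names are illustrative assumptions, not the authors' exact choices); the key point is that the identical mapping is applied to every split:

```python
import pandas as pd

# Illustrative cut-offs and labels; the report's actual buckets may differ.
def bucket_quality(score):
    """Map a 0-10 quality score to a coarser quality class (row-wise)."""
    if score <= 4:
        return "poor"
    elif score <= 6:
        return "average"
    return "good"

# The identical row-wise mapping is applied to every split, so no
# information from one split leaks into another.
train = pd.DataFrame({"quality": [3, 5, 6, 8]})
test = pd.DataFrame({"quality": [4, 7]})
train["quality_bucket"] = train["quality"].apply(bucket_quality)
test["quality_bucket"] = test["quality"].apply(bucket_quality)
```

Because the mapping depends only on the value in each row, applying it before or after the train/test split gives the same result.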
On why we chose the MLP over the random forest model despite the latter's better test f1: we will provide more justification in our report, but briefly, we think the results generated by the MLP are more stable.
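One concrete way to back up a stability claim is to compare the spread of cross-validated f1 scores across folds. A sketch on toy data (the real features and tuned hyperparameters would differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the wine features; not the actual dataset or the
# tuned hyperparameters from the report.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

results = {}
for name, model in [
    ("random_forest", RandomForestClassifier(random_state=0)),
    ("mlp", MLPClassifier(max_iter=2000, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    # A smaller standard deviation across folds is one way to argue
    # that a model's performance is more "stable".
    results[name] = (scores.mean(), scores.std())

for name, (mean, std) in results.items():
    print(f"{name}: mean f1 = {mean:.3f}, fold std = {std:.3f}")
```

Reporting both the mean and the fold-to-fold standard deviation would make the "more stable" argument explicit in the report.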
For communication, we took your comments on wording into account and revised the report accordingly. Thank you so much.
On the visualization of the distribution of wine types: this was suggested by Eric (TA), so we kept it for clarity :)
Again, thank you so much, Mark, for your detailed feedback; we've learned a great deal from it :) Cheers!
Hi everyone,
Your work is really impressive. We are working on the same datasets with slightly different approaches. I really learned a lot from you.
Please see my feedback below:

- In general, your code is very well documented. The `usage()` part in `wine_eda.py` is really sweet! I believe it will be really helpful to users. Comments in the plot-generation parts of `wine_eda.py` seem insufficient, though.
- Your code is generally well-organized and well-written. I like the use of `try` and `except` to allow users to use new folders.
- `models_c_revised` is not generated programmatically. Maybe you should consider rendering the table directly from the `Rmd` file. I cannot find the code that generated the model performance metrics in `models_c_revised` in your repo.
- Your analyses are clear and logical. I believe the choice of `MLPClassifier` is a reasonable one, since wine scores are really people's perception.
- You used `processed.csv` instead of `processed_train.csv` to conduct the EDA. This probably violates the golden rule.
- You should include all numeric scores (0 to 10) in your grouping in `pre_processing_wine.py`: although we only see scores of 3-9 in the data at hand, we do not know what we will encounter in the deployment set. It would also probably be better to include the grouping as part of your pipeline. "Knowing" that there are only 3-9 scores slightly violates the golden rule, since you are technically not supposed to see the test split.
- I am confused about your final model selection. Doesn't the random forest model have a better test f1 and accuracy than the MLP?
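On covering the full documented range: a sketch using `pd.cut` with bin edges spanning 0-10, so unseen extreme scores in a deployment set still receive a bucket (the cut points and labels are illustrative assumptions):

```python
import pandas as pd

# Bin edges spanning the full documented 0-10 scale, so a deployment
# score of 0 or 10 still falls into a bucket.
bins = [-0.5, 4.5, 6.5, 10.5]
labels = ["poor", "average", "good"]

scores = pd.Series([0, 3, 5, 9, 10])
buckets = pd.cut(scores, bins=bins, labels=labels)
print(list(buckets))
```

Wrapping this in the preprocessing step (e.g. a small function called by `pre_processing_wine.py`) would keep the deployment path identical to training.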
- You write very well and the figures are generally well thought out. In `README.md`, you used the word "unbiased" to describe your model. I agree that your model will not have biases caused by human tasting; however, "unbiased" is a term that means something specific in statistics, and your model may not be able to, and does not have to, completely avoid bias in that sense. Maybe you can use another word instead.
- You generated a visualization of the distribution of wine types (`distribution_of_type_of_wine`). I think this is unnecessary; it could be consolidated with the other two charts showing the distribution of wine qualities.
- Also, in `distribution_of_numeric_features.png`, there is no need to repeat the title on each sub-plot.
- Figure 1 in the "Results & Discussion" section of your final report is not really an analysis of your model. I would be more interested in feature importance, but understandably, in DSCI 573 we have not yet learned about feature importance for neural networks.
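On the repeated sub-plot titles: one option is a single shared figure title via `fig.suptitle`, with only the feature name on each panel (the feature names and data here are stand-ins, not the actual EDA code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt

# Stand-in feature names and data, purely for illustration.
features = ["fixed acidity", "pH", "alcohol", "density"]
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, name in zip(axes.flat, features):
    ax.hist([1, 2, 2, 3, 3, 3], bins=3)  # placeholder data
    ax.set_title(name)  # per-panel: just the feature name
fig.suptitle("Distribution of numeric features")  # one shared title
fig.tight_layout()
```

This keeps each panel's label informative while stating the common description only once.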