UBC-MDS / Wine_Quality_Predictor

MIT License
1 stars 6 forks source link

Feedback from Mark Wang #55

Closed ZIBOWANGKANGYU closed 3 years ago

ZIBOWANGKANGYU commented 3 years ago

Hi everyone,

Your work is really impressive. We are working on the same datasets with slightly different approaches. I really learned a lot from you.

Please see below my feedback:

In general, your code is very well documented.

The usage() part in wine_eda.py is really sweet! I believe it will be really helpful to users. Comments in the plot generation parts of wine_eda.py seem insufficient.

Your code is generally well-organized and well-written. I like the use of try and except to allow users to use new folders.

models_c_revised is not generated pragmatically. Maybe you should consider directly render the table out from the Rmd file. I cannot find the code that generated the model performance metrics in models_c_revised in your repo.

Your analyses are clear and logical. I believe the choice of MLPClassifier since wine scores are really people's perception.

You used processed.csv, instead of processed_train.csv to conduct EDA with. This probably violates the golden rule.

You should include all numeric scores (0 to 10) in your grouping in pre_processing_wine.py since although we only have 3 - 9 in the data we have, we do not know what we will encounter in the deployment set. Also, it would probably be better if you include grouping as part of your pipeline. "Knowing" there are only 3-9 scores slightly violates the golden rule since you technically are not supposed to see the test split.

I am confused about your final model selection. Does not the random forest model have better test f1 and accuracy than MLP?

You write very well and the figures are generally well-thought. In README.md, you used the word "unbiased" to describe your model. I agree that your model will not have biased caused by humans tasting. However, "unbiased" is a term that mean certain things in statistics. Your model may not be able to and do not have to, completely prevent them. Maybe you can use another word instead.

You generated a visualization for the distribution of wine types (distribution_of_type_of_wine). I think this is unnecessary. It should be consolidated with the other two charts for the distribution of wine qualities.

Also, for distribution_of_numeric_features.png, there is no need to repeat titles for each sub-plot.

Figure 1 in the "Results & Discussion" section of your final report is not really an analysis of your model. I would be more interested in feature importance. But understandably in DSCI573, we have not learned about feature importance in neural networks yet.

athy9193 commented 3 years ago

Hi @ZIBOWANGKANGYU, thank you so much for your feedback. To follow up:

Again, thank you so much Mark for your detailed feedback, we've learn greatly from this :) Cheers!