UBC-MDS / data-analysis-review-2021


Submission: Group 4: Poisonous_Mushroom_Predictor #25

Open · Kylemaj opened this issue 2 years ago

Kylemaj commented 2 years ago

Submitting authors: @dol23asuka @Kylemaj @Mahm00d27

Repository: https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor Report link: https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor/blob/main/doc/Poisonous_Mushroom_Predictor_Report.md Abstract/executive summary: Mushrooms have distinctive morphological characteristics that help in identifying whether they are poisonous or edible. In this project we built a logistic regression classification model that uses several of these characteristics to predict whether an observed mushroom is toxic or edible (non-toxic). Exploratory data analysis revealed definite distinctions between our target classes and highlighted several key patterns that could serve as strong predictors. On the test data set of 1,625 observations our model performed extremely well, with a 99% recall score and a 100% precision score. The model correctly classified 863 edible and 761 toxic mushrooms. One false negative result was produced (a toxic mushroom identified as non-toxic). In the context of this problem, a false negative could result in someone being seriously or even fatally poisoned, so we must be far more concerned with minimizing false negatives than false positives. Given this, we may consider tuning the threshold of our model to minimize false negatives at the potential cost of increasing false positives. Moving forward, we would like to further optimize our model and investigate whether we could get similar performance with fewer features. Finally, we would like to evaluate how our model performs on real observations from the field rather than hypothetical data.
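As a rough illustration of the threshold tuning mentioned in the summary, here is a minimal scikit-learn sketch; the data and variable names are synthetic stand-ins, not the project's actual code:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the mushroom data (class 1 = poisonous).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Lower the decision threshold below the default 0.5 so borderline
# mushrooms are flagged as poisonous: fewer false negatives at the
# potential cost of more false positives.
threshold = 0.2
proba_poisonous = clf.predict_proba(X_test)[:, 1]
y_pred = (proba_poisonous >= threshold).astype(int)

print("recall:   ", recall_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
```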

Editor: @flor14 Reviewers: Cui_Vera, Ye_Wanying, Taskaev_Vadim, Lv_Kingslin

Kingslin0810 commented 2 years ago

Data analysis review checklist

Reviewer: @Kingslin0810

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Overall, the project is well executed, and the final report clearly states the objective, the data used, and the methodology for carrying out the prediction, as well as the results and limitations. There are also many references to related research, which makes the case solid. Good job team!

  1. It is very handy to have env-mushroom.yaml for audiences to reproduce the project; however, the 'python' package listings in README.md and env-mushroom.yaml are redundant. Also, I suggest adding a brief, high-level one- to two-sentence description to the 'About' section at the top right of your project repository.
  2. The environment file URL in your CONTRIBUTING.md is broken. I suggest asking contributors to fork the repository first and then create a pull request for contributions.
  3. The README.md doesn't summarize how well your model performs, and I found the amount of information provided a little overwhelming; would it be redundant with your final report? I suggest briefly introducing your data initiative and the model's performance.
  4. A few LaTeX equations didn't render properly in your final report, such as $$recall = \frac{TP}{TP+FN} = \frac{3152}{3152+2} \approx 0.99$$
  5. The cross-validation scores and test-data scores are very close to 1 for accuracy, recall, precision, and F1; your model is potentially overfitting, and this issue hasn't been examined or mentioned in your final report. Furthermore, there is 1 false negative (type II error) and 0 false positives (type I error) when evaluating your model on the test data. Given the very small proportion of poisonous mushrooms misclassified in your test data, I am not sure whether the reported performance is unbiased. Have you tested a 40-60 split, or different random states, and do you get the same result? (See the sketch after this list.) I am also wondering whether your model would score as well on other, similar data. Additionally, it's worth mentioning that you should aim to reduce type II errors over type I errors, because otherwise the result might put someone's life in danger by predicting poisonous mushrooms as edible ones.
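A minimal sketch of such a stability check, using synthetic stand-in data and illustrative names rather than the project's actual scripts:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the mushroom features (class 1 = poisonous);
# swap in the project's preprocessed data to run the real check.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Repeat a 40-60 split with different random states; stable scores
# suggest the near-perfect results are not an artifact of one split.
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.6, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"seed={seed}: recall={recall_score(y_te, clf.predict(X_te)):.3f}")
```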

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

GloriaWYY commented 2 years ago

Data analysis review checklist

Reviewer: @GloriaWYY

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Impressive work is shown in this group project. The project repository is well organized: resources are properly named and can be easily accessed. The complete_eda.ipynb is well done and follows the checklist, which shows that you understood your data well before doing any machine learning.

I would like to bring some issues to your attention to help you improve your project:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

suuuuperNOVA commented 2 years ago

Data analysis review checklist

Reviewer: @suuuuperNOVA

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 30 minutes

Review Comments:

  1. Table 2 and Table 3, the confusion matrices, seem confusing to me. I think it would be better to clarify which side shows the predicted values and which side shows the true values (see the sketch after this list).
  2. As we've learned in DSCI 531, when presenting a graph it's better not to have '_' in the labels. The labels need to be modified.
  3. Figure 2 needs a reference.
  4. I am interested in the coefficients of the features, to see which are strong factors and which are weak. Some text about the trained model (also sketched below) could help the report tell a complete story.
  5. I saw that a dummy classifier is used as your baseline, and it does not perform well; I think introducing a model such as a decision tree classifier as another baseline would be more convincing in showing the power of the selected model.
  6. In the EDA, each feature is investigated, but it's unnecessary to show all of them; only the important discoveries need to be mentioned in the report. Also, the features are analyzed individually, while the correlation among features is essential too. I think it would be interesting to show the target class against pairs of features.
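On points 1 and 4, a minimal sketch of how the confusion matrix could be labeled and the coefficients inspected; all names and data here are illustrative, not the project's actual variables:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (class 1 = poisonous).
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Label both axes so readers can tell predicted from true values.
cm = pd.DataFrame(
    confusion_matrix(y_te, clf.predict(X_te)),
    index=["actual: edible", "actual: poisonous"],
    columns=["predicted: edible", "predicted: poisonous"],
)
print(cm)

# Sort the coefficients to see which features push predictions hardest.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
print(pd.Series(clf.coef_[0], index=feature_names).sort_values())
```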

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

vtaskaev1 commented 2 years ago

Data analysis review checklist

Reviewer: vtaskaev1

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:


Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

dol23asuka commented 2 years ago

From GloriaWYY's comment:

"In src/preprocessor.py, you fit and transform the entire training set, and then pass this model to cross_validation.py to assess the model's performance. This would potentially break the Golden Rule because your preprocessor learns information from the whole training set, which means during cross-validation, information leaks from cross-validation split. This can be solved by removing preprocessor.fit_transform(df1, df2); you do not need this because with your pipeline defined in cross_validation.py, cross-validation will be performed properly and automatically for you, and you do not need to manually transform the data beforehand."

We commented out the line of code preprocessor.fit_transform(df1, df2) so it will not be run. Please check here: https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor/blob/main/src/preprocessor.py
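For context, the leak-free pattern described above keeps the preprocessor inside a pipeline so each cross-validation fold refits it on that fold's training portion only. A minimal sketch, with made-up column names rather than the project's actual setup:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative stand-in for the mushroom training data.
train_df = pd.DataFrame({
    "cap_shape": ["bell", "flat", "bell", "conical"] * 25,
    "odor": ["none", "foul", "almond", "foul"] * 25,
    "is_poisonous": [0, 1, 0, 1] * 25,
})
X, y = train_df.drop(columns=["is_poisonous"]), train_df["is_poisonous"]

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["cap_shape", "odor"]),
)

# No manual fit_transform beforehand: cross_validate refits the whole
# pipeline (preprocessor included) on each fold's training split, so
# no information leaks across the cross-validation splits.
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
scores = cross_validate(pipe, X, y, cv=5, scoring="recall")
print(scores["test_score"])
```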

dol23asuka commented 2 years ago

All four reviewers mentioned that our project was lacking automation. This is because we had not finished the Makefile when they were reviewing our work.

We have now added the Makefile and Dockerfile and written detailed instructions on how to use them under the Usage section of README.md; please check here:

https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor

Kylemaj commented 2 years ago

Thank you for the feedback. I will update this post as issues are addressed.

Regarding points in comment 1

  1. I have fixed the environment link in our README file.
  2. Due to very inconsistent LaTeX rendering on different machines, we have decided to display the affected figures with simpler syntax.
  3. I have reviewed our final report and believe that most of these points were addressed in the Prediction and Limitations & Assumptions sections, with the exception of overfitting. I have updated the report to include our thoughts on overfitting, along with several small edits that will hopefully add clarity to the sections mentioned.

Regarding points in comment 2

  1. Wherever possible, I have condensed our README into a more concise, high-level summary.

Regarding checklist items from multiple comments

  1. I have added several tests to our scripts.

mahm00d27 commented 2 years ago

I thankfully acknowledge the reviewers for taking the time to comment. We have tried our best to address the comments and have added some explanation in the "Limitation and conclusion" section to address the real-life implications of the model.