Kylemaj opened this issue 2 years ago
Overall, the project is well executed, and the final report clearly states the objective, the data used, and the methodology for carrying out the prediction, as well as the results and limitations. There are also plenty of references to related research, which make the case solid. Good job team!
- env-mushroom.yaml is created for audiences to reproduce the data project; however, the Python packages listed in README.md and in env-mushroom.yaml are redundant.
- I also suggest adding a brief, high-level 1-2 sentence description of the project to the 'About' section at the top right-hand side of your repository.
- CONTRIBUTING.md is broken. I suggest asking contributors to fork the repository first and then create a pull request for their contributions.
- README.md doesn't summarize how well your model is performing, and I found the amount of information provided a little overwhelming. Would it be redundant with your final report? I suggest briefly introducing your data initiative and the model's performance, e.g. $$\text{recall} = \frac{TP}{TP+FN} = \frac{3152}{3152+2} \approx 0.99$$ (see the short sketch below).
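A small, self-contained sketch of how such a performance summary could be computed with scikit-learn (the toy labels below are illustrative and not the project's actual test set):

```python
# Quick check of the recall figure quoted above (TP = 3152, FN = 2),
# plus the scikit-learn call a README summary could quote.
from sklearn.metrics import recall_score

tp, fn = 3152, 2
print(tp / (tp + fn))  # ~0.999, i.e. approximately 0.99

# Toy example with string labels: p = poisonous (positive class), e = edible.
y_true = ["p", "p", "e", "p", "e"]
y_pred = ["p", "e", "e", "p", "e"]
print(recall_score(y_true, y_pred, pos_label="p"))  # 2 / (2 + 1) = 0.67
```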
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Impressive work is shown in this group project. The project repository is well organized, with resources properly named and easy to access. The complete_eda.ipynb is well done and follows the checklist, which shows that you understood your data well before doing any machine learning.
I would like to bring some issues to your attention to help you improve your project:
README.md is very informative, but I would suggest giving a brief summary of your project instead of the detailed Background, Introduction, and Data sections (which are exactly what you have in your final report). You can take a look at the example repository breast_cancer_predictor.

Looking at the src and results folders, it seems that I cannot locate how some of the artifacts are generated (e.g. I assume results/img/correlation.png is probably generated through pandas profiling; however, it is not produced explicitly in a script). This might make some of the results or analysis unreproducible, which is why I did not check the Automation box in the Reproducibility section. I suggest using Make to automate your complete analysis and checking whether some intermediates are missing.
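As a rough illustration only, correlation.png could be produced explicitly by a small script along these lines (the input path and the idea of correlating one-hot-encoded columns are assumptions, not the project's actual code):

```python
# Sketch: generate results/img/correlation.png from a script instead of
# extracting it from a pandas-profiling report. Paths and columns are assumed.
import pandas as pd
import matplotlib.pyplot as plt

train_df = pd.read_csv("data/processed/train.csv")   # assumed path
corr = pd.get_dummies(train_df, dtype=int).corr()    # categorical data -> correlate dummies

fig, ax = plt.subplots(figsize=(10, 8))
im = ax.matshow(corr, cmap="coolwarm")
fig.colorbar(im)
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90, fontsize=6)
ax.set_yticklabels(corr.columns, fontsize=6)
fig.savefig("results/img/correlation.png", dpi=150, bbox_inches="tight")
```

A Make rule could then call this script so the figure is rebuilt whenever the training data changes.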
In src/preprocessor.py, you fit and transform the entire training set and then pass this model to cross_validation.py to assess the model's performance. This could break the Golden Rule, because your preprocessor learns information from the whole training set, which means that during cross-validation information leaks from the validation splits. This can be solved by removing preprocessor.fit_transform(df1, df2); you do not need it, because with your pipeline defined in cross_validation.py, cross-validation will handle the preprocessing properly and automatically for you, and you do not need to transform the data manually beforehand.
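A minimal sketch of what this looks like when the preprocessing lives inside the pipeline passed to cross_validate (the file path, column names, and LogisticRegression estimator are illustrative assumptions, not the project's actual code):

```python
# Because the OneHotEncoder is part of the pipeline, cross_validate refits it
# on each training fold only, so nothing leaks from the validation split.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train_df = pd.read_csv("data/processed/train.csv")   # assumed path
X_train = train_df.drop(columns=["target"])          # assumed column name
y_train = train_df["target"]                          # labels "e" / "p"

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), X_train.columns.tolist())
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# No separate preprocessor.fit_transform(...) call is needed before this.
scoring = {
    "recall": make_scorer(recall_score, pos_label="p"),
    "precision": make_scorer(precision_score, pos_label="p"),
}
cv_results = cross_validate(pipe, X_train, y_train, cv=5, scoring=scoring)
print(pd.DataFrame(cv_results).mean())
```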
I also have some questions about your EDA plots of target='e' and target='p'. Firstly, as we learned in DSCI531, would it be better to group the targets so that each target class has the same baseline, or to use stack=False to have one bar overlaying the other (a rough sketch follows this paragraph)? Secondly, for almost all features there are many categories with only a few counts, which raises my concern about whether your model can learn something meaningful from this limited amount of data for those categories; will overfitting be a problem? Lastly, I noticed that in the Limitation section you look back at the data to see whether there is class imbalance. Instead of doing this, it would usually be better to include the distribution of your targets in the EDA phase, which would inform you of class imbalance if it exists.
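On the first point, a rough Altair sketch of the stack=False option (the feature name, data path, and encodings are illustrative assumptions, not the project's actual plot):

```python
# Sketch: overlay semi-transparent bars for the two target classes instead of
# stacking them, so both classes share the same baseline. Names are assumed.
import altair as alt
import pandas as pd

mushrooms = pd.read_csv("data/processed/train.csv")   # assumed path

chart = (
    alt.Chart(mushrooms)
    .mark_bar(opacity=0.6)
    .encode(
        x=alt.X("odor:N", title="odor"),               # assumed feature
        y=alt.Y("count():Q", stack=False),             # overlay instead of stack
        color=alt.Color("target:N", title="target (e / p)"),
    )
)
chart.save("results/img/odor_counts.html")
```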
In src/complete_eda.ipynb, eda_plot.png is not scaled to a preview-friendly size, though it looks good in your final report.

This was derived from the JOSE review checklist and the ROpenSci review checklist.
- I had trouble running the src/cross_validation.py script, though this may be due to something to investigate on my end.
- I did not tick the Automation checkbox because the Final Report creation does not appear to be automated.
- Consider referring to the target classes as Edible and Poisonous, rather than by their associated numeric binary values, for clarity.

This was derived from the JOSE review checklist and the ROpenSci review checklist.
From GloriaWYY's comment:
"In src/preprocessor.py, you fit and transform the entire training set, and then pass this model to cross_validation.py to assess the model's performance. This would potentially break the Golden Rule because your preprocessor learns information from the whole training set, which means during cross-validation, information leaks from cross-validation split. This can be solved by removing preprocessor.fit_transform(df1, df2); you do not need this because with your pipeline defined in cross_validation.py, cross-validation will be performed properly and automatically for you, and you do not need to manually transform the data beforehand."
We commented out the line of code preprocessor.fit_transform(df1, df2) so it will not be run. Please check here: https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor/blob/main/src/preprocessor.py
All four reviewers mentioned that our project was lacking automation. This is because we had not finished the Makefile when they were reviewing our work.
We have now added the Makefile and Dockerfile and written detailed instructions on how to use them under the Usage section of README.md; please check here:
Thank you for the feedback. I will update this post as issues are addressed.
Regarding points in comment 1
Regarding points in comment 2
Regarding checklist items from multiple comments
I gratefully acknowledge the reviewers for taking the time to comment. We have tried our best to address the comments. We also added some explanation in the "Limitation and conclusion" section to address the real-life implications of the model.
Submitting authors: @dol23asuka @Kylemaj @Mahm00d27
Repository: https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor
Report link: https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor/blob/main/doc/Poisonous_Mushroom_Predictor_Report.md
Abstract/executive summary: Mushrooms have distinctive characteristics which help in identifying whether they are poisonous or edible. In this project we have built a logistic regression classification model which uses several morphological characteristics of mushrooms to predict whether an observed mushroom is toxic or edible (non-toxic). Exploratory data analysis revealed definite distinctions between our target classes, as well as highlighting several key patterns which could serve as strong predictors. On the test data set of 1,625 observations our model performed extremely well, with a 99% recall score and a 100% precision score. The model correctly classified 863 edible and 761 toxic mushrooms. One false negative result was produced (a toxic mushroom identified as non-toxic). In the context of this problem, a false negative could result in someone being seriously or even fatally poisoned. We must therefore be far more concerned with minimizing false negatives than false positives. Given this, we may consider tuning the threshold of our model in order to minimize false negatives at the potential cost of increasing false positives. Moving forward, we would like to further optimize our model, investigating whether we could achieve similar performance with fewer features. Finally, we would like to evaluate how our model performs on real observations from the field rather than hypothetical data.
Editor: @flor14
Reviewers: Cui_Vera, Ye_Wanying, Taskaev_Vadim, Lv_Kingslin