Kylemaj opened this issue 2 years ago
Overall, the project is well executed, and the final report clearly states the objective, the data used, and the methodology for carrying out the prediction, as well as the results and limitations. There are also plenty of references to related research, which make the case solid. Good job team!
- env-mushroom.yaml is created for audiences to reproduce the data project; however, the Python packages listed in README.md and in env-mushroom.yaml are redundant.
- I also suggest adding a brief, high-level 1-2 sentence description of the project to the 'About' section at the top right-hand side of your repository.
- CONTRIBUTING.md is broken. I suggest asking contributors to fork the repository first and then create a pull request for their contributions.
- README.md doesn't summarize how well your model is performing, and I found the amount of information provided a little overwhelming. Would it be redundant with your final report? I suggest briefly introducing your data initiative and the model's performance, e.g. $$\text{recall} = \frac{TP}{TP+FN} = \frac{3152}{3152+2} \approx 0.99$$ (see the short sketch below).
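A small, self-contained sketch of how such a performance summary could be computed with scikit-learn (the toy labels below are illustrative and not the project's actual test set):

```python
# Quick check of the recall figure quoted above (TP = 3152, FN = 2),
# plus the scikit-learn call a README summary could quote.
from sklearn.metrics import recall_score

tp, fn = 3152, 2
print(tp / (tp + fn))  # ~0.999, i.e. approximately 0.99

# Toy example with string labels: p = poisonous (positive class), e = edible.
y_true = ["p", "p", "e", "p", "e"]
y_pred = ["p", "e", "e", "p", "e"]
print(recall_score(y_true, y_pred, pos_label="p"))  # 2 / (2 + 1) = 0.67
```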
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Impressive work is shown in this group project. The project repository is well organized, with resources properly named and easy to access. The complete_eda.ipynb is well done and follows the checklist, which shows that you understood your data well before doing any machine learning.
I would like to bring some issues to your attention to help you improve your project:
README.md is very informative, but I would suggest giving a brief summary of your project instead of the detailed Background, Introduction, and Data sections (which are exactly what you have in your final report). You can take a look at the example repository breast_cancer_predictor.

Looking at the src and results folders, it seems that I cannot locate how some of the artifacts are generated (e.g. I assume results/img/correlation.png is probably generated through pandas profiling; however, it is not produced explicitly in a script). This might make some of the results or analysis unreproducible, which is why I did not check the Automation box in the Reproducibility section. I suggest using Make to automate your complete analysis and checking whether some intermediates are missing.
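As a rough illustration only, correlation.png could be produced explicitly by a small script along these lines (the input path and the idea of correlating one-hot-encoded columns are assumptions, not the project's actual code):

```python
# Sketch: generate results/img/correlation.png from a script instead of
# extracting it from a pandas-profiling report. Paths and columns are assumed.
import pandas as pd
import matplotlib.pyplot as plt

train_df = pd.read_csv("data/processed/train.csv")   # assumed path
corr = pd.get_dummies(train_df, dtype=int).corr()    # categorical data -> correlate dummies

fig, ax = plt.subplots(figsize=(10, 8))
im = ax.matshow(corr, cmap="coolwarm")
fig.colorbar(im)
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90, fontsize=6)
ax.set_yticklabels(corr.columns, fontsize=6)
fig.savefig("results/img/correlation.png", dpi=150, bbox_inches="tight")
```

A Make rule could then call this script so the figure is rebuilt whenever the training data changes.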
In src/preprocessor.py, you fit and transform the entire training set and then pass this model to cross_validation.py to assess the model's performance. This could break the Golden Rule, because your preprocessor learns information from the whole training set, which means that during cross-validation information leaks from the validation splits. This can be solved by removing preprocessor.fit_transform(df1, df2); you do not need it, because with your pipeline defined in cross_validation.py, cross-validation will handle the preprocessing properly and automatically for you, and you do not need to transform the data manually beforehand.
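A minimal sketch of what this looks like when the preprocessing lives inside the pipeline passed to cross_validate (the file path, column names, and LogisticRegression estimator are illustrative assumptions, not the project's actual code):

```python
# Because the OneHotEncoder is part of the pipeline, cross_validate refits it
# on each training fold only, so nothing leaks from the validation split.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train_df = pd.read_csv("data/processed/train.csv")   # assumed path
X_train = train_df.drop(columns=["target"])          # assumed column name
y_train = train_df["target"]                          # labels "e" / "p"

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), X_train.columns.tolist())
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# No separate preprocessor.fit_transform(...) call is needed before this.
scoring = {
    "recall": make_scorer(recall_score, pos_label="p"),
    "precision": make_scorer(precision_score, pos_label="p"),
}
cv_results = cross_validate(pipe, X_train, y_train, cv=5, scoring=scoring)
print(pd.DataFrame(cv_results).mean())
```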
I also have some questions about your EDA plots of target='e' and target='p'. Firstly, as we learned in DSCI531, would it be better to group the targets so that each target class has the same baseline, or to use stack=False to have one bar overlaying the other (a rough sketch follows this paragraph)? Secondly, for almost all features there are many categories with only a few counts, which raises my concern about whether your model can learn something meaningful from this limited amount of data for those categories; will overfitting be a problem? Lastly, I noticed that in the Limitation section you look back at the data to see whether there is class imbalance. Instead of doing this, it would usually be better to include the distribution of your targets in the EDA phase, which would inform you of class imbalance if it exists.
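On the first point, a rough Altair sketch of the stack=False option (the feature name, data path, and encodings are illustrative assumptions, not the project's actual plot):

```python
# Sketch: overlay semi-transparent bars for the two target classes instead of
# stacking them, so both classes share the same baseline. Names are assumed.
import altair as alt
import pandas as pd

mushrooms = pd.read_csv("data/processed/train.csv")   # assumed path

chart = (
    alt.Chart(mushrooms)
    .mark_bar(opacity=0.6)
    .encode(
        x=alt.X("odor:N", title="odor"),               # assumed feature
        y=alt.Y("count():Q", stack=False),             # overlay instead of stack
        color=alt.Color("target:N", title="target (e / p)"),
    )
)
chart.save("results/img/odor_counts.html")
```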
In src/complete_eda.ipynb, eda_plot.png is not scaled to a preview-friendly size, though it looks good in your final report.

This was derived from the JOSE review checklist and the ROpenSci review checklist.
- I had trouble running the src/cross_validation.py script, though this may be due to something to investigate on my end.
- I did not tick the Automation checkbox because the Final Report creation does not appear to be automated.
- Consider referring to the target classes as Edible and Poisonous, rather than by their associated numeric binary values, for clarity.

This was derived from the JOSE review checklist and the ROpenSci review checklist.
From GloriaWYY's comment:
"In src/preprocessor.py, you fit and transform the entire training set, and then pass this model to cross_validation.py to assess the model's performance. This would potentially break the Golden Rule because your preprocessor learns information from the whole training set, which means during cross-validation, information leaks from cross-validation split. This can be solved by removing preprocessor.fit_transform(df1, df2); you do not need this because with your pipeline defined in cross_validation.py, cross-validation will be performed properly and automatically for you, and you do not need to manually transform the data beforehand."
We commented out the line of code preprocessor.fit_transform(df1, df2) so it will not be run. Please check here: https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor/blob/main/src/preprocessor.py
All four reviewers mentioned that our project was lacking automation. This is because we had not finished the Makefile when they were reviewing our work.
We have now added the Makefile and Dockerfile and written detailed instructions on how to use them under the Usage section of README.md; please check here:
Thank you for the feedback. I will update this post as issues are addressed.
Regarding points in comment 1
Regarding points in comment 2
Regarding checklist items from multiple comments
I gratefully acknowledge the reviewers for taking the time to comment. We have tried our best to address the comments. We also added some explanation in the "Limitation and conclusion" section to address the real-life implications of the model.
Submitting authors: @dol23asuka @Kylemaj @Mahm00d27
Repository: https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor
Report link: https://github.com/UBC-MDS/Poisonous_Mushroom_Predictor/blob/main/doc/Poisonous_Mushroom_Predictor_Report.md
Abstract/executive summary: Mushrooms have distinctive characteristics which help in identifying whether they are poisonous or edible. In this project we have built a logistic regression classification model which uses several morphological characteristics of mushrooms to predict whether an observed mushroom is toxic or edible (non-toxic). Exploratory data analysis revealed definite distinctions between our target classes, as well as highlighting several key patterns which could serve as strong predictors. On the test data set of 1,625 observations our model performed extremely well, with a 99% recall score and a 100% precision score. The model correctly classified 863 edible and 761 toxic mushrooms. One false negative result was produced (a toxic mushroom identified as non-toxic). In the context of this problem, a false negative could result in someone being seriously or even fatally poisoned. We must therefore be far more concerned with minimizing false negatives than false positives. Given this, we may consider tuning the threshold of our model in order to minimize false negatives at the potential cost of increasing false positives. Moving forward, we would like to further optimize our model, investigating whether we could achieve similar performance with fewer features. Finally, we would like to evaluate how our model performs on real observations from the field rather than hypothetical data.
Editor: @flor14
Reviewers: Cui_Vera, Ye_Wanying, Taskaev_Vadim, Lv_Kingslin