UBC-MDS / data-analysis-review-2022


Submission: Group 7: Cervical Cancer Predictor #10

Open samson-bakos opened 1 year ago

samson-bakos commented 1 year ago

Submitting authors: @samson-bakos, @morrismanfung, @WaielonH, @revathyponn

Repository: https://github.com/UBC-MDS/cervical-cancer-predictor

Report link: https://raw.githubusercontent.com/UBC-MDS/cervical-cancer-predictor/main/Analysis_Docs/Analysis.html

Abstract/executive summary:

Our research question is whether we can diagnose or stratify risk for cervical cancer based on lifestyle factors, sexual history, and comorbidities using a machine learning model. Cervical cancer is a potential long-term complication of the STI human papillomavirus (HPV). It is extremely prevalent among women in low-income countries in South America and Africa, where health services are not as readily available and proactive screening is not as common. As such, there is a niche for an early-identification diagnostic tool based on simple medical/lifestyle history.

The data is composed of survey results and medical records for 858 female patients from 'Hospital Universitario de Caracas' in Caracas, Venezuela, alongside the results of four traditional diagnostic tests (e.g. biopsy), collected by Fernandes et al. Features include sexual history (such as number of partners, marital status, etc.), general carcinogens such as smoking, as well as STI comorbidities. Missing data is an issue for many responses due to the personal nature of some questions, further limiting our already small sample size.

We evaluated multiple ML models and tuned promising candidates before running them against the test set. We set an evaluation criterion of a minimum 0.8 recall at an operating point of 0.28 precision. This precision is set deliberately low (at only twice the population diagnosis rate) to allow for maximum recall, due to the significant danger of type II error, i.e. failing to identify a case of cervical cancer. Unfortunately, no model was able to meet our diagnostic criterion. The best model at the operating point was Naive Bayes with a recall of 0.58, and the best overall model across all thresholds was random forest with AUC = 0.6458333.
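
As an illustration of the evaluation criterion, the sketch below shows how the best recall at a fixed precision floor can be read off a precision-recall curve with scikit-learn. The labels and scores are random placeholders, not the project's data or code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder validation labels and predicted probabilities; in the real analysis
# these would come from the validation split and a fitted model's predict_proba.
rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, size=200)
valid_scores = rng.random(200)

precision, recall, thresholds = precision_recall_curve(y_valid, valid_scores)
mask = precision >= 0.28          # operating points meeting the precision floor
best_recall = recall[mask].max()  # criterion is met if this reaches 0.8
print(f"Best recall at precision >= 0.28: {best_recall:.2f}")
```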

Further development could be pursued by investigating other models that may more accurately capture the data/target relationship, as well as through feature engineering and transformation. Until then, our models may have some application as an ensemble risk-flagging system, but they are by no means able to replace ordinary medical diagnostic testing at this stage.

Editor: @flor14
Reviewers: Vincent Ho, Renee Kwon, Peng Zhang, Crystal Geng

THF-d8 commented 1 year ago

Data analysis review checklist

Reviewer: @THF-d8 (Crystal Geng)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

  1. The project is very well done, with an excellent background introduction, a well-thought-out research approach, and a beautifully written analysis report.
  2. I particularly like that the threshold was selected to maximize recall and keep false negatives low, since false negatives are especially harmful when predicting cervical cancer!
  3. The main function in the model_training.py script is a little long. It might be better to split this function into several smaller functions, one corresponding to each model, and call those functions from main() (see the sketch after this list). Another way to approach this could be creating a separate script for training each model. There is nothing wrong with the current layout; my suggestion is only for the purpose of improving readability.
  4. There are some error messages in the cervical_cancer_data_eda.ipynb file, where it says background_gradient requires matplotlib, potentially due to matplotlib not being imported.
  5. It might be good to have some subdirectories in the results folder to keep the files more organized, e.g., one for PR curve files, one for threshold files, etc.
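
To illustrate item 3, below is a minimal sketch of one way main() could delegate to one small helper per model. The function names and overall structure are illustrative only, not the actual layout of model_training.py.

```python
# Minimal sketch: one small training helper per model, called from main().
# Names and structure are illustrative, not the project's actual code.
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier


def train_naive_bayes(X_train, y_train):
    """Fit the Naive Bayes candidate on the training split."""
    return GaussianNB().fit(X_train, y_train)


def train_random_forest(X_train, y_train):
    """Fit the random forest candidate on the training split."""
    return RandomForestClassifier(random_state=123).fit(X_train, y_train)


def main(X_train, y_train):
    """Delegate to one small function per model instead of one long body."""
    return {
        "naive_bayes": train_naive_bayes(X_train, y_train),
        "random_forest": train_random_forest(X_train, y_train),
    }
```

On item 4, note that pandas' Styler.background_gradient imports matplotlib internally, so that error message usually indicates matplotlib is not available in the notebook's environment; adding it to the environment file should clear the messages.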

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

pengzh313 commented 1 year ago

Data analysis review checklist

Reviewer: pengzh313 (Peng Zhang)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2.5

Review Comments:

  1. The main purpose of a Contributing file is to provide guidelines on how others can contribute to your project. The second paragraph of the Contributing.md file reads more like a procedure for your internal project members to handle GitHub branching and pull requests. You may want to move this section to your team work contract or procedure document.

  2. The authors’ names have not been included in the analysis report.

  3. The dataset has 4 different target variables. In your EDA report, the combination of the four target variables into a single binary target variable is clearly explained. However, this key information was not mentioned in your analysis report. You may consider explicitly indicating this modification within the Methods section of the analysis report.

  4. Some features in the original dataset have a significant number of missing values. In the EDA report you have described this and visualized the missing-value distribution for binary/categorical features in Figure 2. However, we cannot clearly see the full picture of the missing-value issue across all features. For example, most data are missing for two numeric features, “STDs: Time since first diagnosis” and “STDs: Time since last diagnosis”. You could use the Pandas info() function to show the number of null values for each feature (see the sketch after this list).

  5. By checking the model_training.py file, I noticed that you specifically selected features from the original dataset to train your models. However, it seems this feature selection was not mentioned in the analysis report. I would suggest briefly explaining your criteria for selecting these features and why the other features were not utilized in the model training.

  6. In the references section of the analysis report, you may consider adding reference information for the Python programming language and any Python packages that were used for the analysis.
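
For item 4, a short sketch of how the per-feature missing-value counts could be shown. The file path here is an assumption, and "?" is the missing-value marker used in the UCI version of this dataset.

```python
import pandas as pd

# Assumed file name/path; the project's actual data location may differ.
df = pd.read_csv("data/raw/risk_factors_cervical_cancer.csv", na_values="?")

# Missing values per feature, highlighting columns such as
# "STDs: Time since first diagnosis" and "STDs: Time since last diagnosis".
missing_counts = df.isna().sum().sort_values(ascending=False)
print(missing_counts.head(10))

# df.info() also reports the non-null count and dtype of every column.
df.info()
```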

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

vincentho32 commented 1 year ago

Data analysis review checklist

Reviewer: vincentho32 (Vincent Ho)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

  1. You may consider rounding the numerical values in the test set scores table to 2 decimal places so that it is more readable (see the sketch after this list).

  2. The report should include a list of authors with your affiliations.

  3. You may consider creating sub-folders for different categories of files for the sake of an organized structure. For example, under the src folder, you could create an eda folder for all EDA-related files.

  4. It would be better if you could make Figure 1: EDA for Binary/Categorical Features larger so that it is more readable.

  5. Overall, it is a very good project, but in my opinion it would be better to add more detail to the Conclusion and Next Steps section, for example, how you would approach feature engineering in this case.
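
For item 1, rounding can be done in one step with pandas before the table is written out. The variable name and the values below are placeholders for illustration, not the project's actual scores.

```python
import pandas as pd

# Placeholder test set scores table (illustrative values only).
test_scores = pd.DataFrame(
    {"model": ["Naive Bayes", "Random Forest"],
     "recall": [0.583333, 0.416667],
     "AUC": [0.612345, 0.645833]}
)

# Round every numeric column to 2 decimal places for the report.
print(test_scores.round(2))
```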

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

renee-kwon commented 1 year ago

Data analysis review checklist

Reviewer: @renee-kwon (Renee Kwon)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~1.5

Review Comments:


Congratulations on building a model from start to finish and communicating your results in a clear and concise manner. Well done, group 7!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

samson-bakos commented 1 year ago

Feedback + Fix Commits:

We fixed a lot of issues this week; these are just a few of them. For the full list, please see the issue linked at the bottom of this comment.

List Authors in Analysis Doc: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/1b233a8e7cadbfab10e527b2954cf14624601f89

Add .md version of the analysis to render on Github: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/1b233a8e7cadbfab10e527b2954cf14624601f89

Round values in results table for more readability: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/670cc7004fba02fc463fe95336821ec945e4d243

Improve readability / make copying scripts easier in the README: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/6f202542287a3d04037d9d49a281a9ee8d0f6898

Add a script for Tidyverse installation in case it is not present on the reproducing device (tricky to do in the environment file according to Florencia; easier to hard-install using a script): https://github.com/UBC-MDS/cervical-cancer-predictor/commit/6f202542287a3d04037d9d49a281a9ee8d0f6898

Add context on why we are minimizing Type II error in the analysis report + a brief explanation of the automation packaging: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/0dc9c00d26e6f479a6965221b999adfd859d7791

For the full list of fixes made this week, see this issue: https://github.com/UBC-MDS/cervical-cancer-predictor/issues/41