UBC-MDS / data-analysis-review-2022


Submission: Group 7: Cervical Cancer Predictor #10

Open samson-bakos opened 1 year ago

samson-bakos commented 1 year ago

Submitting authors: @samson-bakos, @morrismanfung, @WaielonH, @revathyponn

Repository: https://github.com/UBC-MDS/cervical-cancer-predictor

Report link: https://raw.githubusercontent.com/UBC-MDS/cervical-cancer-predictor/main/Analysis_Docs/Analysis.html

Abstract/executive summary:

Our research question is whether we can diagnose or stratify risk for cervical cancer based on lifestyle factors, sexual history, and comorbidities using a machine learning model. Cervical cancer is a potential long-term complication of the STI human papillomavirus (HPV). It is extremely prevalent among women in low-income countries in South America and Africa, where health services are not as readily available and proactive screening is not as common. As such, there is a niche for an early-identification diagnostic tool based on simple medical/lifestyle history.

The data is composed of survey results and medical records for 858 female patients from 'Hospital Universitario de Caracas' in Caracas, Venezuela, alongside the results of four traditional diagnostic tests (e.g. biopsy), collected by Fernandes et al. Features include sexual history (such as number of partners, marital status, etc.), general carcinogens such as smoking, as well as STI comorbidities. Missing data is an issue for many responses due to the personal nature of some questions, further limiting our already small sample size.

We evaluated multiple ML models and tuned promising candidates before running them against the test set. We set an evaluation criterion of a minimum 0.8 recall at an operating point of 0.28 precision. This precision is set deliberately low (at only twice the population diagnosis rate) to allow for maximum recall, due to the significant danger of type II error, i.e. failing to identify a case of cervical cancer. Unfortunately, no model was able to meet our diagnostic criterion. The best model at the operating point was Naive Bayes with a recall of 0.58, and the best overall model across all thresholds was random forest with AUC = 0.6458333.
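
As an illustration of the evaluation criterion, the sketch below shows how the best recall at a fixed precision floor can be read off a precision-recall curve with scikit-learn. The labels and scores are random placeholders, not the project's data or code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder validation labels and predicted probabilities; in the real analysis
# these would come from the validation split and a fitted model's predict_proba.
rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, size=200)
valid_scores = rng.random(200)

precision, recall, thresholds = precision_recall_curve(y_valid, valid_scores)
mask = precision >= 0.28          # operating points meeting the precision floor
best_recall = recall[mask].max()  # criterion is met if this reaches 0.8
print(f"Best recall at precision >= 0.28: {best_recall:.2f}")
```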

Further development could be pursued by investigating other models that may more accurately capture the data/target relationship, as well as through feature engineering and transformation. Until then, our models may have some application as an ensemble risk-flagging system, but they are by no means able to replace ordinary medical diagnostic testing at this stage.

Editor: @flor14
Reviewers: Vincent Ho, Renee Kwon, Peng Zhang, Crystal Geng

THF-d8 commented 1 year ago

Data analysis review checklist

Reviewer: @THF-d8 (Crystal Geng)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

  1. The project is very well done, with an excellent background introduction, a well-thought-out research approach, and a beautifully written analysis report.
  2. I particularly like that the threshold was selected to maximize recall and keep false negatives low, since false negatives are especially harmful when predicting cervical cancer!
  3. The main function in the model_training.py script is a little long. It might be better to split this function into several smaller functions, one corresponding to each model, and call those functions from main() (see the sketch after this list). Another way to approach this could be creating a separate script for training each model. There is nothing wrong with the current layout; my suggestion is only for the purpose of improving readability.
  4. There are some error messages in the cervical_cancer_data_eda.ipynb file, where it says background_gradient requires matplotlib, potentially due to matplotlib not being imported.
  5. It might be good to have some subdirectories in the results folder to keep the files more organized, e.g., one for PR curve files, one for threshold files, etc.
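
To illustrate item 3, below is a minimal sketch of one way main() could delegate to one small helper per model. The function names and overall structure are illustrative only, not the actual layout of model_training.py.

```python
# Minimal sketch: one small training helper per model, called from main().
# Names and structure are illustrative, not the project's actual code.
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier


def train_naive_bayes(X_train, y_train):
    """Fit the Naive Bayes candidate on the training split."""
    return GaussianNB().fit(X_train, y_train)


def train_random_forest(X_train, y_train):
    """Fit the random forest candidate on the training split."""
    return RandomForestClassifier(random_state=123).fit(X_train, y_train)


def main(X_train, y_train):
    """Delegate to one small function per model instead of one long body."""
    return {
        "naive_bayes": train_naive_bayes(X_train, y_train),
        "random_forest": train_random_forest(X_train, y_train),
    }
```

On item 4, note that pandas' Styler.background_gradient imports matplotlib internally, so that error message usually indicates matplotlib is not available in the notebook's environment; adding it to the environment file should clear the messages.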

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

pengzh313 commented 1 year ago

Data analysis review checklist

Reviewer: pengzh313 (Peng Zhang)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2.5

Review Comments:

  1. The main purpose of a Contributing file is to provide guidelines on how others can contribute to your project. The second paragraph of the Contributing.md file reads more like a procedure for your internal project members to handle GitHub branching and pull requests. You may want to move this section to your team work contract or procedure document.

  2. The authors’ names have not been included in the analysis report.

  3. The dataset has 4 different target variables. In your EDA report, the combination of the four target variables into a single binary target variable is clearly explained. However, this key information was not mentioned in your analysis report. You may consider explicitly indicating this modification within the Methods section of the analysis report.

  4. Some features in the original dataset have a significant number of missing values. In the EDA report you have described this and visualized the missing-value distribution for binary/categorical features in Figure 2. However, we cannot clearly see the full picture of the missing-value issue across all features. For example, most data are missing for two numeric features, “STDs: Time since first diagnosis” and “STDs: Time since last diagnosis”. You could use the Pandas info() function to show the number of null values for each feature (see the sketch after this list).

  5. By checking the model_training.py file, I noticed that you specifically selected features from the original dataset to train your models. However, it seems this feature selection was not mentioned in the analysis report. I would suggest briefly explaining your criteria for selecting these features and why the other features were not utilized in the model training.

  6. In the references section of the analysis report, you may consider adding reference information for the Python programming language and any Python packages that were used for the analysis.
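
For item 4, a short sketch of how the per-feature missing-value counts could be shown. The file path here is an assumption, and "?" is the missing-value marker used in the UCI version of this dataset.

```python
import pandas as pd

# Assumed file name/path; the project's actual data location may differ.
df = pd.read_csv("data/raw/risk_factors_cervical_cancer.csv", na_values="?")

# Missing values per feature, highlighting columns such as
# "STDs: Time since first diagnosis" and "STDs: Time since last diagnosis".
missing_counts = df.isna().sum().sort_values(ascending=False)
print(missing_counts.head(10))

# df.info() also reports the non-null count and dtype of every column.
df.info()
```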

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

vincentho32 commented 1 year ago

Data analysis review checklist

Reviewer: vincentho32 (Vincent Ho)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

  1. You may consider rounding the numerical values in the test set scores table to 2 decimal places so that it is more readable (see the sketch after this list).

  2. The report should include a list of authors with your affiliations.

  3. You may consider creating sub-folders for different categories of files for the sake of an organized structure. For example, under the src folder, you could create an eda folder for all EDA-related files.

  4. It would be better if you could make Figure 1: EDA for Binary/Categorical Features larger so that it is more readable.

  5. Overall, it is a very good project, but in my opinion it would be better to add more detail to the Conclusion and Next Steps section, for example, how you would approach feature engineering in this case.
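
For item 1, rounding can be done in one step with pandas before the table is written out. The variable name and the values below are placeholders for illustration, not the project's actual scores.

```python
import pandas as pd

# Placeholder test set scores table (illustrative values only).
test_scores = pd.DataFrame(
    {"model": ["Naive Bayes", "Random Forest"],
     "recall": [0.583333, 0.416667],
     "AUC": [0.612345, 0.645833]}
)

# Round every numeric column to 2 decimal places for the report.
print(test_scores.round(2))
```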

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

renee-kwon commented 1 year ago

Data analysis review checklist

Reviewer: @renee-kwon (Renee Kwon)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~1.5

Review Comments:


Congratulations on building a model from start to finish and communicating your results in a clear and concise manner. Well done, group 7!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

samson-bakos commented 1 year ago

Feedback + Fix Commits:

We fixed a lot of issues this week; these are just a few of them. For the full list, please see the issue linked at the bottom of this comment.

List Authors in Analysis Doc: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/1b233a8e7cadbfab10e527b2954cf14624601f89

Add .md version of the analysis to render on Github: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/1b233a8e7cadbfab10e527b2954cf14624601f89

Round values in results table for more readability: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/670cc7004fba02fc463fe95336821ec945e4d243

Improve readability / make copying scripts easier in the README: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/6f202542287a3d04037d9d49a281a9ee8d0f6898

Add a script for Tidyverse installation in case it is not present on the reproducing device (tricky to do in the environment file according to Florencia; easier to hard-install using a script): https://github.com/UBC-MDS/cervical-cancer-predictor/commit/6f202542287a3d04037d9d49a281a9ee8d0f6898

Add context on why we are minimizing Type II error in the analysis report + a brief explanation of the automation packaging: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/0dc9c00d26e6f479a6965221b999adfd859d7791

For the full list of fixes made this week, see this issue: https://github.com/UBC-MDS/cervical-cancer-predictor/issues/41