This was derived from the JOSE review checklist and the ROpenSci review checklist.
The main purpose of a Contributing file is to provide guidelines on how others can contribute to your project. The second paragraph of the Contributing.md file reads more like a procedure for your internal project members for handling GitHub branching and pull requests. You may want to move this section to your team work contract or a procedure document.
The authors' names have not been included in the analysis report.
The dataset has 4 different target variables. In your EDA report, the combination of the four target variables into a single binary target variable has been clearly explained. However, this key information is not mentioned in your analysis report. You may consider explicitly describing the modification in the Methods section of the analysis report.
Some features in the original dataset have a significant number of missing values. In the EDA report you have described this and also visualized the distribution of missing values for binary/categorical features in Figure 2. However, we cannot clearly see the full picture of the missing-value issue across all features. For example, most values are missing for two numeric features, “STDs: Time since first diagnosis” and “STDs: Time since last diagnosis”. You could use the Pandas info() method (or isnull().sum()) to report the number of missing values for each feature.
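As a rough sketch (assuming the data is loaded into a pandas DataFrame; the file path and the na_values setting below are only assumptions about how the raw CSV encodes missing entries):

```python
import pandas as pd

# Assumed path and missing-value marker; adjust to match the actual raw data.
risk_factors = pd.read_csv("data/risk_factors_cervical_cancer.csv", na_values="?")

# info() reports the non-null count and dtype for every column.
risk_factors.info()

# A direct per-feature count of missing values, worst offenders first
# (e.g. "STDs: Time since first diagnosis").
print(risk_factors.isnull().sum().sort_values(ascending=False))
```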
While checking the model_training.py file, I noticed that you selected a specific subset of features from the original dataset to train your models. However, this feature selection does not seem to be mentioned in the analysis report. I would suggest briefly explaining your criteria for selecting these features and why the other features were not used in model training.
In the references section of the analysis report, you may consider adding references for the Python programming language and any Python packages that were used for the analysis.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
You may consider rounding the numerical values in the test set scores table to 2 decimal places so that it is more readable.
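For instance, a minimal sketch (the DataFrame name and the numbers below are purely illustrative placeholders, not your actual results):

```python
import pandas as pd

# Illustrative placeholder table; the real one comes from the evaluation script.
test_scores = pd.DataFrame(
    {"NB_opt": [0.123456, 0.654321], "RFC_opt": [0.234567, 0.765432]},
    index=["precision", "recall"],
)

# Round every cell to 2 decimal places before rendering the table.
print(test_scores.round(2))
```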
The report should include a list of authors with their affiliations.
You may consider creating sub-folders for different categories of files to keep the repository organized. For example, under the src folder, you could create an eda folder for all EDA-related files.
It would be better if you could make Figure 1 (EDA for Binary/Categorical Features) larger so that it is more readable.
Overall, it is a very good project, but in my opinion it would be better to add more detail to the Conclusion and Next Steps section, for example, how you would approach feature engineering in this case.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Great project! The project was easy to understand and well structured, despite the complex and scientific context (although I still had to google "comorbidities"). I enjoyed reading about the background to the problem at hand and the explanation of why your data had many missing values was a nice touch.
In the Analysis Report, you explained why you chose recall as your metric: to reduce false negatives. I think it would be helpful to the reader to explain this in the context of your data: "A false negative would be dangerous because... compared to a false positive which... " A confusion matrix visualization may also help readers understand the concepts of Type I and Type II errors.
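As a sketch of what that could look like (using scikit-learn's ConfusionMatrixDisplay; the labels and predictions below are placeholders, not your actual model output):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Placeholder labels and predictions; in the report these would come from
# the held-out test set and a fitted classifier (e.g. the tuned RFC).
y_test = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=["No cancer", "Cancer"]
)
plt.title("Type II errors (false negatives) are the bottom-left cell")
plt.show()
```

Pairing the plot with a sentence pointing out which cell corresponds to a missed diagnosis would make the stakes of a Type II error concrete for readers.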
The analysis is missing the authors of the report, and it was difficult to read the axis names on the distribution plots, even when I zoomed in. Perhaps the figure could be widened so the text is clear. Also, in the table of scores for your three optimized models, the column names could be changed from "RFC_opt" to just "RFC model" to improve readability and visuals.
In the references section of the analysis report (.bib), I would suggest adding the packages used in any portion of the project; for example, tidyverse was used in writing the analysis.
If you add github_document as an output format of analysis.rmd, readers can get a glance at the report without having to clone the entire repo.
One last thing: in the README.md, I think the scripts would be easier to read and copy if each were placed on a separate line from the instructions.
Congratulations on building a model from start to finish and communicating your results in a clear and concise manner. Well done, group 7!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Feedback + Fix Commits:
We fixed a lot of issues this week; these are just a few. For the full list, please see the issue linked at the bottom of this comment.
List Authors in Analysis Doc: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/1b233a8e7cadbfab10e527b2954cf14624601f89
Add .md version of the analysis to render on Github: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/1b233a8e7cadbfab10e527b2954cf14624601f89
Round values in results table for more readability: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/670cc7004fba02fc463fe95336821ec945e4d243
Improve readability / make copying scripts easier in the README: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/6f202542287a3d04037d9d49a281a9ee8d0f6898
Add a script for Tidyverse installation in case it is not present on the reproducing device (tricky to do in the environment according to Florencia, easier to hard-install using a script): https://github.com/UBC-MDS/cervical-cancer-predictor/commit/6f202542287a3d04037d9d49a281a9ee8d0f6898
Add context for why we are minimizing Type II error in the analysis report, plus a brief explanation of the automation packaging: https://github.com/UBC-MDS/cervical-cancer-predictor/commit/0dc9c00d26e6f479a6965221b999adfd859d7791
For the full list of fixes made this week, see this issue: https://github.com/UBC-MDS/cervical-cancer-predictor/issues/41
Submitting authors: @samson-bakos, @morrismanfung, @WaielonH, @revathyponn
Repository: https://github.com/UBC-MDS/cervical-cancer-predictor
Report link: https://raw.githubusercontent.com/UBC-MDS/cervical-cancer-predictor/main/Analysis_Docs/Analysis.html
Abstract/executive summary:
Our research question is whether we can diagnose or stratify risk for cervical cancer based on lifestyle factors, sexual history, and comorbidities using a machine learning model. Cervical cancer is a potential long-term complication of the STI human papillomavirus (HPV). It is extremely prevalent among women in low-income countries in South America and Africa, where health services are not as readily available and proactive screening is not as common. As such, there is a niche for an early-identification diagnostic tool based on simple medical/lifestyle history.
The data is composed of survey results and medical records for 858 female patients from the 'Hospital Universitario de Caracas' in Caracas, Venezuela, alongside the results of four traditional diagnostic tests (e.g. biopsy), collected by Fernandes et al. Features include sexual history (such as number of partners, marital status, etc.), general carcinogens such as smoking, and STI comorbidities. Missing data is an issue for many responses due to the personal nature of some questions, further limiting our already small sample size.
We evaluated multiple ML models and tuned promising candidates before running them against the test set. We set an evaluation criterion of a minimum 0.8 recall at an operating point of 0.28 precision. This precision is deliberately low (only twice the population diagnosis rate) to allow for maximum recall, given the significant danger of a type II error, i.e. failing to identify a case of cervical cancer. Unfortunately, no model was able to meet our diagnostic criterion. The best model at the operating point was Naive Bayes with a recall of 0.58, and the best overall model across all thresholds was the random forest with an AUC of 0.65.
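(For illustration only: a minimal sketch of how recall at such a fixed-precision operating point might be read off a precision-recall curve with scikit-learn; the labels and probabilities below are random placeholders, not our actual model output.)

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Random placeholders standing in for the test labels and a fitted
# model's predicted probabilities.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=200)
probs = rng.random(size=200)

precision, recall, _ = precision_recall_curve(y_test, probs)

# Recall achieved at the first threshold whose precision reaches 0.28.
target_precision = 0.28
mask = precision[:-1] >= target_precision  # last point has no threshold
if mask.any():
    idx = np.argmax(mask)
    print(f"Recall at precision >= {target_precision}: {recall[:-1][idx]:.2f}")
```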
Further development could be pursued by investigating other models that may more accurately capture the relationship between the features and the target, as well as through feature engineering and transformation. Until then, there may be some application for our models as an ensemble risk-flagging system, but they are by no means able to replace standard medical diagnostic testing at this stage.
Editor: @flor14
Reviewers: Vincent Ho, Renee Kwon, Peng Zhang, Crystal Geng