Final Peer Review-Bingxin Wu (bw383)

The focus of the project is to use the medical and behavioral information of patients to predict their potential risk of getting cervical cancer. The dependent variable that the group wanted to look at is whether or not the patient was recommended to undergo a biopsy, as it is a metric for cancer risk.

Things that are really nice in the project:

For the introduction part, I like how the group clearly explained why the project is important for future patients, and started the project by intuitively choosing factors for visualization.
I like how the group chose a wide range of appropriate methods to use, and decided to have each method applied to both cancerous and non-cancerous data. It is a very smart and responsible step to take. I also appreciate the group's attempt to control for the effect of rare cancer incidence.
For the combating overfitting and extension part, the group’s effort in understanding the high accuracies is very important and well thought out. After consulting with a practicing oncologist, the group found out that the Hinselmann, Schiller and Cytology test results actually skewed the accuracy results, and they adjusted the models accordingly.

Improvements needed for the project:

The group should have their actual project name listed as their report name instead of the name of the course. Also, since this report is relatively long, it would be really useful for the viewers if the group could include a table of contents at the beginning.
For the data cleaning part, I think the group could change their wording: the report states that they would “ignore these data points”, which can easily be misunderstood. As for the actual method for filling in missing data, I think the group needed to look into the error it introduces, instead of simply saying that their method produces the least bias without justification. The method the group used sounds like a combination of hot-deck imputation and filling in the column means, both of which can be problematic since they change the original data distribution and can create bias. If conducting unsupervised learning is hard and time-consuming, one method commonly used in statistics is to fill in the missing entries with values drawn at random from other complete rows, repeat this 10 to 20 times, and conduct the same analysis on each completed dataset. Then, the group can look at how the coefficients and results change across the completed datasets.
For the conclusion part, it would be nice to have all results from the different methods listed in the same table, in order to give the viewers a better understanding. Also, as mentioned in the earlier Extension section, the group could have mentioned in the Promise for Commercial Application section that the models are statistically significant but not as practically significant due to the nature of the dataset (high correlation between the 3 tests and the biopsy recommendation).
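
To make the repeated random fill-in suggested in the data cleaning comment more concrete, here is a minimal sketch in Python. It is not the group's method or code: the function names (`hot_deck_impute`, `slope`, `repeated_imputation`), the toy two-column data, and the choice of a simple regression slope as "the analysis" are all assumptions made for illustration only.

```python
import random
import statistics


def hot_deck_impute(rows, rng):
    """Return a copy of rows where each missing value (None) is replaced
    by the value in the same column of a randomly chosen complete row."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row to draw donors from")
    filled = []
    for r in rows:
        donor = rng.choice(complete)  # one random donor row per record
        filled.append([donor[j] if v is None else v for j, v in enumerate(r)])
    return filled


def slope(rows):
    """Least-squares slope of column 1 on column 0 (stand-in analysis)."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def repeated_imputation(rows, analysis, m=20, seed=0):
    """Impute m times and run the same analysis on each completed dataset."""
    rng = random.Random(seed)
    return [analysis(hot_deck_impute(rows, rng)) for _ in range(m)]


# Hypothetical toy data (not from the project); None marks a missing entry.
toy = [
    [1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8],
    [None, 5.0], [2.5, None], [4.5, None],
]

estimates = repeated_imputation(toy, slope, m=20)
print("mean slope across imputations:", statistics.fmean(estimates))
print("spread (std. dev.) across imputations:", statistics.stdev(estimates))
```

A large spread across the completed datasets would suggest that the results are sensitive to how the missing values are filled in, which is the kind of check the report's claim of "least bias" currently lacks.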

Overall, the project looks decent and well thought out.
