UBC-MDS / data-analysis-review-2022

Submission: Group 0: breast_cancer_predictor #1

Open flor14 opened 1 year ago

flor14 commented 1 year ago

Submitting authors: Tiffany Timbers

Repository: https://github.com/ttimbers/breast_cancer_predictor

Report link: https://github.com/ttimbers/breast_cancer_predictor/blob/master/doc/breast_cancer_predict_report.html

Abstract/executive summary: Here we attempt to build a classification model using the k-nearest neighbours algorithm which can use breast cancer tumour image measurements to predict whether a newly discovered breast cancer tumour is benign (i.e., is not harmful and does not require treatment) or malignant (i.e., is harmful and requires treatment intervention). Our final classifier performed fairly well on an unseen test data set, with a Cohen’s Kappa score of 0.9 and an overall accuracy of 0.97. Of the 142 test data cases, it correctly predicted 138. However, it incorrectly predicted 4 cases, and importantly these cases were false negatives: it predicted that a tumour is benign when in fact it is malignant. These kinds of incorrect predictions could have a severely negative impact on a patient's health outcome, so we recommend continued study to improve this prediction model before it is put into production in the clinic.

The data set that was used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison (Street, Wolberg, and Mangasarian 1993). It was sourced from the UCI Machine Learning Repository (Dua and Graff 2017) and can be found here, specifically this file. Each row in the data set represents summary statistics from measurements of an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians.
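As a quick sanity check, the accuracy quoted in the abstract can be reproduced from the counts it reports. The sketch below uses only those two numbers (142 test cases, 138 correct); the variable names are illustrative and not taken from the repository. Cohen's Kappa is not recomputed here because it needs the full confusion matrix, which the abstract does not give.

```r
# Counts as stated in the abstract (variable names are hypothetical).
n_test    <- 142L   # test cases
n_correct <- 138L   # correct predictions

# Misclassified cases; the abstract says all of them are false negatives.
n_wrong <- n_test - n_correct

# Overall accuracy = correct predictions / total predictions.
accuracy <- n_correct / n_test

print(n_wrong)              # 4
print(round(accuracy, 2))   # 0.97
```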

Editor: @flor14

Reviewers: Gittu George, Alexi Rodríguez-Arelis, Varada Kolhatkar

flor14 commented 1 year ago

Data analysis review checklist

Reviewer: Gittu George

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. I really like the project so far! I would include more background information, or state the research question explicitly, to make it easier for the reader to understand what this is about.
  2. What do you think about using tidymodels instead of caret? It could be a good change, considering that many improvements have been included in those packages.
  3. The link to the dataset is not working for me; please check it.
  4. Adding tests could help to improve the reproducibility of the project.
  5. Have you considered citing the packages? You can use the function citation() to get that information directly from R.
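To illustrate item 5, here is a minimal sketch of how `citation()` can be used. It cites the `stats` package only because that package ships with every R installation, so the example runs anywhere; for this project one would presumably pass the names of the packages it actually loads (for example `"caret"`), which is an assumption about the repository rather than something checked here.

```r
# Cite R itself (base R; no extra packages needed).
print(citation())

# Cite a specific package. "stats" is a stand-in that is always installed;
# replace it with the packages the analysis really uses.
pkg_cite <- citation("stats")
print(pkg_cite)

# Convert the entry to BibTeX so it can be pasted into a .bib file.
bib <- toBibtex(pkg_cite)
print(bib)
```

The BibTeX output can then be added to the report's bibliography file and referenced like any other source.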

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.