DSCI-310 / data-analysis-review-2021


Submission: 5: Predicting Breast Cancer With Multiple Classification Algorithms #5

Open ttimbers opened 2 years ago

ttimbers commented 2 years ago

Submitting authors: @edile47 @clichyclin @nhantien @ClaudioETC

Repository: https://github.com/DSCI-310/DSCI-310-Group-5

Abstract/executive summary: The project seeks to solve the prediction problem of distinguishing benign from malignant tumors, motivated by the question "Is there a way to efficiently classify whether a tumor is malignant or benign with high accuracy, given a set of features observed from the tumor in its development stage?". We approached this problem with a predictive model. Our initial hypothesis was that such classification is possible but would carry a high error rate due to variation in tumor features. After performing EDA, including summary statistics, data cleaning, and visualization, we were able to spot clear distinctions between benign and malignant tumors in some features. We then tested multiple classification models and arrived at a K-Nearest-Neighbors model with tuned hyperparameters that achieved very good accuracy, recall, precision, and F1 score.
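The workflow the abstract describes (train/test split, hyperparameter tuning for K-Nearest-Neighbors, then evaluation on accuracy, precision, recall, and F1) can be sketched as follows. This is an illustrative sketch, not the authors' actual code: it stands in for their data with scikit-learn's bundled Wisconsin breast cancer dataset, and the specific split, scaler, and tuning grid are assumptions.

```python
# Illustrative sketch of the described approach, NOT the group's actual code.
# Uses scikit-learn's bundled Wisconsin breast cancer dataset as a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Tune the number of neighbors by 5-fold cross-validation.
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 21)}, cv=5)
grid.fit(X_train_s, y_train)

# Report the four metrics mentioned in the abstract on the held-out test set.
pred = grid.predict(X_test_s)
for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("f1", f1_score)]:
    print(f"{name}: {fn(y_test, pred):.3f}")
```

Distance-based models like KNN are sensitive to feature scale, which is why a scaling step precedes the classifier here.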

Editor: @ttimbers

Reviewer: @TimothyZG @hmartin11 @poddarswakhar @nkoda

TimothyZG commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2.5

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Overall, they clearly put a lot of work into the project; I'm especially impressed by the reliable workflow they created.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

poddarswakhar commented 2 years ago

Data analysis review checklist

Reviewer: @poddarswakhar

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.2 hours

Review Comments:


1.) Style guidelines: I believe there is room for improvement in the style of the script files. To be more specific, commenting parts of the code to briefly explain what each chunk is doing would make the code easier to understand and follow.

2.) For the data part, I couldn't find the source of the raw data. In the Makefile, the analysis simply reads the CSV from the directory; for some readers this might not be very transparent.

3.) In the analysis file I couldn't find the authors listed, so I couldn't check that box.

4.) Overall, well done! I really loved the analysis and the methodology used. The use of pipelines made some of the code simpler and avoided redundancy.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

hmartin11 commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:


- Overall, great job! Your report was interesting to read and easy to follow.
- The Docker instructions were easy to follow and worked well, making the project reproducible, which is a very important aspect!

nkoda commented 2 years ago

Data analysis review checklist

Reviewer: nkoda

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:


Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.