Submission: Group 2: diabetes classification model

Submitting authors: @ella-irene @scout-mckee @angelachenmo @s-voon

Repository: https://github.com/UBC-MDS/diabetes_classification_model

Report link: https://ubc-mds.github.io/diabetes_classification_model/diabetes_classification_model_report.html

Note: The analysis takes a long time to run because of the size of the data set. To run the analysis with a smaller training set, use a larger value for --split-ratio in the first terminal command:

    python scripts/download_split_data.py \
        --id=891 \
        --write-to=data/raw \
        --random=123 \
        --split-data-to=data/processed \
        --split-ratio=0.35

Abstract/executive summary: In this project, we build models for predicting diabetes. We try several models: logistic regression, k-nearest neighbours (k-NN), and a decision tree. We perform hyperparameter optimization for the decision tree and k-NN models. We also use the logistic regression model to explore which features are most important for the classification.

Editor: @ttimbers

Reviewers: Weiran Zhao, Ian MacCarthy, Kiersten Gilberg, Rachel Bouwer

[ ] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @rbouwer

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2.5

Review Comments:

General Checks

I don’t think you should have the .DS_Store or .Rhistory files in your repo (you could add these to your .gitignore file). I believe I only saw the .Rhistory file in the root; however, I saw .DS_Store in the root as well as in the following folders: data/, report/, results/ (results/models/, results/tables/), src/, and tests/.
I also saw that you have .ipynb files from old Milestones in your src/ and doc/ folders. I’m not sure of the exact protocol for archived files, but maybe you could add some documentation stating that these are old versions of your current analysis (or group them all in an archive folder).
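
For the stray .DS_Store and .Rhistory files mentioned above, a rough sketch like the following could help list them before they are added to .gitignore and removed from version control. This is only an illustration and not something from your repo: the function name find_stray_files and the STRAY_NAMES set are hypothetical.

```python
from pathlib import Path

# File names flagged in this review; extend the set as needed.
STRAY_NAMES = {".DS_Store", ".Rhistory"}


def find_stray_files(root="."):
    """Return sorted paths under `root` whose file name is in STRAY_NAMES."""
    root_path = Path(root)
    return sorted(
        path
        for path in root_path.rglob("*")
        if path.is_file() and path.name in STRAY_NAMES
    )


if __name__ == "__main__":
    # Print every stray file found below the current directory.
    for path in find_stray_files("."):
        print(path)
```

Running it from the repository root should print one path per stray file, which makes it easy to confirm that none are left after cleaning up.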

Documentation

Great README! To make it clearer how to run the analysis, you could separate those instructions from the ‘Method 2’ section, since they also apply to those using ‘Method 1’.
For the license, I only saw the MIT part (my group did the same and was advised to include the Creative Commons part as well).

Code Quality

Your scripts are super easy to follow! You could also add default values to your click options to make it easier for users to reproduce your default results.
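
As a minimal sketch of what I mean by defaults, assuming the option names from the command in your README note: the default values below are simply copied from that example command and may not be the values you actually want, and the function body is a placeholder rather than your real download and split logic.

```python
import click


@click.command()
@click.option("--id", "dataset_id", type=int, default=891, show_default=True,
              help="ID of the data set to download (assumed default).")
@click.option("--write-to", type=str, default="data/raw", show_default=True,
              help="Directory for the raw data (assumed default).")
@click.option("--random", "random_state", type=int, default=123,
              show_default=True, help="Random seed (assumed default).")
@click.option("--split-data-to", type=str, default="data/processed",
              show_default=True, help="Directory for the split data.")
@click.option("--split-ratio", type=float, default=0.35, show_default=True,
              help="Split ratio passed to the train/test split.")
def main(dataset_id, write_to, random_state, split_data_to, split_ratio):
    """Placeholder body: echo the resolved options instead of doing the work."""
    click.echo(
        f"id={dataset_id} write_to={write_to} random={random_state} "
        f"split_data_to={split_data_to} split_ratio={split_ratio}"
    )
```

With defaults in place, the script can be run without any flags to reproduce the default results, and show_default=True makes the defaults visible in the --help output. The real script would still call main() under the usual if __name__ == "__main__": guard.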

Tests

You seem to only have if __name__ == "__main__": pytest.main() in your test-get-feature-importance.py (and not in the other tests). You could add this to all tests for consistency.
It appears as though the bottom test in the tests/ folder is a duplicate of the test file above it.

Automation

The analysis steps run in the terminal without throwing any errors (although for some reason they took over an hour to run on my laptop).
For full reproducibility, it could also be good to add a command in or after the analysis section of your README showing how to build the HTML report (like the jupyter-book command example in tiff’s repo README).

Analysis Report

I really enjoyed reading your analysis report and everything is very well laid out and all of the figures, hyperlinks, citations, and glue objects look great!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @Kierst01

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[ ] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[X] Modularity: Is the code suitably abstracted into scripts and functions?
[X] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[X] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

Overall, I think this project is very well done. The clarity of the README file made it easy to follow along with reproducing the analysis. I also thought that the narrative of the report was clear and the necessary components were explained well. There are a couple of things I noticed that could be added or fixed:

The instructions for how to use the environment and container are clear, but the list of dependencies is missing. Additionally, I could not open JupyterLab from the provided environment (I instead opened it using the base environment and then used the environment's kernel within JupyterLab, so this could be something to document in case others encounter the same thing).
Although the authors' names are on the analysis file, they are not on the rendered report.
The methods section is missing the assumptions and limitations of the methodology.
Figure 3 appears to be the wrong image: it displays Figure 2 again rather than the coefficient values.
Some of the references are missing DOIs.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @weiranzhao97

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hr

Review Comments:

The overall structure of the analysis and the reproducibility of the project are good. However, the report could deliver more insights and state possible further improvements at the end.
It would be preferable to add a Creative Commons license to your LICENSE file, and all references should include a DOI.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: ianm99

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2h

Review Comments:

This is a really cool project with very relevant implications! With that in mind I think you could spend some more time clarifying what the application of your model would be, since this could really grab a reader's attention. The title makes it sound like a prediction tool for diagnosis whereas the analysis looks a little like it should have been an inferential tool for identifying risky lifestyles.
I'm assuming the references with no link were accessed through a database with no DOI available.
I would be interested to see a brief summary of how the general health index is calculated since you identified it as being quite important to your model.
I like your readme file a lot, and your scripts are very nicely laid out.
I think your set-up instructions have suffered from piecemeal editing 😝 . You require all users to install docker regardless of whether it will be used and seem to indicate that the repo should be cloned multiple times. Totally a small thing but also the sort of thing that would send me over the edge if I weren't familiar with the software already.
Container seemed to set up properly but the analysis notebook took 45 min to run before I eventually interrupted it. Not sure whether there is a bug somewhere or whether my laptop just can't handle the computation.
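
One way to narrow down where the time goes would be to time each step of the analysis separately. The sketch below is only an illustration under my own assumptions: the timed helper and the step labels are hypothetical and not taken from your notebook or scripts, and the loop stands in for a real step such as model fitting.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log):
    """Record the wall-clock seconds spent inside the with-block in `log`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log[label] = time.perf_counter() - start


if __name__ == "__main__":
    timings = {}
    # Stand-in for a real analysis step (hypothetical label).
    with timed("placeholder_step", timings):
        total = sum(i * i for i in range(100_000))
    for label, seconds in timings.items():
        print(f"{label}: {seconds:.3f} s")
```

Wrapping each stage (data split, hyperparameter search, final fit) this way would show whether one step dominates the run time or whether the whole pipeline is simply slow on a laptop.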

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.
