roanraina opened 1 year ago
1) The introduction was well done and very useful in providing context on the magnitude and scale of the problem, i.e. the WHY behind this approach. It may have been worth setting a target metric goal/value and quantifying the impact it could have in dollars saved, lives saved, etc., so that when you get a model you can ask whether it is "good enough".
2) In the report, under Analysis, it says you are prioritizing f1 and recall to reduce false positives. Optimizing for recall reduces false NEGATIVES, i.e. we don't want to miss a diagnosis. You state recall correctly in the EDA section, but in your conclusion you again describe recall as reducing false positives.
3) In the EDA plots for Age, Education, etc., the "oscillating" nature of the density plot is misleading, as it is only an artifact of the data being binned; the bandwidth of the density estimate should be increased to smooth out this artifact of the underlying binning. Also, overlapping filled plots aren't easy to read; perhaps change them to lines, with mean values as vertical lines, to support your conclusions in the discussion.
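To illustrate the bandwidth point, here is a minimal sketch using SciPy's gaussian_kde on made-up binned ages (the bin width and sample size are assumptions for illustration, not the project's actual data). A narrow bandwidth produces one bump per bin, while a wider one smooths the binning artifact away:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical survey-style data: ages recorded in 5-year bins
rng = np.random.default_rng(0)
ages = rng.choice(np.arange(20, 80, 5), size=500)

grid = np.linspace(15, 85, 200)
narrow = gaussian_kde(ages, bw_method=0.1)(grid)  # oscillates at every bin
smooth = gaussian_kde(ages, bw_method=0.5)(grid)  # wider bandwidth hides the bins

def n_peaks(y):
    """Count local maxima as a rough measure of oscillation."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

print(n_peaks(narrow), n_peaks(smooth))  # narrow has many more peaks
```

The same idea applies to Altair's `transform_density`, which also accepts a `bandwidth` argument.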
4) I would include each model's precision-recall curve on one plot and report the AUC score for each, as the default threshold may be skewing your recall results.
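A sketch of what that plot could look like, using synthetic data and illustrative model settings rather than the project's actual pipeline (average precision is used here as the usual summary of the precision-recall curve):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script can run anywhere
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score

# Synthetic imbalanced dataset standing in for the diabetes data
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC (RBF)": SVC(probability=True, random_state=0),
}

fig, ax = plt.subplots()
ap_scores = {}
for name, model in models.items():
    probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    prec, rec, _ = precision_recall_curve(y_te, probs)
    ap_scores[name] = average_precision_score(y_te, probs)
    ax.plot(rec, prec, label=f"{name} (AP={ap_scores[name]:.2f})")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.legend()
fig.savefig("pr_curves.png")
```

Plotting all curves on one axes makes it obvious whether a model's low recall is a threshold problem or a ranking problem.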
5) For scripts like prediction.py, break the steps up into standalone functions. This habit will make testing much easier in the future, and you can then also reuse those functions in notebooks by importing them from the script, e.g. in a notebook you can do something like from src.predict import train_svc, predict_svc, etc.
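A minimal sketch of what that refactor could look like. Only the names train_svc and predict_svc come from the suggestion above; the function bodies and hyperparameters are assumptions:

```python
# Hypothetical refactor of prediction.py into importable functions
from sklearn.svm import SVC

def train_svc(X, y, **svc_kwargs):
    """Fit an SVC on the training data and return the fitted model."""
    model = SVC(**svc_kwargs)
    model.fit(X, y)
    return model

def predict_svc(model, X):
    """Return class predictions from a fitted SVC."""
    return model.predict(X)

if __name__ == "__main__":
    # Script entry point: runs only when executed directly, so a notebook
    # can `from src.predict import train_svc, predict_svc` without
    # triggering the whole pipeline.
    from sklearn.datasets import load_breast_cancer
    X, y = load_breast_cancer(return_X_y=True)
    model = train_svc(X, y)
    print(predict_svc(model, X[:5]))
```

The `if __name__ == "__main__":` guard is what makes the functions importable without side effects.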
Great work and an interesting report!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
1. The descriptive analysis of the dataset is well done: they calculated the diagnosed diabetes percentage (95% CI), the undiagnosed diabetes percentage (95% CI), and the total diabetes percentage (95% CI). This is one of the report's biggest bright spots.
2. They used many models for comparative experiments: Dummy, Decision Tree, KNN, RBF SVM, and Logistic Regression. This gives a better, more comprehensive picture of how the data affects each model and makes the report more convincing.
3. They used area plots in the EDA section, which made a strong impression, but the choice of colours is not great; they are not bright enough to read easily.
4. The Results section is not perfect, because too few evaluation metrics are used. The confusion matrix is the most intuitive way to explain model performance, and it is the easiest metric to explain to people from different backgrounds, but they do not include one.
5. The Project Overview is well done; it clearly explains the problem, the solution, and the overall framework.
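On point 4, a confusion matrix is cheap to add; a minimal sketch with illustrative labels (not the project's data), showing how it surfaces the false negatives the reviewers care about:

```python
# Illustrative labels only: 1 = diabetes, 0 = no diabetes
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # 2x2 matrix flattens to tn, fp, fn, tp
print(f"false negatives (missed diagnoses): {fn}")
```

sklearn also ships `ConfusionMatrixDisplay` for rendering the same matrix as a figure for the report.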
It is satisfying to read your project, and I have also learned a lot from you. Great job!
The introduction and the research question were presented well by the team, and it is indeed a very interesting and crucial problem that could be alleviated to a certain extent with machine learning. The team has done well in setting the expectations and the necessity of addressing the problem and explaining how machine learning can help. They've used a collection of tables and plots that convey their results and highlight trends based on feature target interactions.
Possible areas for improvement:
Running the cleaning script clean.py threw a bunch of warnings in the terminal. It would be good to either address these warnings or suppress them if they are not applicable. Running diabetes_eda.py failed, and I was blocked from proceeding; maybe you could consider Joel's suggestion for saving Altair plots. The scripts could also be improved by refactoring the code into functions rather than having everything done in the main function. Adding tests to ensure that the artifacts required by the following scripts are created would make sure errors are caught early on.
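A lightweight way to catch missing artifacts early: a sketch of a helper each downstream script could call before doing any work (the path shown is hypothetical, not the repo's actual layout):

```python
from pathlib import Path

def check_artifacts(paths):
    """Return the paths that are missing or empty, so the caller can fail
    fast with a clear message instead of a confusing downstream error."""
    return [p for p in paths if not (p.exists() and p.stat().st_size > 0)]

# Hypothetical usage at the top of a downstream script:
missing = check_artifacts([Path("data/processed/clean_data.csv")])  # assumed path
if missing:
    print(f"Missing artifacts, run the earlier pipeline steps first: {missing}")
```

The same helper drops straight into a pytest test, so the pipeline's contract between scripts is checked automatically.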
Image with error:
Image with long list of warnings:
The introduction and EDA take up a major chunk of the report while the other areas lack detail. I would recommend making these sections more concise while adding more information on the methods, the analysis of the models, and the results.
Although the report contains sufficient tables and figures to convey the details well, they lack captions and are not numbered correctly. For example, the first table does not have a caption; there are four tables in total but only three have captions, and the plots are missing captions entirely. Also, as it is crucial to keep track of false negatives, including the confusion matrix in the discussion section would be valuable.
The clarity of the report can be improved by making sure that statements are accurate and correct. The report mentions that you are using a Dummy Regressor as the baseline, but this is a classification problem. The raw data contains 0, 1, and 2 as the possible values for the target, so I expected this to be a multi-class classification problem, and the EDA also describes three classes; during prediction, however, you seem to have treated it as a binary classification problem. It would be good to mention the intuition and assumptions behind this change in the report. Some other improvements include correcting the author names from "truetruetruetrue", reducing typos, explaining the need to use multiple models, and having a dedicated section that highlights the assumptions and limitations of the current analysis.
The interpretation of the scores and the concluding statements could be improved by briefly restating the research question and commenting on the models' false negatives and their impact on the predictions. The statement "When it comes to the recall scores, the SVC score was 0.792 compared to the 0.792 for logistic regression." should be corrected with the right values from the table.
Thank you for your feedback. You suggested reducing the introduction of the report so that it does not distract the readers, as the analysis and the results of the study are the most important parts. We understand and agree, and that is why we have fixed the introduction by keeping it short and concise. The commit of that change can be found here
Thank you for your feedback. You pointed out that the definitions of recall in the EDA section and in the results section were not the same. This is correct: in our study we are interested in predicting diabetes and are not overly concerned with false positives, which is why we are prioritizing recall. We have updated our definition of recall in the EDA and in the analysis so that they no longer conflict. The change can be found here
Thank you for your feedback. You highlighted that the scores for SVC and logistic regression were not correctly reported, and you also suggested that we improve the concluding statements. You are right: there was a mismatch between the SVC and logistic regression scores. We have fixed that and also improved the concluding statements as suggested; the proof of those changes can be found here
Thank you for your feedback. You mentioned that the results section is not perfect and that it uses too few evaluation metrics. We understand and agree. We have reviewed the whole section, improved several points in our report, and added a limitations section to capture the limitations of our study; you can view our change here
The reviewer mentioned that our report lacks captions, and that where captions are present they are not numbered properly. We understand how crucial this is, as it makes our work more organized and helps the reader follow the flow of our work. We have taken that feedback into account and labelled all the plots with accurate numbers and captions; this can be found here
Submitting authors: @roanraina @austin-shih @mehdi-naji @florawendy19
Repository: https://github.com/UBC-MDS/diabetes_prediction Report link: https://github.com/UBC-MDS/diabetes_prediction/blob/main/doc/diabetes_report.md Abstract/executive summary: The prevalence and risk of diabetes is a major health concern for everyone around the world. Various factors, including lifestyle, diet, and health information, can facilitate diagnosis of this disease. Due to advancements in data availability, modern data analysis techniques can be employed to speed up and improve the accuracy of disease diagnosis. In this report, we discuss our first attempt at predicting the diagnosis of diabetes based on standard machine learning methods. It is worth noting that this project is not original scientific research, and its results cannot be practically used or generalized. This is simply teamwork to cultivate what we have learned in the MDS program at UBC.
Editor: @flor14 Reviewer: @tieandrews @BruceUBC @rkrishnan-arjun @Althrun-sun