UBC-MDS / data-analysis-review-2022


Submission: 17: heart_disease_predictor #2

Open elenagan opened 1 year ago

elenagan commented 1 year ago

Submitting authors: @Natalie-cho @Yurui-Feng @elenagan @tzoght

Repository: https://github.com/UBC-MDS/heart_disease_predictor

Report link: https://github.com/UBC-MDS/heart_disease_predictor/blob/main/book.pdf

Abstract/executive summary: Responsible for 16% of the world's total deaths in 2019, heart disease is the world's leading cause of death according to the World Health Organization. The development of heart disease cannot be attributed to a single factor in isolation, which makes early detection difficult given the many risk factors involved.

The goal of this project is to use the Heart Disease UCI dataset from the UC Irvine Machine Learning Repository to answer the question: given common early signs and physiological indicators such as chest pain, blood pressure, or resting ECG, can we predict the presence of heart disease?

Answering this question may aid in the early detection of heart disease and support earlier treatment, which is crucial to improving an individual's chances of survival.

Editor: @flor14

Reviewers: Luke Yang, Caesar Wong, Xinru Lu, Manvir Kohli

lukeyf commented 1 year ago

General

Hello, Group 17. Congratulations on your work on this heart disease predictor. Below are my comments on your project!

Data analysis review checklist

Reviewer: @lukeyf

Conflict of interest

Code of Conduct

General checks

Comments:

The src directory concisely contains the four files used in the analysis pipeline. The structure is clear and no files are nested too deeply from the project root.

Documentation

Code quality

Comments:

Yep. Functions are well-written and well-documented. The scripts are modular with helper functions.

Reproducibility

Comments:

The source code in src is clear about which file to call. I was able to execute everything up to the analysis, but when I tried to generate the report it returned the error pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded. I am not sure whether this is specific to my machine, so if others hit a similar problem please note it.

Analysis report

Comments:

The writing was coherent and concise. The EDA was not overwhelming and the results are clear. However, I noticed that in your book.pdf one of the tables is cut off because it is too long. I suggest removing some of the unnecessary content, such as the standard deviations, to show only the mean test/train scores.
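A minimal sketch of how such a table could be trimmed before rendering, assuming the cross-validation summary lives in a pandas DataFrame (the row labels and values here are made up, not the project's actual results):

```python
import pandas as pd

# Hypothetical cross-validation summary: rows are statistics, columns are models.
cv_results = pd.DataFrame(
    {
        "LogisticRegression": [0.86, 0.02, 0.84, 0.03],
        "SVC": [0.88, 0.01, 0.85, 0.02],
    },
    index=["mean_train_score", "std_train_score", "mean_test_score", "std_test_score"],
)

# Keep only the mean train/test scores so the table fits on the page.
trimmed = cv_results.loc[["mean_train_score", "mean_test_score"]]
print(trimmed)
```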

Estimated hours spent reviewing: 1

Review Comments:

The comments can be summarized in the points below:

Overall, the project is in good shape and close to completion. The scripts are very solid and the analysis is quite insightful. There are a few things I mentioned in the previous comments; if you have time, please consider addressing them.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

caesarw0 commented 1 year ago

Data analysis review checklist

Reviewer: @caesarw0

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:


Note: the whole evaluation is done using the Milestone 3 release version of the project repository (tag: 0.2.0, commit: 4ba76b9)

  1. First of all, I really appreciate the level of detail and effort your team has put into this project. The whole project is well constructed, with comprehensive materials and extra components. I personally like the code documentation and the linkage between the different modules.

  2. I notice there are two functions, model and test, in src/model.py. Since these correspond to the two main stages of the machine learning pipeline, I would suggest splitting model training and model testing into two separate scripts. Decoupling the two lets the user choose whether to train or to test the model, which provides more flexibility (see the first sketch after this list).

  3. In terms of code modularity and readability, there is a save_chart function defined inside the eda function in src/eda.py. It would be better to separate save_chart from that function, or to create a utility module for organizing helpers like save_chart, so that other scripts can also reuse the chart-saving logic (see the second sketch after this list). This would improve code readability and scalability.

  4. In the README file, the team mentions using 4 machine learning models; however, 2 of the models are missing from the actual implementation. Perhaps more models could be included in the analysis if time permits, but personally I think 2 models are enough for this project.

  5. There is a minor issue when I run the make all command: when the repository is cloned into a path that contains spaces (e.g. C:\Users\abc\UBC MDS\DSCI 522 Workflows\git), some extra folders are generated (see below).

[Screenshot: extra folders created when the repository path contains spaces]
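For point 2, one possible shape for the decoupled training step, assuming a hypothetical train_model.py with an argparse CLI, a logistic regression estimator, and a "target" column name (the project's actual signatures and column names may differ):

```python
# train_model.py -- hypothetical standalone training script
import argparse
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression


def main(train_path, model_path):
    # Read the preprocessed training split and fit a model on it.
    train_df = pd.read_csv(train_path)
    X, y = train_df.drop(columns=["target"]), train_df["target"]
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Persist the fitted model so a separate test script can load it later.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("train_path")
    parser.add_argument("model_path")
    args = parser.parse_args()
    main(args.train_path, args.model_path)
```

A matching test_model.py would then load the pickled model and score it on the test split, so either step can be rerun independently and wired up as separate Makefile targets.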
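For point 3, a minimal sketch of pulling save_chart into a shared helper module; the function body here is an assumption based on the usual Altair saving pattern, not the project's actual implementation:

```python
# src/utils.py -- hypothetical shared helpers
import altair as alt


def save_chart(chart: alt.Chart, filename: str, scale_factor: float = 2.0) -> None:
    """Save an Altair chart; image output uses the given scale factor."""
    if filename.endswith(".html"):
        chart.save(filename)
    else:
        chart.save(filename, scale_factor=scale_factor)
```

src/eda.py (and any other script) could then do from utils import save_chart instead of redefining the helper inside eda.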

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Lorraine97 commented 1 year ago

Data analysis review checklist

Reviewer: @Lorraine97

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hr

Review Comments:


  1. In the EDA, when plotting the relationship between two categorical variables, or between a categorical variable and a numeric variable, it would be better to use mark_square, mark_rect, etc. to visualize the distribution over the values; scatter plots do not show the distribution effectively here (see the first sketch after this list).
  2. Since parameter optimization is done for the models, it might be helpful to include the specific parameters being used in the results table for model selection. In this way, we know that the models being compared are already optimized.
  3. A random idea: maybe you could use ANOVA to show that one model is significantly better than another (see the second sketch after this list)?
  4. The colors in the cross-validation result plot in the "Test Results" section of book.pdf do not seem to follow color theory. I might be wrong, but would it be better to use one hue and differentiate by saturation?
  5. Nothing else stands out to me. It is really nice work!! Good luck with the rest of the project.
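For point 1, a minimal sketch of the heatmap-style view being suggested, using made-up data and a hypothetical chest_pain_type column name:

```python
import altair as alt
import pandas as pd

# Stand-in data; the real EDA would use the training split instead.
train_df = pd.DataFrame(
    {
        "chest_pain_type": ["typical", "atypical", "non-anginal", "typical", "atypical"],
        "target": [1, 0, 0, 1, 1],
    }
)

# Counts of each categorical value per target class, shown as a shaded grid.
heatmap = (
    alt.Chart(train_df)
    .mark_rect()
    .encode(
        x="chest_pain_type:N",
        y="target:N",
        color="count():Q",
    )
)
heatmap.save("chest_pain_vs_target.html")
```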
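For point 3, one way to formalize that idea is a one-way ANOVA on the per-fold cross-validation scores of each model; the fold scores below are made up for illustration:

```python
from scipy.stats import f_oneway

# Hypothetical per-fold F1 scores from 5-fold cross-validation.
logreg_scores = [0.84, 0.86, 0.85, 0.87, 0.83]
svc_scores = [0.88, 0.87, 0.89, 0.86, 0.88]

stat, p_value = f_oneway(logreg_scores, svc_scores)
print(f"F = {stat:.3f}, p = {p_value:.3f}")
```

One caveat: cross-validation fold scores are not fully independent samples, so any p-value from this should be read as a rough guide rather than a formal test.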

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

manvirsingh96 commented 1 year ago

Data analysis review checklist

Reviewer: @manvirsingh96

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

  1. The EDA and final reports are nice and concise, with proper headings for each section. Maybe try to have less code in the EDA file and, if possible, no code in the final report.
  2. The plots used, especially in the final report, are well labelled and easy to follow and interpret.
  3. I believe the README file is not up to date, as there is a mismatch in the name of the Python script used to download the data. The README mentions running the script "download_data.py" in the src directory, but the src directory does not contain this script. Instead there is a "fetch_dataset.py", which I believe is the script meant to download the dataset.
  4. It may be helpful to include the URL needed to download the dataset in the README, and to include it when stating how to run "download_data.py". The current link in the README points to Kaggle and appears to be broken.
  5. Your report mentions there is a slight class imbalance, which is why the metric used is the F1 score. Given the problem at hand, you could also try addressing the class imbalance and using recall as the metric of choice, since you are already calculating it (see the first sketch after this list).
  6. The results from the correlation plots state that there is a correlation between "max_hr_achieved" and the target. However, I could not find this feature in the EDA; I believe you are renaming an existing feature. To avoid confusion, it may be helpful to list the final features with their names and data types either at the end of your EDA or at the beginning of the final report.
  7. Coming back to the correlation plots, the correlation method used is Spearman correlation, which I believe is meant for ordinal/ranked variables. Here, however, the correlation is being calculated between a continuous variable (max_hr_achieved) and a binary variable (heart disease vs. no heart disease), so the resulting correlation may not be interpretable. If it is, please include the reasoning behind using this metric (see the second sketch after this list).
  8. A minor suggestion would be to give a meaningful name to the final report. Currently it is named "book.pdf", which is not very intuitive, and the README does not explicitly state what the final report is called. Someone browsing the repository would find it difficult to identify the report.
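For point 5, a minimal sketch of combining both suggestions in scikit-learn, using synthetic stand-in data and a logistic regression as the example estimator (the project's actual pipeline and preprocessing may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in imbalanced data; the real project would use its training split.
X, y = make_classification(n_samples=300, weights=[0.6, 0.4], random_state=522)

# class_weight="balanced" reweights examples inversely to class frequency,
# and scoring="recall" makes recall the metric used to compare models.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
recall_scores = cross_val_score(model, X, y, scoring="recall", cv=5)
print(recall_scores.mean())
```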
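For point 7, if a correlation-style number between a continuous feature and the binary target is still wanted, the point-biserial correlation (a special case of Pearson correlation) is a commonly used alternative to Spearman in this situation; a sketch with made-up data and the hypothetical max_hr_achieved name:

```python
import numpy as np
from scipy.stats import pointbiserialr, spearmanr

rng = np.random.default_rng(522)
max_hr_achieved = rng.normal(150, 20, size=100)       # stand-in continuous feature
heart_disease = (max_hr_achieved < 145).astype(int)   # stand-in binary target

# Compare the point-biserial and Spearman estimates on the same pair of variables.
r_pb, _ = pointbiserialr(heart_disease, max_hr_achieved)
rho, _ = spearmanr(heart_disease, max_hr_achieved)
print(f"point-biserial r = {r_pb:.2f}, Spearman rho = {rho:.2f}")
```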

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

elenagan commented 1 year ago