Submission: Group12: heart disease inferential study

Submitting authors: @kellywujy @stepanz25 @ZilongYi @BruceUBC

Repository: https://github.com/UBC-MDS/inferential_study_heart_attack

Report link: https://github.com/UBC-MDS/inferential_study_heart_attack/blob/main/doc/heart_disease_report.Rmd

Abstract/executive summary: In this project we attempt to find associations between the presence of heart disease and various demographic or health factors of the patients, including age, sex, chest pain type, and cholesterol levels. We perform permutation-based hypothesis tests for numerical variables such as age, the maximum heart rate achieved, and ST depression induced by exercise relative to rest, which is considered a proven ECG finding for obstructive coronary atherosclerosis (Lanza et al., 2004). Our original data set also included some categorical variables, and we conducted chi-squared tests to see whether these factors are related to the presence of heart disease.

Editor: @flor14

Reviewers: Shaun Hutchinson, Chen Lin, Tanmay Agarwal

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: Chen Lin @CChCheChen

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

This project, which looks for relationships between the presence of heart disease and potential patient factors, is informative. However, readers without a health or biology background, like me, may find some terms and abbreviations hard to follow, such as ECG or ST depression. Those terms could be explained up front if possible.
I particularly like the visual explanation of the files in the src folder, including their inputs and outputs. Well done, team!
The code is well constructed and documented, with the docstrings a user needs to understand its purpose. One suggestion is to follow the DRY (don't repeat yourself) principle to reduce code repetition. For example, the following sections of data_analysis.r could be improved:
- MEAN CI
- hypothesis testing with permutation
- chi-square testing
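As a rough illustration of the DRY suggestion above, the repeated permutation-test sections could be folded into one function that is applied to every numeric variable. This is a minimal sketch in Python rather than R, and it is not taken from data_analysis.r: the function names `permutation_test` and `summarise_columns`, the difference-in-means statistic, and the toy data are all assumptions made for the example.

```python
import random
from statistics import mean


def permutation_test(values, labels, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    `labels` holds 0/1 group membership (hypothetical encoding).
    Returns (observed_difference, estimated_p_value).
    """
    rng = random.Random(seed)

    def diff_in_means(group_labels):
        group1 = [v for v, g in zip(values, group_labels) if g == 1]
        group0 = [v for v, g in zip(values, group_labels) if g == 0]
        return mean(group1) - mean(group0)

    observed = diff_in_means(labels)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling the labels simulates the null hypothesis of no association.
        rng.shuffle(shuffled)
        if abs(diff_in_means(shuffled)) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


def summarise_columns(columns, labels, **kwargs):
    """Apply the same test to every numeric column instead of repeating it."""
    return {
        name: permutation_test(vals, labels, **kwargs)
        for name, vals in columns.items()
    }


# Made-up toy data, only to show the call pattern.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_columns = {
    "age": [41, 45, 50, 52, 58, 60, 63, 67],
    "thalach": [172, 168, 160, 158, 150, 141, 132, 125],
}
for name, (diff, p_value) in summarise_columns(toy_columns, toy_labels).items():
    print(name, round(diff, 2), round(p_value, 4))
```

The same pattern (one function, one loop over variable names) would apply to the confidence-interval and chi-square sections as well.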
The final report has tables that present the hypothesis testing results for both numeric and categorical factors, placing the conclusions for each factor side by side, which makes them easier to understand.
The plots in the final report could be faceted by class (heart disease present or not) to make the histograms more interpretable, especially where the distributions of the two classes overlap heavily, for example in the histogram of trestbps. The plot legend should also state clearly which class each entry represents, instead of showing 1 or 0.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Shaun Hutchinson @shaunhutch

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

I tried to create the environment from the environment.yml file, however, I ran into an error. `ResolvePackageNotFound:
- dataframe-image==0.1.3
- docopt==0.7.1`
It looks like dataframe-image should be dataframe_image. In addition, I believe that docopt should be listed under pip as docopt-ng rather than under conda, as per the installation instructions we received in lecture 2.
In trying to run your usage scripts I ran into a couple of issues. download_file.py has an indentation error at line 20. I believe the EDA_visualization.py script needs Seaborn, which is not listed in your dependencies. The pre_process script seems to reference the wrong script.
The organization of your repository is clear with the addition of the USAGE image. However, I think the results folder could benefit from further organization into subfolders such as eda and analysis. As a user, I found that looking at the titles was not enough; I had to refer to other documentation or open all the files to figure out which area of the project they pertained to.
I found all the code easy to read, with informative usage notes and examples that helped me understand how to run the individual scripts.
I think it would be useful for the EDA section of the report to describe what the numerical and categorical columns in the charts are. This could be done either in the chart titles or in the discussion around them. When reading these histograms, it is hard to tell what the variables are from the report alone.

Very interesting project, it was a pleasure to review!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Qurat-ul-Ain Azim @qazim1

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

Review Comments:

The project is presented very well. I can see that the group has put effort into even the tiniest details. I personally learned a lot reviewing this. The report is very well written. I particularly like how the limitations of the study and possible future improvements are discussed.
In the README file, the link to the EDA file is broken. It would be great if this were fixed.
The EDA charts in the report are great. However, it would help to briefly outline what the feature names mean so that they make more sense to the reader. For example, it's not immediately clear what oldpeak or, say, thalach represents.
A bit of column renaming for the tables in the report should help. For example, you could say Mean rather than sample_estimate and remove underscores too to help the reader.
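The renaming suggested above could look something like the following. This is only a sketch in Python, under the assumption that the table headers are available as a list of strings; the helper `tidy_column_names` and the column names other than sample_estimate are hypothetical, and the report itself may build its tables differently.

```python
def tidy_column_names(columns, overrides=None):
    """Return reader-friendly headers for a list of column names.

    Names listed in `overrides` are replaced outright; all others have
    underscores turned into spaces and the first letter capitalised.
    """
    overrides = overrides or {}
    tidy = []
    for name in columns:
        if name in overrides:
            tidy.append(overrides[name])
        else:
            tidy.append(name.replace("_", " ").capitalize())
    return tidy


# "lower_ci" and "p_value" are made-up examples of table columns.
headers = tidy_column_names(
    ["sample_estimate", "lower_ci", "p_value"],
    overrides={"sample_estimate": "Mean"},
)
print(headers)
```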
In the preprocessing script, I see that some missing values are imputed, while the authors say under the Dataset heading of the README that they'll be using cleaned and processed data. It would probably be best to reconcile the two statements by explaining explicitly what is done in the preprocessing script.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Tanmay Agarwal @tanmayag97

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

The EDA is well constructed and concise. However, dataframe.info() shows null values, while the analysis says there are no null values in the dataset. It would be great if you could correct this minor mistake.
While the choice of EDA plots is interesting, you could also plot a correlation heatmap of the numerical features for each target class, i.e. heart disease and no heart disease. This would provide further insight into the explanatory and response variables.
Overall, the project is well organized. I especially like the usage section showing the locations of all the files, which is truly commendable. Great job folks!
There should be a little explanation of the various explanatory variables, such as trestbps and restecg, for readers without a medical background. This would increase the readability of the overall project and help people understand the work better.
While the overall report is presented in a clean and effective way, I feel that the references are too cluttered and could be organized in a better way (like adding numbers, or handling spacing/indenting).

Really loved the idea and the overall report. There is great scope in this project.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Feedback Review

Hello @qazim1,

Thank you for the valuable feedback you provided to our group. Your opinion is very important to us, and we would like to address some of the points you mentioned above, in particular the issue with the missing values. We agree that there may have been some confusion about how our collaborators dealt with them. The Cleveland data set we are working with was partially pre-processed, as it is commonly used by the data science community; however, some missing values were still present. Our strategy was to mark those missing values as 'NaN' and then drop the rows containing them, as there were not many. The rationale for this is that missing values in our dataset could introduce errors and biases into the statistical analysis we wanted to conduct. We have added more clarification on this to our README.md file and hope this makes it clearer. Thank you for reviewing our project!

Correction Commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/0670082a3808c6ec837cddd95f2296733a4f9a46
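The two preprocessing steps described in the response above (mark missing values, then drop the affected rows) can be sketched as follows. This is an illustration only, not the code from the repository's preprocessing script: the helper `drop_incomplete_rows` is hypothetical, the `"?"` missing-value marker is an assumption about the raw file, and `None` stands in for NaN.

```python
def drop_incomplete_rows(rows, missing_markers=("?", "")):
    """Mark missing cells as None, then keep only fully observed rows."""
    cleaned = []
    for row in rows:
        # Step 1: replace missing-value markers with None (the NaN stand-in).
        normalised = [
            None if str(cell).strip() in missing_markers else cell
            for cell in row
        ]
        # Step 2: drop any row that still contains a missing entry.
        if all(cell is not None for cell in normalised):
            cleaned.append(normalised)
    return cleaned


# Made-up rows, only to show the behaviour.
raw_rows = [
    ["63", "1", "0"],
    ["67", "?", "2"],
    ["41", "0", ""],
]
print(drop_incomplete_rows(raw_rows))
```

Dropping rows is only reasonable when few rows are affected, which is the situation the authors describe.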

Hello @CChCheChen,

Thank you for your valuable feedback on our team project. We would like to address some of the comments you made above, in particular that the project may be difficult to understand for someone without a health or biology background. To improve the project, we added more glossary-style explanations of medical terms and abbreviations to our report, so that it is easier for our intended audience to read and understand. To address the ones you mentioned: ECG, or electrocardiography, is a diagnostic test that measures the electrical activity of the heart. ST depression is a specific pattern seen on an ECG that can be a sign of heart disease. We agree that clear explanations of these terms are important for making the project accessible to a wider audience. Thank you for reviewing our project.

Correction Commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/49211f799fd91e5c3a32cd85a9912456244b23b7

Hello @tanmayag97,

Thank you for your valuable feedback to our team. We would like to address some of the points you mentioned in your review. We agree that the explanatory variables used in this study might not be familiar to people without a biology background. Therefore, we have added more detailed explanations to our report. The explanatory variables in this project are medical measurements or characteristics of the study participants.

Here is a brief overview of some of the ones you mentioned above:

trestbps: This stands for "resting blood pressure," which is the blood pressure measured after the person has been sitting or lying down for a few minutes.
restecg: This stands for "resting electrocardiography," which is a test that records the electrical activity of the heart. It is often used to diagnose heart conditions, such as arrhythmias (abnormal heart rhythms) or to assess the overall health of the heart.

We hope this will provide you with some context for the variables used in this project. Please see the correction commit if you require further clarifications or feel free to ask. Thank you for reviewing our project.

Correction Commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/ce0495f478c698151b6488724c7eeb68f56eb370

Hello @shaunhutch,

Thank you for the suggestion! When presenting data in a report, it is always a good idea to provide background information on the variables being shown. We agree that with this information, readers can quickly understand the content of a chart without needing to refer back to the data or the results section of the report. Therefore, we have added background information on the explanatory variables at the top of the report. As you suggested, we also included a short description of each type of variable, continuous or categorical, right before the graphs. We placed it before the graphs rather than in the titles because we didn't want to clutter the graphs themselves. Thank you for reviewing our project, and feel free to contact us if you have any further questions.

Correction commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/a6e22f674677ff4a11f264c61c815304a50027d6

Thank you for your additional feedback, @shaunhutch. It is important for users to be able to navigate our project repository easily, and organizing files into subfolders makes it easier to find what you're looking for. This is especially true for larger projects with lots of files. We agree that for our analysis project, subfolders for different stages of the analysis, such as data exploration (EDA) and modeling, are helpful. This way, users can find the files they need without having to open them all or refer to additional documentation. Therefore, we decided to split our results folder further into EDA and analysis portions. Thank you for your feedback.

Correction commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/b102bc6ee786285e98877f20283192e1a0ee8b9a

UBC-MDS / data-analysis-review-2022