UBC-MDS / data-analysis-review-2022


Submission: Group12: heart disease inferential study #14

Open kellywujy opened 1 year ago

kellywujy commented 1 year ago

Submitting authors: @kellywujy @stepanz25 @ZilongYi @BruceUBC

Repository: https://github.com/UBC-MDS/inferential_study_heart_attack

Report link: https://github.com/UBC-MDS/inferential_study_heart_attack/blob/main/doc/heart_disease_report.Rmd

Abstract/executive summary: In this project we attempt to find the association between the presence of heart disease and various demographic or health factors of the patients, including age, sex, chest pain type, and cholesterol levels, among others. We perform permutation-based hypothesis tests for numerical variables such as age, the maximum heart rate achieved, and ST depression induced by exercise relative to rest, which is considered a proven ECG finding for obstructive coronary atherosclerosis (Lanza et al., 2004). Our original data set also included some categorical variables, and we conducted chi-squared tests to see whether these factors are related to the presence of heart disease.

Editor: @flor14 Reviewer: < Hutchinson Shaun > < Lin Chen > < Agarwal Tanmay >

CChCheChen commented 1 year ago

Data analysis review checklist

Reviewer: Chen Lin @CChCheChen

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. This project, which looks for relationships between the likelihood of heart disease and potential patient factors, is informative. However, it could be hard for people without a health or biology background, like me, to understand some terms or abbreviations, such as ECG or ST depression. Those terms could be explained up front if possible.
  2. I particularly like the visual explanation of the files in the src folder, including their input and output visually. Well done team!
  3. The code is well constructed and documented, with the docstrings a user needs to understand its purpose. One suggestion would be to follow the DRY coding principle to reduce code repetition. For example, the following sections of data_analysis.r could be improved:
    • MEAN CI
    • hypothesis testing with permutation
    • chi-square testing
  4. The final report uses tables to present the hypothesis testing results for both numeric and categorical factors. Placing the conclusions for each factor side by side makes them easier to understand.
  5. The plots in the final report could be faceted by class (having heart disease or not) to make the histograms more interpretable, especially when they overlap heavily for both classes, as in the histogram of trestbps. The plot legend should also state clearly which class each value represents, instead of 1 or 0.
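As a sketch of the DRY suggestion in point 3, the repeated permutation-test sections could be collapsed into one helper that is called once per numeric variable. The example below is in Python using only the standard library, not the project's R code in data_analysis.r. The function name `permutation_test`, the difference-in-means statistic, the column name `max_heart_rate`, and the toy values are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two relabelled groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy values standing in for numeric columns split by diagnosis (made up).
variables = {
    "age": ([63, 67, 62, 58, 60, 65], [41, 44, 52, 45, 39, 48]),
    "max_heart_rate": ([108, 129, 120, 111, 125, 114], [172, 178, 162, 170, 168, 175]),
}

# One loop replaces a copy-pasted block per variable.
results = {name: permutation_test(a, b) for name, (a, b) in variables.items()}
for name, (diff, p) in results.items():
    print(f"{name}: difference in means = {diff:.2f}, p = {p:.4f}")
```

The same pattern (one function, one loop over variable names) would apply to the mean confidence interval and chi-square sections.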

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

shaunhutch commented 1 year ago

Data analysis review checklist

Shaun Hutchinson: @shaunhutch

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. I tried to create the environment from the environment.yml file; however, I ran into a `ResolvePackageNotFound` error for the following packages:

    • `dataframe-image==0.1.3`
    • `docopt==0.7.1`

    It looks like dataframe-image should be dataframe_image. In addition, I believe that docopt should be under pip as docopt-ng rather than in conda as per the installation instructions we received in lecture 2.

  2. In trying to run your usage scripts I ran into a couple of issues. download_file.py has an indentation error at line 20. I believe that you need Seaborn to run the EDA_visualization.py script, but it is not listed in your dependencies. The pre_process script also seems to reference the wrong script.
  3. The organization of your repository is clear with the addition of the USAGE image; however, I think that the results folder could benefit from further organization into subfolders such as eda and analysis. I found that, as a user, looking at the file titles was not enough: I had to refer to other documentation or open all the files to figure out which area of the project they pertained to.
  4. I found all the code easy to read, with informative usage notes and examples that helped me understand how to run the individual scripts.
  5. I think it would be useful in your EDA section of the report to reference what the numerical columns and categorical columns in the charts are. This could be done either in the Chart Titles or in the discussion around it. When reading these histograms it is hard to tell what these variables are from the report alone.
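A quick pre-flight check can surface the kind of missing-dependency problem described in points 1 and 2 before any script is run. The sketch below is only an illustration: the helper name `missing_modules` is hypothetical, and the list of import names is assumed from the comments above rather than taken from the project's scripts.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names in module_names that cannot be imported here."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names assumed from the review comments; adjust to the scripts' real imports.
required = ["seaborn", "docopt", "dataframe_image"]

missing = missing_modules(required)
if missing:
    print("Missing from the active environment:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Running this inside the freshly created environment would show which packages still need to be added to environment.yml.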

Very interesting project, it was a pleasure to review!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

qurat-azim commented 1 year ago

Data analysis review checklist

Reviewer: Qurat-ul-Ain Azim @qazim1

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

tanmayag97 commented 1 year ago

Data analysis review checklist

Reviewer: Tanmay Agarwal @tanmayag97

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hr

Review Comments:

  1. The EDA is well constructed and concise. However, dataframe.info() shows null values, while the analysis says there are no null values in the dataset. It would be great if you could rectify this minor mistake.

  2. While the choice of EDA plots is interesting, you could also plot a correlation heatmap of the numerical features for each target class, i.e. heart disease and no heart disease. This would provide further insight into the explanatory and response variables.

  3. Overall, the project is well organized. I especially like the usage section showing the locations of all the files, which is truly commendable. Great job folks!

  4. There should be a little insight into the various explanatory variables, such as trestbps and restecg, for viewers with a non-medical background. This would increase the readability of the overall project and help people understand the work better.

  5. While the overall report is presented in a clean and effective way, I feel that the references are too cluttered and could be organized in a better way (like adding numbers, or handling spacing/indenting).
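The per-class correlation idea in point 2 can be sketched as follows. This is a standard-library illustration of the numbers that would feed such a heatmap, not the project's code; the helper names, the `target` and `max_heart_rate` column names, and the rows themselves are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_class(rows, features, target):
    """Map each target value to {(feature_a, feature_b): correlation}."""
    groups = {}
    for row in rows:
        groups.setdefault(row[target], []).append(row)

    matrices = {}
    for label, members in groups.items():
        matrix = {}
        for i, a in enumerate(features):
            for b in features[i + 1:]:
                matrix[(a, b)] = pearson(
                    [m[a] for m in members], [m[b] for m in members]
                )
        matrices[label] = matrix
    return matrices


# Made-up rows; "target" (1 = heart disease, 0 = none) is a hypothetical column name.
rows = [
    {"age": 63, "max_heart_rate": 150, "trestbps": 145, "target": 1},
    {"age": 67, "max_heart_rate": 108, "trestbps": 160, "target": 1},
    {"age": 58, "max_heart_rate": 120, "trestbps": 130, "target": 1},
    {"age": 60, "max_heart_rate": 132, "trestbps": 140, "target": 1},
    {"age": 41, "max_heart_rate": 172, "trestbps": 130, "target": 0},
    {"age": 44, "max_heart_rate": 178, "trestbps": 120, "target": 0},
    {"age": 52, "max_heart_rate": 162, "trestbps": 125, "target": 0},
    {"age": 48, "max_heart_rate": 168, "trestbps": 135, "target": 0},
]

matrices = correlation_by_class(rows, ["age", "max_heart_rate", "trestbps"], "target")
for label in sorted(matrices):
    for (a, b), r in matrices[label].items():
        print(f"target={label}: corr({a}, {b}) = {r:.2f}")
```

Each per-class dictionary could then be passed to a plotting library to draw one heatmap per class, side by side.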

Really loved the idea and the overall report. There is great scope in this project.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

stepanz25 commented 1 year ago

Feedback Review

1.

Hello @Qazim1,

Thank you for the valuable feedback you provided to our group. Your opinion is very important to us, and we would like to address some of the points you mentioned above, in particular the issue with the missing values. We agree that there may have been some confusion about how our collaborators dealt with missing values. The Cleveland data set that we are working with was partially pre-processed, as it is one of the data sets most commonly used by the data science community; however, some missing values were still present. Our strategy was to impute those missing values with 'NaN' and subsequently drop the rows containing them, as there weren't many. The rationale behind imputing and dropping the missing values is that their presence in our dataset could introduce errors and biases into the statistical analysis we wanted to conduct. We have added more clarification on this to our README.md file, which we hope makes this clearer. Thank you for reviewing our project!

Correction Commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/0670082a3808c6ec837cddd95f2296733a4f9a46
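The strategy described in this response (mark missing entries as NaN, then drop the affected rows) can be sketched as below. This is a standard-library illustration rather than the project's pre-processing script: the `?` missing-value marker, the helper name, and the sample rows are assumptions.

```python
import csv
import io
import math

MISSING_MARKER = "?"  # assumed marker for missing entries; not stated in this thread


def load_complete_rows(csv_text):
    """Parse CSV text, convert missing markers to NaN, and drop incomplete rows.

    Returns (complete_rows, number_of_dropped_rows).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    complete, dropped = [], 0
    for raw in reader:
        row = {
            key: math.nan if value.strip() == MISSING_MARKER else float(value)
            for key, value in raw.items()
        }
        if any(math.isnan(v) for v in row.values()):
            dropped += 1
        else:
            complete.append(row)
    return complete, dropped


# Made-up sample with one incomplete row.
sample = """age,trestbps,restecg
63,145,2
67,160,?
41,130,0
"""

rows, dropped = load_complete_rows(sample)
print(f"kept {len(rows)} rows, dropped {dropped}")
```

Reporting the number of dropped rows alongside the cleaned data makes it easier for a reader to reconcile null counts seen during EDA with the final analysis.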

2.

Hello @CChCheChen,

Thank you for your valuable feedback on our team project. We would like to address some of the comments you made above, in particular the point that the content of the project might be difficult to understand for someone from a non-health or non-biology background. To improve our project, we added more glossary-style explanations of medical terms and abbreviations to our report, so that our intended audience can read and understand it more easily. To address the ones you mentioned: ECG, or electrocardiography, is a diagnostic test that measures the electrical activity of the heart. ST depression is a term used to describe a specific pattern seen on an ECG, which can be a sign of heart disease. Providing clear explanations of these terms is important for making the project accessible and comprehensible to a wider audience. Thank you for reviewing our project.

Correction Commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/49211f799fd91e5c3a32cd85a9912456244b23b7

3.

Hello @tanmayag97,

Thank you for your valuable feedback to our team. We would like to address some of the points you mentioned in your review. We agree that the explanatory variables used in this study might not be very familiar to people from a non-biology background. Therefore, we have decided to include more detailed explanations in our report. The explanatory variables in this project are the medical measurements or characteristics of the study participants.

Here is a brief overview of some of the ones you mentioned above:

We hope this will provide you with some context for the variables used in this project. Please see the correction commit if you require further clarifications or feel free to ask. Thank you for reviewing our project.

Correction Commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/ce0495f478c698151b6488724c7eeb68f56eb370

4.

Hello @shaunhutch,

Thank you for the suggestion! When presenting data in a report, it is always a good idea to provide background information on the variables being shown. We agree that with this information, readers can quickly understand the content of a chart without needing to refer back to the data or the results section of the report. Therefore, we have added background information on the explanatory variables at the top of the report. As you suggested, we also added a short description of each variable's type, continuous or categorical, right before the graphs. We placed it before the graphs rather than in the titles because we didn't want to clutter the graphs themselves. Thank you for reviewing our project, and feel free to contact us if you have any further questions.

Correction commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/a6e22f674677ff4a11f264c61c815304a50027d6

5.

Thank you for your additional feedback, @shaunhutch. It is important for users to be able to navigate our project repository easily, and organizing files into subfolders makes it easier to find what you're looking for. This is especially true for larger projects with lots of files. We agree that, for our analysis project, subfolders for different stages of the analysis, such as data exploration (EDA) and modeling, are helpful. This way, you can easily find the files you need without having to open them all or refer to additional documentation. Therefore, we decided to split our results folder further into EDA and analysis portions. Thank you for your feedback.

Correction commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/b102bc6ee786285e98877f20283192e1a0ee8b9a