Open kellywujy opened 1 year ago
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
src
folder, including their input and output visually. Well done team!data_analysis.r
has the following sections could have some improvements:
MEAN CI
hypothesis testing with permutation
chi-square testing
Histogram of trestbps
. Also the plot legend should be more clear of which class it belongs to, instead of 1 or 0.This was derived from the JOSE review checklist and the ROpenSci review checklist.
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
I tried to create the environment from the environment.yml file, however, I ran into an error. `ResolvePackageNotFound:
It looks like dataframe-image
should be dataframe_image
. In addition, I believe that docopt
should be under pip
as docopt-ng
rather than in conda as per the installation instructions we received in lecture 2.
download_file.py
has an indentation error at line 20. I believe that you need Seaborn to run the EDA_visualization.py
script which you have not listed in your dependencies. The pre_process
script seems to have to reference the wrong script.Very interesting project, it was a pleasure to review!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
The project is presented very well. I can see that the group has put in efforts to even the tiniest of details. I personally learned a lot reviewing this. The report is very well written. I particularly like how the limitations of the study are discussed as well as future improvements are addressed.
In the README file, the link to the EDA file is broken. It would be great if the issue is fixed.
The EDA charts in the report are great. However, it would help to briefly outline what these feature names mean so that they would make a bit more sense to the reader. For example, it's not immediately clear what oldpeak
or , say, thalach
represent.
A bit of column renaming for the tables in the report should help. For example, you could say Mean
rather than sample_estimate
and remove underscores too to help the reader.
In the preprocessing script, I see that some missing values are imputed while the authors say in the Dataset heading of the README that they'll be using cleaned and. processed data. It would probably be best to reconcile the two statements by explaining explicitly what has been done in the preprocessing script.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
The EDA is done in a well constructed and concise manner. However, I want to point out that there are null values when looking at dataframe.info(), however the analysis says there are no null values in the dataset. It would be great if you can rectify this minor mistake.
While the choice of EDA plots are interesting, something that can also be done is plotting a correlation heatmap for numerical features considering both targets i.e. heart disease and no heart disease. This will provide further insight into the explantory and the response variables.
Overall, the project is well organized. I, specifically really like the usage section displaying all the locations of all files which is truly commendable. Great job folks!
There should be a little insight into the various explanatory variables like trestbps, restecg etc for viewers having a non medical background. This will increase readibility of the overall project and help people understand the work better.
While the overall report is presented in a clean and effective way, I feel that the references are too cluttered and could be organized in a better way (like adding numbers, or handling spacing/indenting).
Really loved the idea and the overall report. There is great scope in this project.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Feedback Review
1.
Hello @Qazim1,
Thank you for your valuable feedback provided to our group. We would like to let you know that you opinion is very important to us and we would like to address some of the points you mentioned above. In particular, we would like to address the issue with the missing values. We totally agree that there may have been some confusion on how our collaborators dealt with missing values. The Cleveland data set that we are dealing with was partially pre-processed as it's one of the commonly used data sets used by data science community; however, there were still some missing values present. Our strategy was to impute those missing values with 'NaN' and subsequently drop the rows containing missing values as there wasn't that many. The rational behind imputation and dropping the missing values is that the presence of those values in our dataset can introduce errors and biases into the statistical analysis we wanted to conduct. We have put more clarification on this in our README.md file. Hopefully, after this it becomes more clear. Thank you for reviewing our project!
Correction Commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/0670082a3808c6ec837cddd95f2296733a4f9a46
2.
Hello @CChCheChen,
Thank you for your valuable feedback to our team project. We would like to address some of the comments that you have mentioned above. In particular, the fact that it might be difficult to understand the content of the project for someone coming from non-health/biology background. Therefore, to improve our project further we included more glossary type explanations into our report, so it would be easier for our intended audience to read and understand the content of our project. We tried to include more explanations for medical terms and abbreviations. To address the ones you have mentioned above, ECG, or electrocardiography, is a diagnostic test that measures the electrical activity of the heart. ST depression is a term used to describe a specific pattern seen on an ECG, which can be a sign of heart disease. Indeed, it's important to provide clear and understandable explanations for these terms in order to make the project accessible and comprehensible to a wider audience. Thank you for reviewing our project.
Correction Commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/49211f799fd91e5c3a32cd85a9912456244b23b7
3.
Hello @tanmayag97,
Thank you for your valuable feedback to our team. We would like to address some of the point you have mentioned in your review. We are agree that the explanatory variables used in this study might not be very familiar to people coming from non-biology background. Therefore, we have decided to include some detailed explanation in our report to bring further clarifications. The explanatory variables in this project are the medical measurements or characteristics of the study participants.
Here is a brief overview of some of the ones you mentioned above:
We hope this will provide you with some context for the variables used in this project. Please see the correction commit if you require further clarifications or feel free to ask. Thank you for reviewing our project.
Correction Commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/ce0495f478c698151b6488724c7eeb68f56eb370
4.
Hello @shaunhutch,
Thank you for the suggestion! When presenting data in a report, it is always a good idea to provide a background information on the variables that describe the data being shown. We totally agree that by providing this information, readers can quickly understand the content of the chart without needing to refer back to the data or the result section of the report itself. Therefore, we have decided to include some background information on the explanatory variables we are dealing with at the top of the report. As you suggested, we also included the short description of each type of variable whether it is continuous or categorical right before the graphs presented. We decided to do it right before the graph instead of including it in a title as we didn't want to clutter the graphs themselves. Thank you for reviewing our project and feel free to contact us if you have any further questions.
Correction commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/a6e22f674677ff4a11f264c61c815304a50027d6
5.
Thank you for you additional feedback, @shaunhutch. It is important for the users to be able to navigate our project repository easily and it can be helpful to organize files into subfolders to make it easier to find what you're looking for. This is especially true for larger projects with a lots of files. We agree that in case of our analysis project, subfolders for different stages of the analysis such as data exploration (EDA) and modeling can be helpful. This way, you could easily find the files you need without having to open them all or refer to additional documentation. Therefore, we decided to split our result folder even further into EDA and analysis portions. Thank you for your feedbacks.
Correction commit: https://github.com/UBC-MDS/inferential_study_heart_attack/commit/b102bc6ee786285e98877f20283192e1a0ee8b9a
Submitting authors: @kellywujy @stepanz25 @ZilongYi @BruceUBC
Repository: https://github.com/UBC-MDS/inferential_study_heart_attack Report link: https://github.com/UBC-MDS/inferential_study_heart_attack/blob/main/doc/heart_disease_report.Rmd Abstract/executive summary: In this project we attempt to find the association between the presence of heart disease and various demographic or health factors of the patients including age, sex, chest pain type, cholesterol levels, etc. We perform hypothesis testing using permutation for numerical variables such as age, the maximum heart rate achieved, and ST depression induced by exercise relative to rest which is considered a proven ECG finding for obstructive coronary atherosclerosis (Lanza et al., 2004). Our original data set also included some categorical variables and we conducted hypothesis testing using chi-squared test to see if these factors relate with presence of heart disease.
Editor: @flor14 Reviewer: < Hutchinson Shaun > < Lin Chen > < Agarwal Tanmay >