DSCI-310 / data-analysis-review-2021


Submission: 2: Heart Disease Prediction #2

Open ttimbers opened 2 years ago

ttimbers commented 2 years ago

Submitting authors: @eahn01 @dliviya @rpeng35 @nikizamani

Repository: https://github.com/DSCI-310/DSCI-310-Group-2

Abstract/executive summary: Common heart diseases include diseases of the blood vessels, arrhythmia (irregular beating of the heart), diseases of the heart valves and muscle, infections of the heart, and heart defects from birth. The symptoms of heart disease are often unnoticeable, and most cases are only diagnosed after a heart attack, heart failure, or stroke. In this project, we want to predict whether someone is at risk of heart disease based on the variables given in the dataset.

Editor: @ttimbers

Reviewer: @Shravan37 @AaronMKk @ChoAllan @samzzzzzh

Shravan37 commented 2 years ago

Data analysis review checklist

Reviewer: Shravan37

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 60 minutes

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The analysis has a few flaws that make it difficult to reproduce. Running the Makefile produces an error, so I could not complete the analysis by following the steps described in README.md, and as a result I could not generate the PDF or HTML report. The README could also explain each step in more detail.

  2. The documentation of the project is very good and each script file is relatively simple, allowing the reader to quickly understand what each script accomplishes. This is true for each function as well, because they are all extensively documented. The only thing that, in my opinion, would improve the documentation is to include all the functions and scripts in the src folder.

  3. The introduction and discussion could be improved. In my opinion, the introduction does not clearly state the research question or explain why the given variables are being considered. In the discussion, you could also examine which variable has the greatest effect on predicting whether someone has heart disease. The plots and figures could be explained in more depth: what each graph represents and what it means. Overall, though, the report is very good and the chosen topic is an important one.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

AaronMKk commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 60 minutes

Review Comments:


1

First, I think the attribe_box_plots.r file under "/opt/notebooks/R" and the processed-cleveland.csv file at "/opt/notebooks/data/processed-cleveland.csv" are missing, which causes errors when I run the notebook in the Docker container. I suspect this happened because many file names changed during milestone 3 but the references in predicting_heart_disease.ipynb were not updated. As a result, the .ipynb file is difficult to run interactively.

2

"In this project, we used 13 attributes to predict whether a person has heart disease or not." "We used the K-nearest neighbors algorithm for our classification". "However we found that our accuracy was low (73%)" Based on what you said in the discussion, I think this low accuracy might result from the fact that k-nearest neighbors is especially sensitive to the "curse of dimensionality": the size of the data space grows exponentially with the number of dimensions, so the points become sparse and distances between them become less informative. Logistic regression may be a better tool, considering that there are 13 explanatory variables.

3

I failed to render the "doc/heart_disease.html doc/heart_disease.pdf" files. I then tried to knit the heart_disease.rmd file locally but still got errors, such as a missing ":" after "github_document". The README could do a better job of describing each step, and the typos should be fixed.
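To make the "curse of dimensionality" remark in point 2 more concrete, here is a minimal sketch using synthetic uniform random points, not the authors' heart disease data. The function name `distance_contrast` and all parameters are illustrative assumptions. It compares the nearest and farthest distances from one query point: as the number of dimensions grows, the ratio moves toward 1, which means the "nearest" neighbours are barely closer than any other point.

```python
import math
import random


def distance_contrast(n_dims, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from one random query point.

    Points are drawn uniformly from the unit hypercube (synthetic data,
    used only to illustrate the effect).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


# 13 matches the number of attributes mentioned in the report;
# 2 and 100 are arbitrary comparison points.
for d in (2, 13, 100):
    print(d, round(distance_contrast(d), 3))
```

This only illustrates why distance-based methods can struggle as predictors are added. Whether it explains the reported 73% accuracy would need to be checked on the actual data, for example by comparing k-NN on a subset of the variables against the full set.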


ChoAllan commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:


  1. "Next we create our classifier. The only parameter we can tune is the number of classifier we have. We are tuning our classifier to get the optimal number of neighbors as to increase accuracy."

This may be a small thing, but from my understanding, the hyperparameter you are tuning is not the number of classifiers. I believe that you are tuning k, the number of nearest neighbours that are used to classify each new observation.

  2. I like the varied use of figures to help me understand the dataset better. The figures prevent me from getting lost in large amounts of text.

  3. I cannot knit the 'heart_disease.rmd' file. I get the following error: "Error in file(filename, "r", encoding = encoding): cannot open connection". I had to do additional work to debug the issue.

  4. Overall, the project is organized and well put together. Most files are put in their respective folders and named optimally. The README.md enables me to run the analysis without much tinkering.
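As a small illustration of the first point above, the sketch below shows what k controls in k-nearest neighbours and one way to choose it. It uses a hypothetical toy dataset and plain Python rather than the authors' data or code; the names `knn_predict` and `loo_accuracy` are assumptions made for this example. Each candidate k is scored with leave-one-out accuracy.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label by majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: hold out each point in turn and predict it."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(data)


# Hypothetical toy data: two well-separated clusters of four points each.
toy = [
    ((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.3), 0), ((1.1, 1.1), 0),
    ((3.0, 3.0), 1), ((3.2, 2.9), 1), ((2.8, 3.1), 1), ((3.1, 3.3), 1),
]

# k is the number of neighbours that vote, not a number of classifiers.
scores = {k: loo_accuracy(toy, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In this toy case, small values of k classify every held-out point correctly, while k = 7 uses all remaining points and the majority vote is always won by the other cluster. The same trade-off is what tuning k on real data is meant to balance.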


samzzzzzh commented 2 years ago

Data analysis review checklist

Reviewer: samzzzzzh

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: one and a half hours

Review Comments:


Before getting into the feedback, I want to say good job to all of you: the project looks really solid and I really like the research question! Being able to predict the risk of heart disease is pretty cool and could potentially save many lives. However, there are a couple of small flaws, which I have noted below. Hopefully this can help you folks with the final project!

1) First, I am having trouble reproducing the results. There seems to be an issue with running the Makefile or rendering the report to .pdf or .html. Below is what I get when I try to run the analysis following the instructions given in README.md.

"Quitting from lines 18-31 (heart_disease.rmd)
Error in file(filename, "r", encoding = encoding) : cannot open the connection
Execution halted
make: *** [Makefile:23: doc/heart_disease.md] Error 1"

I tried to knit the .rmd file locally, but errors still occurred. I think there may be some typos in the code that need to be sorted out. One way to address this is to make the instructions for running the analysis more concrete. I would assume that whoever is reading the file knows nothing about any of the software and take them through it step by step. This may help with reproducibility, or at least make the process easier for other users.

2) There seem to be some issues with the test files. The tests themselves seem fine, but there are inconsistencies in the test suite. Below is what I get when I try to run "Rscript tests/testthat.R":

"Error in library(DSCI - 310 - Group - 2) : 'package' must be of length 1 Execution halted"

I suspect that the problem is with the package name. The spacing between the parts of the name seems odd, and I could not get the test suite running. Try searching for "package must be length 1"; I am sure a solution is available online.

3) The plots in the analysis report could use a bit more work. One rule of thumb that I learned from other stats courses is to avoid pie charts, as they can be misleading. I would suggest changing "distribution_of_diagnosis.png" to a bar chart and labelling it with counts or percentages. Also, in "variable_correlation.png", some of the column names overlap the numbers, which makes the plot ambiguous and distracting to read. Either decrease the font size of the column names or move the numbers aside. Lastly, more interpretation can be added to the plots. Expand on what they represent, their implications, and their connection to the research question.

4) The discussion section mostly describes what happened in the report and merely states the results and findings. I think a bit more work could be done on interpretation, on your own opinions, and on connecting these interesting results and findings to the research question. That said, I like the future questions that the project can lead to. The shortcomings of the model are stated, and they point to clear improvements for the future. Very well done!
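On point 2 above, one plausible reading of the error is that R parses the hyphens in "DSCI-310-Group-2" as subtraction, so "library()" receives an expression rather than a single package name. The sketch below is a hypothetical helper, not part of the project, that checks a candidate name against the naming rule summarised from the "Writing R Extensions" manual: ASCII letters, digits and dots only, at least two characters, starting with a letter and not ending with a dot. The replacement name "heartdisease" is only an example.

```python
import re

# Rule summarised from "Writing R Extensions": letters, digits and dots only,
# at least two characters, starts with a letter, does not end with a dot.
R_PACKAGE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]")


def is_valid_r_package_name(name):
    """Return True if the name satisfies the R package naming rule above."""
    return R_PACKAGE_NAME.fullmatch(name) is not None


# The repository name contains hyphens, which the rule does not allow.
print(is_valid_r_package_name("DSCI-310-Group-2"))
# Hypothetical replacement name, chosen only for illustration.
print(is_valid_r_package_name("heartdisease"))
```

If this reading is right, renaming the package in the DESCRIPTION file and in tests/testthat.R to a name that passes this check would be a reasonable first thing to try.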
