UBC-MDS / data-analysis-review-2022


Submission: 17: heart_disease_predictor #2

Open elenagan opened 1 year ago

elenagan commented 1 year ago

Submitting authors: @Natalie-cho @Yurui-Feng @elenagan @tzoght

Repository: https://github.com/UBC-MDS/heart_disease_predictor

Report link: https://github.com/UBC-MDS/heart_disease_predictor/blob/main/book.pdf

Abstract/executive summary: Responsible for 16% of the world's total deaths in 2019, heart disease is the world's leading cause of death according to the World Health Organization. The development of heart disease cannot be attributed to a single factor in isolation, which makes early detection difficult given the many risk factors involved.

The goal of this project is to use the Heart Disease UCI dataset from the UC Irvine Machine Learning Repository to answer the question: given common early signs and physiological indicators such as chest pain, blood pressure, or resting ECG, can we predict the presence of heart disease?

Answering this question may aid in the early detection of heart disease and support earlier treatment, which is crucial to improving an individual's chances of survival.

Editor: @flor14

Reviewers: Luke Yang, Caesar Wong, Xinru Lu, Manvir Kohli

lukeyf commented 1 year ago

General

Hello, Group 17. Congratulations on your work on this heart disease predictor. Below are my comments on your project!

Data analysis review checklist

Reviewer: @lukeyf

Conflict of interest

Code of Conduct

General checks

Comments:

The src directory concisely contains the four files used in the analysis pipeline. The structure is clear and no files are nested too deeply from the project root.

Documentation

Code quality

Comments:

Yep. Functions are well-written and well-documented. The scripts are modular with helper functions.

Reproducibility

Comments:

The source code in src is clear about which file to call. I was able to execute everything up to the analysis, but when I tried to generate the report it returned the error pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded. I am not sure whether this is specific to my machine, so if others hit a similar problem please note it.

Analysis report

Comments:

The writing was coherent and concise. The EDA was not overwhelming and the results are clear. However, I noticed that in your book.pdf one of the tables is cut off because it is too long. I suggest removing some of the unnecessary content, such as the standard deviations, to show only the mean test/train scores.
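A minimal sketch of how such a table could be trimmed before rendering, assuming the cross-validation summary lives in a pandas DataFrame (the row labels and values here are made up, not the project's actual results):

```python
import pandas as pd

# Hypothetical cross-validation summary: rows are statistics, columns are models.
cv_results = pd.DataFrame(
    {
        "LogisticRegression": [0.86, 0.02, 0.84, 0.03],
        "SVC": [0.88, 0.01, 0.85, 0.02],
    },
    index=["mean_train_score", "std_train_score", "mean_test_score", "std_test_score"],
)

# Keep only the mean train/test scores so the table fits on the page.
trimmed = cv_results.loc[["mean_train_score", "mean_test_score"]]
print(trimmed)
```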

Estimated hours spent reviewing: 1

Review Comments:

The comments can be summarized in the points below:

Overall, the project is in good shape and close to completion. The scripts are very solid and the analysis is quite insightful. There are a few things I mentioned in the previous comments; if you have time, please consider addressing them.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

caesarw0 commented 1 year ago

Data analysis review checklist

Reviewer: @caesarw0

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:


Note: the whole evaluation is done using the Milestone 3 release version of the project repository (tag: 0.2.0, commit: 4ba76b9)

  1. First of all, I really appreciate the level of detail and effort your team has put into this project. The whole project is well constructed, with comprehensive materials and extra components. I personally like the code documentation and the linkage between the different modules.

  2. I notice there are two functions, model and test, in src/model.py. Since these correspond to the two main stages of the machine learning pipeline, I would suggest splitting model training and model testing into two separate scripts. Decoupling the two lets the user choose whether to train or to test the model, which provides more flexibility (see the first sketch after this list).

  3. In terms of code modularity and readability, there is a save_chart function defined inside the eda function in src/eda.py. It would be better to separate save_chart from that function, or to create a utility module for organizing helpers like save_chart, so that other scripts can also reuse the chart-saving logic (see the second sketch after this list). This would improve code readability and scalability.

  4. In the README file, the team mentions using 4 machine learning models; however, 2 of the models are missing from the actual implementation. Perhaps more models could be included in the analysis if time permits, but personally I think 2 models are enough for this project.

  5. There is a minor issue when I run the make all command: when the repository is cloned into a path that contains spaces (e.g. C:\Users\abc\UBC MDS\DSCI 522 Workflows\git), some extra folders are generated (see below).

[Screenshot: extra folders created when the repository path contains spaces]
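For point 2, one possible shape for the decoupled training step, assuming a hypothetical train_model.py with an argparse CLI, a logistic regression estimator, and a "target" column name (the project's actual signatures and column names may differ):

```python
# train_model.py -- hypothetical standalone training script
import argparse
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression


def main(train_path, model_path):
    # Read the preprocessed training split and fit a model on it.
    train_df = pd.read_csv(train_path)
    X, y = train_df.drop(columns=["target"]), train_df["target"]
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Persist the fitted model so a separate test script can load it later.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("train_path")
    parser.add_argument("model_path")
    args = parser.parse_args()
    main(args.train_path, args.model_path)
```

A matching test_model.py would then load the pickled model and score it on the test split, so either step can be rerun independently and wired up as separate Makefile targets.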
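For point 3, a minimal sketch of pulling save_chart into a shared helper module; the function body here is an assumption based on the usual Altair saving pattern, not the project's actual implementation:

```python
# src/utils.py -- hypothetical shared helpers
import altair as alt


def save_chart(chart: alt.Chart, filename: str, scale_factor: float = 2.0) -> None:
    """Save an Altair chart; image output uses the given scale factor."""
    if filename.endswith(".html"):
        chart.save(filename)
    else:
        chart.save(filename, scale_factor=scale_factor)
```

src/eda.py (and any other script) could then do from utils import save_chart instead of redefining the helper inside eda.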

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Lorraine97 commented 1 year ago

Data analysis review checklist

Reviewer: @Lorraine97

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hr

Review Comments:


  1. In the EDA, when plotting the relationship between two categorical variables, or between a categorical variable and a numeric variable, it would be better to use mark_square, mark_rect, etc. to visualize the distribution over the values; scatter plots do not show the distribution effectively here (see the first sketch after this list).
  2. Since parameter optimization is done for the models, it might be helpful to include the specific parameters being used in the results table for model selection. In this way, we know that the models being compared are already optimized.
  3. A random idea: maybe you could use ANOVA to show that one model is significantly better than another (see the second sketch after this list)?
  4. The colors in the cross-validation result plot in the "Test Results" section of book.pdf do not seem to follow color theory. I might be wrong, but would it be better to use one hue and differentiate by saturation?
  5. Nothing else stands out to me. It is really nice work!! Good luck with the rest of the project.
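For point 1, a minimal sketch of the heatmap-style view being suggested, using made-up data and a hypothetical chest_pain_type column name:

```python
import altair as alt
import pandas as pd

# Stand-in data; the real EDA would use the training split instead.
train_df = pd.DataFrame(
    {
        "chest_pain_type": ["typical", "atypical", "non-anginal", "typical", "atypical"],
        "target": [1, 0, 0, 1, 1],
    }
)

# Counts of each categorical value per target class, shown as a shaded grid.
heatmap = (
    alt.Chart(train_df)
    .mark_rect()
    .encode(
        x="chest_pain_type:N",
        y="target:N",
        color="count():Q",
    )
)
heatmap.save("chest_pain_vs_target.html")
```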
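For point 3, one way to formalize that idea is a one-way ANOVA on the per-fold cross-validation scores of each model; the fold scores below are made up for illustration:

```python
from scipy.stats import f_oneway

# Hypothetical per-fold F1 scores from 5-fold cross-validation.
logreg_scores = [0.84, 0.86, 0.85, 0.87, 0.83]
svc_scores = [0.88, 0.87, 0.89, 0.86, 0.88]

stat, p_value = f_oneway(logreg_scores, svc_scores)
print(f"F = {stat:.3f}, p = {p_value:.3f}")
```

One caveat: cross-validation fold scores are not fully independent samples, so any p-value from this should be read as a rough guide rather than a formal test.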

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

manvirsingh96 commented 1 year ago

Data analysis review checklist

Reviewer: @manvirsingh96

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

  1. The EDA and final reports are nice and concise, with proper headings for each section. Maybe try to have less code in the EDA file and, if possible, no code in the final report.
  2. The plots used, especially in the final report, are well labelled and easy to follow and interpret.
  3. I believe the README file is not up to date, as there is a mismatch in the name of the Python script used to download the data. The README mentions running the script "download_data.py" in the src directory, but the src directory does not contain this script. Instead there is a "fetch_dataset.py", which I believe is the script meant to download the dataset.
  4. It may be helpful to include the URL needed to download the dataset in the README, and to include it when stating how to run "download_data.py". The current link in the README points to Kaggle and appears to be broken.
  5. Your report mentions there is a slight class imbalance, which is why the metric used is the F1 score. Given the problem at hand, you could also try addressing the class imbalance and using recall as the metric of choice, since you are already calculating it (see the first sketch after this list).
  6. The results from the correlation plots state that there is a correlation between "max_hr_achieved" and the target. However, I could not find this feature in the EDA; I believe you are renaming an existing feature. To avoid confusion, it may be helpful to list the final features with their names and data types either at the end of your EDA or at the beginning of the final report.
  7. Coming back to the correlation plots, the correlation method used is Spearman correlation, which I believe is meant for ordinal/ranked variables. Here, however, the correlation is being calculated between a continuous variable (max_hr_achieved) and a binary variable (heart disease vs. no heart disease), so the resulting correlation may not be interpretable. If it is, please include the reasoning behind using this metric (see the second sketch after this list).
  8. A minor suggestion would be to give a meaningful name to the final report. Currently it is named "book.pdf", which is not very intuitive, and the README does not explicitly state what the final report is called. Someone browsing the repository would find it difficult to identify the report.
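For point 5, a minimal sketch of combining both suggestions in scikit-learn, using synthetic stand-in data and a logistic regression as the example estimator (the project's actual pipeline and preprocessing may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in imbalanced data; the real project would use its training split.
X, y = make_classification(n_samples=300, weights=[0.6, 0.4], random_state=522)

# class_weight="balanced" reweights examples inversely to class frequency,
# and scoring="recall" makes recall the metric used to compare models.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
recall_scores = cross_val_score(model, X, y, scoring="recall", cv=5)
print(recall_scores.mean())
```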
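For point 7, if a correlation-style number between a continuous feature and the binary target is still wanted, the point-biserial correlation (a special case of Pearson correlation) is a commonly used alternative to Spearman in this situation; a sketch with made-up data and the hypothetical max_hr_achieved name:

```python
import numpy as np
from scipy.stats import pointbiserialr, spearmanr

rng = np.random.default_rng(522)
max_hr_achieved = rng.normal(150, 20, size=100)       # stand-in continuous feature
heart_disease = (max_hr_achieved < 145).astype(int)   # stand-in binary target

# Compare the point-biserial and Spearman estimates on the same pair of variables.
r_pb, _ = pointbiserialr(heart_disease, max_hr_achieved)
rho, _ = spearmanr(heart_disease, max_hr_achieved)
print(f"point-biserial r = {r_pb:.2f}, Spearman rho = {rho:.2f}")
```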

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

elenagan commented 1 year ago