UBC-MDS / data-analysis-review-2022


Submission: GROUP 2: Diabetes Prediction #18

roanraina opened this issue 1 year ago

roanraina commented 1 year ago

Submitting authors: @roanraina @austin-shih @mehdi-naji @florawendy19

Repository: https://github.com/UBC-MDS/diabetes_prediction

Report link: https://github.com/UBC-MDS/diabetes_prediction/blob/main/doc/diabetes_report.md

Abstract/executive summary: The prevalence and risk of diabetes is a major health concern around the world. Various factors, including lifestyle, diet, and health information, can facilitate diagnosis of this disease. Thanks to advances in data availability, modern data analysis techniques can be employed to speed up and improve the accuracy of disease diagnosis. In this report, we discuss our first attempt at predicting the diagnosis of diabetes using standard machine learning methods. It is worth noting that this project is not original scientific research, and its results should not be used in practice or generalized. It is simply a team exercise to practice what we have learned in the MDS program at UBC.

Editor: @flor14

Reviewers: @tieandrews @BruceUBC @rkrishnan-arjun @Althrun-sun

tieandrews commented 1 year ago

Data analysis review checklist

Reviewer: tieandrews

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Feedback:

1. The introduction was well done and very useful for providing the context, magnitude, and scale behind the WHY of this approach. It may have been worth setting a target metric goal/value and quantifying the impact it could have in dollars saved, lives saved, etc., so that when you get a model you can say whether it is "good enough".
2. In the report, under Analysis, it says you prioritize F1 and recall to reduce false positives; optimizing for recall reduces false NEGATIVES, i.e. we don't want to miss a diagnosis. You state recall correctly in the EDA section, but in your conclusion you again describe recall as reducing false positives.
3. In the EDA plots for Age, Education, etc., the "oscillating" shape of the density plots is misleading, as it is only an artifact of the data being binned; the bandwidth of the density plots should be increased to remove it. Also, overlapping filled plots aren't easy to read; perhaps switch to lines, with the means drawn as vertical lines, to support your conclusions in the discussion.
4. I would include each model's precision-recall curve on one plot and report the AUC score for each, as the default threshold may be skewing your recall results (see the first sketch below).
5. For scripts like prediction.py, break the steps up into standalone functions. This habit will make testing much easier in the future, and you can then reuse those functions in notebooks by importing them from the script, e.g. from src.predict import train_svc, predict_svc (see the second sketch below).
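For point 4, a minimal sketch of overlaying precision-recall curves with scikit-learn's PrecisionRecallDisplay; the dataset, split, and model choices here are stand-ins, not the project's actual code:

```python
# Hypothetical sketch: every model's precision-recall curve on one set of
# axes, with average precision (PR AUC) reported per model in the legend.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# stand-in imbalanced data; the project would use its own train/test split
X, y = make_classification(n_samples=1000, weights=[0.85], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

fig, ax = plt.subplots()
for name, model in models.items():
    model.fit(X_train, y_train)
    # computes the curve over all thresholds and adds the AP score
    PrecisionRecallDisplay.from_estimator(model, X_test, y_test,
                                          name=name, ax=ax)
ax.set_title("Precision-recall curves (test set)")
plt.show()
```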
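And for point 5, a minimal sketch of what prediction.py could look like after the refactor, assuming an SVC pipeline; the function names follow the import example above, while the pipeline details are placeholders:

```python
# src/predict.py -- hypothetical layout after the refactor in point 5.
# Each step is a standalone, importable function; the pipeline details
# are placeholders, not the project's actual code.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_svc(X_train, y_train, C=1.0):
    """Fit a scaled SVC pipeline and return the fitted model."""
    model = make_pipeline(StandardScaler(), SVC(C=C))
    model.fit(X_train, y_train)
    return model


def predict_svc(model, X_new):
    """Return class predictions for new observations."""
    return model.predict(X_new)


def main():
    # thin entry point: load data, call the functions above, save results
    ...


if __name__ == "__main__":
    main()
```

In a notebook, the same steps are then one import away: from src.predict import train_svc, predict_svc.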

Great work and an interesting report!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Althrun-sun commented 1 year ago

Data analysis review checklist

Reviewer: Althrun-sun

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

1. The descriptive analysis of the dataset is well done: they calculated the diagnosed diabetes percentage, the undiagnosed diabetes percentage, and the total diabetes percentage, each with a 95% CI. This is one of their biggest bright spots.
2. They used many models for comparative experiments, adopting Dummy, Decision Tree, KNN, RBF SVM, and Logistic Regression. This gives a better and more comprehensive picture of how the data affects each model, while making the report more convincing.
3. The area plots used in the EDA section left a deep impression, but the colour choices are not ideal; the palette is not distinct enough to read easily.
4. The Results section is imperfect because too few evaluation metrics are used. The confusion matrix is the most intuitive way to explain model performance, and the easiest metric to explain to people from different backgrounds, but they do not include one (see the sketch below).
5. The project overview is well done; it vividly and clearly explains the problem, the solution, and the overall framework.
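For point 4, a minimal sketch of adding a confusion matrix with scikit-learn; the data and model here are stand-ins, not the project's actual code:

```python
# Hypothetical sketch: report a confusion matrix alongside the other
# metrics in the Results section. Data and model are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.85], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# raw counts: rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, model.predict(X_test)))

# or render it as a figure, which is easier to read for a lay audience
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
```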

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

BruceUBC commented 1 year ago

Data analysis review checklist

Reviewer: BruceUBC

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hrs

Review Comments:

  1. The project is complete, and someone other than the authors can easily understand its goal and fundamental steps. All of the scripts have functions defined clearly and correctly. The whole report is well organized, including every necessary part in a meaningful order.
  2. The introduction of the report could be shorter; otherwise, readers may get distracted working through all of its data and tables. In my opinion, the most significant parts are the analysis and results that follow the introduction. Specifically, some of the tables in the introduction could be removed or reduced in size.
  3. The main EDA plot could be smaller. For the categorical variables, it may be beneficial to plot only the counts of 0 and 1 in the middle of the histogram, making it more convenient for readers to compare the effect of each categorical variable on the target. For the numeric variables, a correlation table is a good tool for giving a direct view of the data, if feasible.
  4. The results table could be improved by listing the two machine learning methods in two distinct columns and the metrics in three different rows (see the sketch after this list). By the way, NA is not a good name for a row, nor is the '...1' in the corner. :)
  5. Some of the functions in the scripts could use additional documentation to help others understand the process more clearly. For instance, noting the purpose of steps such as creating the transformer and the classifier would be appreciated by readers who are not familiar with the field.
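As a sketch for point 4, assuming the scores live in a flat pandas data frame (the column names and numbers below are made up for illustration), a single pivot produces the models-as-columns, metrics-as-rows layout:

```python
# Hypothetical sketch: pivot a flat scores table so each model is a
# column and each metric a row. Names and numbers are made up here.
import pandas as pd

scores = pd.DataFrame({
    "model": ["SVC"] * 3 + ["Logistic Regression"] * 3,
    "metric": ["precision", "recall", "f1"] * 2,
    "value": [0.61, 0.78, 0.68, 0.59, 0.80, 0.68],
})

# metrics down the rows, one named column per model -- no 'NA' or '...1'
table = scores.pivot(index="metric", columns="model", values="value")
print(table)
```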

It is satisfying to read your project, and I have also learned a lot from you. Great job!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

rkrishnan-arjun commented 1 year ago

Data analysis review checklist

Reviewer: rkrishnan-arjun

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours.

Review Comments:


The introduction and the research question were presented well by the team, and it is indeed a very interesting and crucial problem that machine learning could alleviate to a certain extent. The team has done well in setting expectations, establishing the necessity of addressing the problem, and explaining how machine learning can help. They've used a collection of tables and plots that convey their results and highlight trends based on feature-target interactions.

Possible areas for improvement:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

florawendy19 commented 1 year ago
  1. Feedback from @BruceUBC:

Thank you for your feedback. You suggested shortening the introduction of the report so that it does not distract readers, since the analysis and the results of the study are the most important parts. We understand the point and agree with it, which is why we have fixed the introduction and kept it short and concise. The commit with that change can be found here

  2. Feedback from @tieandrews:

Thank you for your feedback. You pointed out that the definition of recall in the EDA section of the study and in the results section were inconsistent. This is correct: our study is interested in predicting diabetes, and we are not overly concerned with false positives, which is why we prioritize recall. We have updated the definition of recall in the EDA and in the analysis so that they no longer conflict. The change can be found here

  3. Feedback from @rkrishnan-arjun:

Thank you for your feedback. You highlighted that the scores for SVC and for logistic regression were not correctly reported, and you also suggested that we improve the concluding statements. You are right: there was a mismatch between the SVC and logistic regression scores. We have fixed that and also improved the concluding statements as suggested; the proof of those changes can be found here

  4. Feedback from @Althrun-sun:

Thank you for your feedback. You mentioned that the report is not perfect, and that there are too few evaluation metrics in the results section. We understand and agree with your feedback; we have reviewed the whole section, improved a couple of points in the report, and added a limitations section to capture the limitations of our study. You can view our change here

  5. Feedback from @BruceUBC:

The reviewer mentioned that our report lacks captions, and that where captions exist, they are not numbered properly. We understand how crucial this is, as it makes our work more organized and helps the reader follow the flow of our work. We have taken that feedback into account and labelled all the plots with accurate numbers and captions; this can be found here