roanraina opened 1 year ago
1) The introduction was well done and very useful in providing context on the magnitude and scale of the problem, i.e. the WHY behind this approach. It may have been worth setting a target metric goal/value and quantifying the impact it could have in dollars saved, lives saved, etc., so that when you get a model you can ask whether it is "good enough".
2) In the report, under Analysis, it says you are prioritizing f1 and recall to reduce false positives. Optimizing for recall reduces false NEGATIVES, i.e. we don't want to miss a diagnosis. You state recall correctly in the EDA section, but in your conclusion you again describe recall as reducing false positives.
3) In the EDA plots for Age, Education, etc., the "oscillating" nature of the density plot is misleading, as it is only an artifact of the data being binned; the bandwidth of the density estimate should be increased to smooth out this artifact of the underlying binning. Also, overlapping filled plots aren't easy to read; perhaps change them to lines, with mean values as vertical lines, to support your conclusions in the discussion.
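To illustrate the bandwidth point, here is a minimal sketch using SciPy's gaussian_kde on made-up binned ages (the bin width and sample size are assumptions for illustration, not the project's actual data). A narrow bandwidth produces one bump per bin, while a wider one smooths the binning artifact away:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical survey-style data: ages recorded in 5-year bins
rng = np.random.default_rng(0)
ages = rng.choice(np.arange(20, 80, 5), size=500)

grid = np.linspace(15, 85, 200)
narrow = gaussian_kde(ages, bw_method=0.1)(grid)  # oscillates at every bin
smooth = gaussian_kde(ages, bw_method=0.5)(grid)  # wider bandwidth hides the bins

def n_peaks(y):
    """Count local maxima as a rough measure of oscillation."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

print(n_peaks(narrow), n_peaks(smooth))  # narrow has many more peaks
```

The same idea applies to Altair's `transform_density`, which also accepts a `bandwidth` argument.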
4) I would include each model's precision-recall curve on one plot and report the AUC score for each, as the default threshold may be skewing your recall results.
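A sketch of what that plot could look like, using synthetic data and illustrative model settings rather than the project's actual pipeline (average precision is used here as the usual summary of the precision-recall curve):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script can run anywhere
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score

# Synthetic imbalanced dataset standing in for the diabetes data
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC (RBF)": SVC(probability=True, random_state=0),
}

fig, ax = plt.subplots()
ap_scores = {}
for name, model in models.items():
    probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    prec, rec, _ = precision_recall_curve(y_te, probs)
    ap_scores[name] = average_precision_score(y_te, probs)
    ax.plot(rec, prec, label=f"{name} (AP={ap_scores[name]:.2f})")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.legend()
fig.savefig("pr_curves.png")
```

Plotting all curves on one axes makes it obvious whether a model's low recall is a threshold problem or a ranking problem.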
5) For scripts like prediction.py, break the steps up into standalone functions. This habit will make testing much easier in the future, and you can then also reuse those functions in notebooks by importing them from the script, e.g. in a notebook you can do something like from src.predict import train_svc, predict_svc, etc.
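A minimal sketch of what that refactor could look like. Only the names train_svc and predict_svc come from the suggestion above; the function bodies and hyperparameters are assumptions:

```python
# Hypothetical refactor of prediction.py into importable functions
from sklearn.svm import SVC

def train_svc(X, y, **svc_kwargs):
    """Fit an SVC on the training data and return the fitted model."""
    model = SVC(**svc_kwargs)
    model.fit(X, y)
    return model

def predict_svc(model, X):
    """Return class predictions from a fitted SVC."""
    return model.predict(X)

if __name__ == "__main__":
    # Script entry point: runs only when executed directly, so a notebook
    # can `from src.predict import train_svc, predict_svc` without
    # triggering the whole pipeline.
    from sklearn.datasets import load_breast_cancer
    X, y = load_breast_cancer(return_X_y=True)
    model = train_svc(X, y)
    print(predict_svc(model, X[:5]))
```

The `if __name__ == "__main__":` guard is what makes the functions importable without side effects.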
Great work and an interesting report!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
1. The descriptive analysis of the dataset is well done: they calculated the diagnosed diabetes percentage (95% CI), the undiagnosed diabetes percentage (95% CI), and the total diabetes percentage (95% CI). This is one of the report's biggest bright spots.
2. They used many models for comparative experiments: Dummy, Decision Tree, KNN, RBF SVM, and Logistic Regression. This gives a better, more comprehensive picture of how the data affects each model and makes the report more convincing.
3. They used area plots in the EDA section, which made a strong impression, but the choice of colours is not great; they are not bright enough to read easily.
4. The Results section is not perfect, because too few evaluation metrics are used. The confusion matrix is the most intuitive way to explain model performance, and it is the easiest metric to explain to people from different backgrounds, but they do not include one.
5. The Project Overview is well done; it clearly explains the problem, the solution, and the overall framework.
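On point 4, a confusion matrix is cheap to add; a minimal sketch with illustrative labels (not the project's data), showing how it surfaces the false negatives the reviewers care about:

```python
# Illustrative labels only: 1 = diabetes, 0 = no diabetes
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # 2x2 matrix flattens to tn, fp, fn, tp
print(f"false negatives (missed diagnoses): {fn}")
```

sklearn also ships `ConfusionMatrixDisplay` for rendering the same matrix as a figure for the report.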
It is satisfying to read your project, and I have also learned a lot from you. Great job!
The introduction and the research question were presented well by the team, and it is indeed a very interesting and crucial problem that could be alleviated to a certain extent with machine learning. The team has done well in setting the expectations and the necessity of addressing the problem and explaining how machine learning can help. They've used a collection of tables and plots that convey their results and highlight trends based on feature target interactions.
Possible areas for improvement:
Running the cleaning script clean.py threw a bunch of warnings in the terminal. It would be good to either address these warnings or suppress them if they are not applicable. Running diabetes_eda.py failed, and I was blocked from proceeding; maybe you could consider Joel's suggestion for saving Altair plots. The scripts could also be improved by refactoring the code into functions rather than having everything done in the main function. Adding tests to ensure that the artifacts required by the following scripts are created would make sure errors are caught early on.
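A lightweight way to catch missing artifacts early: a sketch of a helper each downstream script could call before doing any work (the path shown is hypothetical, not the repo's actual layout):

```python
from pathlib import Path

def check_artifacts(paths):
    """Return the paths that are missing or empty, so the caller can fail
    fast with a clear message instead of a confusing downstream error."""
    return [p for p in paths if not (p.exists() and p.stat().st_size > 0)]

# Hypothetical usage at the top of a downstream script:
missing = check_artifacts([Path("data/processed/clean_data.csv")])  # assumed path
if missing:
    print(f"Missing artifacts, run the earlier pipeline steps first: {missing}")
```

The same helper drops straight into a pytest test, so the pipeline's contract between scripts is checked automatically.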
Image with error:
Image with long list of warnings:
The introduction and EDA take up a major chunk of the report while the other areas lack detail. I would recommend making these sections more concise while adding more information on the methods, the analysis of the models, and the results.
Although the report contains sufficient tables and figures to convey the details well, they lack captions and are not numbered correctly. For example, the first table does not have a caption; there are four tables in total but only three have captions, and the plots are missing captions entirely. Also, as it is crucial to keep track of false negatives, including the confusion matrix in the discussion section would be valuable.
The clarity of the report can be improved by making sure that statements are accurate and correct. The report mentions that you are using a Dummy Regressor as the baseline, but this is a classification problem. The raw data contains 0, 1, and 2 as the possible values for the target, so I expected this to be a multi-class classification problem, and the EDA also describes three classes; during prediction, however, you seem to have treated it as a binary classification problem. It would be good to mention the intuition and assumptions behind this change in the report. Some other improvements include correcting the author names from "truetruetruetrue", reducing typos, explaining the need to use multiple models, and having a dedicated section that highlights the assumptions and limitations of the current analysis.
The interpretation of the scores and the concluding statements could be improved by briefly restating the research question and commenting on the models' false negatives and their impact on the predictions. The statement "When it comes to the recall scores, the SVC score was 0.792 compared to the 0.792 for logistic regression." should be corrected with the right values from the table.
Thank you for your feedback. You suggested reducing the introduction of the report so that it does not distract the readers, as the analysis and the results of the study are the most important parts. We understand and agree, and that is why we have fixed the introduction by keeping it short and concise. The commit of that change can be found here
Thank you for your feedback. You pointed out that the definitions of recall in the EDA section and in the results section were not the same. This is correct: in our study we are interested in predicting diabetes and are not overly concerned with false positives, which is why we are prioritizing recall. We have updated our definition of recall in the EDA and in the analysis so that they no longer conflict. The change can be found here
Thank you for your feedback. You highlighted that the scores for SVC and logistic regression were not correctly reported, and you also suggested that we improve the concluding statements. You are right: there was a mismatch between the SVC and logistic regression scores. We have fixed that and also improved the concluding statements as suggested; the proof of those changes can be found here
Thank you for your feedback. You mentioned that the results section is not perfect and that it uses too few evaluation metrics. We understand and agree. We have reviewed the whole section, improved several points in our report, and added a limitations section to capture the limitations of our study; you can view our change here
The reviewer mentioned that our report lacks captions, and that where captions are present they are not numbered properly. We understand how crucial this is, as it makes our work more organized and helps the reader follow the flow of our work. We have taken that feedback into account and labelled all the plots with accurate numbers and captions; this can be found here
Submitting authors: @roanraina @austin-shih @mehdi-naji @florawendy19
Repository: https://github.com/UBC-MDS/diabetes_prediction Report link: https://github.com/UBC-MDS/diabetes_prediction/blob/main/doc/diabetes_report.md Abstract/executive summary: The prevalence and risk of diabetes is a major health concern for everyone around the world. Various factors, including lifestyle, diet, and health information, can facilitate diagnosis of this disease. Due to advancements in data availability, modern data analysis techniques can be employed to speed up and improve the accuracy of disease diagnosis. In this report, we discuss our first attempt at predicting the diagnosis of diabetes based on standard machine learning methods. It is worth noting that this project is not original scientific research, and its results cannot be practically used or generalized. This is simply teamwork to cultivate what we have learned in the MDS program at UBC.
Editor: @flor14 Reviewer: @tieandrews @BruceUBC @rkrishnan-arjun @Althrun-sun