UBC-MDS / data-analysis-review-2022


Submission: Group 14: maternal_health_risk_predictor #13

Open wakesyracuse7 opened 1 year ago

wakesyracuse7 commented 1 year ago

Submitting authors: @wakesyracuse7, @lennonay, @shlrley

Repository: https://github.com/UBC-MDS/maternal_health_risk_predictor Report link: https://github.com/UBC-MDS/maternal_health_risk_predictor/blob/main/doc/final_report.md Abstract/executive summary:

In this project, we propose a Decision Tree classification model to predict whether an individual may be at low, mid, or high maternal health risk given some information about their age and health. Our final chosen model had a max depth of 29, and performed relatively well on unseen data with 203 observations. The test score was 0.823, with 53 out of 60 high risk targets predicted correctly. However, further steps can be taken to improve the model, such as tuning other hyperparameters or grouping the target classes into high risk and 'other'.

The full data set was sourced from the UCI Machine Learning Repository (Dua and Graff 2017), and can be found here. A .csv format of the data can be directly downloaded using this link. The data can be attributed to Marzia Ahmed (Daffodil International University, Dhaka, Bangladesh) and Mohammod Kashem (Dhaka University of Science and Technology, Gazipur, Bangladesh) (Ahmed and Kashem, 2020).

Editor: @flor14
Reviewers: HanChen Wang, Yukun Zhang, Daniel Cairns

hcwang24 commented 1 year ago

Data analysis review checklist

Reviewer: hcwang24

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 3 hours.

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Hi, Group 14,

The team of Au-Yeung, Wang, and Zhang performs a machine-learning (ML) study aiming to classify pregnant women into high-, medium-, and low-risk groups for maternal mortality based on some health measurements (such as age, blood pressure, glucose levels, etc.). In this study, they started by performing exploratory data analysis, experimented with classification using multiple ML models, and proposed the best estimator to be a DecisionTreeClassifier with max_depth=29. The proposed model has a mean score of 0.823 when predicting on the test data set. This is a good test score given that there isn't a significant class imbalance issue.

Below is some constructive feedback, listed in what I think is the order of importance, that the authors may want to consider.

  1. The justification for choosing DecisionTreeClassifier is not comprehensive. Perhaps the authors can first discuss what each model does when making predictions and then explain why DecisionTreeClassifier is a more appropriate estimator than, for example, SVM.
  2. The analysis of the results is somewhat incomplete. The authors mention in Future Directions that they could make the classes binary by merging the high-medium or medium-low risk groups. Is it possible to do this before the final report and thereby improve the model, using precision, recall, or f1-score as the scoring metric?
  3. If choosing another scoring metric, which score should the model maximize, and why?
  4. The background on the risk levels is lacking. I understand that this might be outside the authors' expertise, but please point readers to a resource that explains what high, medium, and low risk indicate. For example, if a woman has a medium risk of maternal mortality, what does that mean?
  5. After the EDA section, can the authors explain any decisions made to transform the data (standardizing, imputing, or any unit conversions)?
  6. Figure 3 shows the correlations, but it is very difficult to tell which pairs of features are correlated. Perhaps the authors can include a table of the correlation coefficients to help with interpretation (see the sketch after this list).
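
To illustrate point 6, here is a minimal sketch (not from the authors' repository) of how a correlation table could be produced with pandas; the file path and column names are my assumptions about the raw UCI csv layout.

import pandas as pd

# Rough sketch, not the authors' code: table of Pearson correlation
# coefficients between the numeric predictors (assumed column names)
raw = pd.read_csv("data/raw/maternal_risk.csv")
numeric_cols = ["Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate"]
corr_table = raw[numeric_cols].corr().round(2)
print(corr_table)

A table like this would make it much easier to confirm which pairs (e.g., SystolicBP and DiastolicBP) are actually correlated.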

Overall, I congratulate the authors on successfully building a model from start to finish, and for delivering the results in an easy-to-follow manner. With a few tweaks, I believe this project meets the standards of a comprehensive analysis of predicting maternal mortality risks using collected health metrics. Well done!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

yukunzGIT commented 1 year ago

Data analysis review checklist

Reviewer: yukunzGIT

Reviewer name: Yukun Edward Zhang

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Good job, team 14! It is a really interesting question to study three groups of women with different risks of maternal mortality based on the health measurement features you selected, and I believe answering this question can definitely have a positive impact on medical research. Here are my suggestions and comments for your project:

  1. The authors can consider documenting all their scripts more thoroughly (for example, in download_data.py, add a docstring to the main function; in pre_processing.py, add comments for each part of the code, like # test, # export data, ...).
  2. For the contributing guidelines, the authors can consider adding instructions for others to report software issues or other problems.
  3. For EDA Figure 3, the authors could show only the plots for SystolicBP and DiastolicBP, which are highly correlated compared to the other pairs of predictors, and remove the insignificant correlation plots to make the figure easier to read.
  4. For each script, the authors can consider removing unused imported modules, like ConfusionMatrixDisplay in fit_maternal_risk_predict_model.py, and making sure all modules do what they should do. (I am not sure why the authors comment out the code for create_confusionmatrix in fit_maternal_risk_predict_model.py.)
  5. Given that the data set is relatively small (<1000 observations for train_df), the authors might consider addressing this in the limitations of the analysis.
  6. The overall results seem unfinished, since the authors comment out the code for create_confusionmatrix and the scores on X_test in fit_maternal_risk_predict_model.py. The authors show the mean cross-validation score of 0.823 for the best optimized decision tree model, but what about the overall score on the TEST data? Without knowing this, we might have a serious overfitting problem, and the optimized decision tree model may not generalize well in practice.
  7. As the authors mention using other scoring metrics in their future directions, I would suggest that, before milestone 4, they compare at least one other scoring metric (f1-score is a good choice) against the accuracy metric currently used (a sketch follows this list).
  8. The authors can further consider justifying why the selected features and the decision tree model were used.
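
To make points 6 and 7 concrete, here is a hypothetical sketch (not the authors' code) of how accuracy and macro-averaged f1 could be compared on the same cross-validation folds, and how the held-out test score could be reported; the csv path, the RiskLevel column name, and the split settings are assumptions on my part.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumed path and target column name for the raw UCI data
df = pd.read_csv("data/raw/maternal_risk.csv")
X, y = df.drop(columns=["RiskLevel"]), df["RiskLevel"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

model = DecisionTreeClassifier(max_depth=29, random_state=123)

# Compare accuracy with macro-averaged f1 on the same folds (point 7)
cv_results = cross_validate(model, X_train, y_train, cv=5,
                            scoring=["accuracy", "f1_macro"])
print(pd.DataFrame(cv_results)[["test_accuracy", "test_f1_macro"]].mean())

# Report the overall score on the held-out test split (point 6)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))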

Overall, the project conveys a clear flow of thought and accurate data processing. It is not easy to build a project from the ground up, and with some effort in future updates, I believe this project will provide insightful information and results for the medical community. Nice job!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

DanielCairns commented 1 year ago

Data analysis review checklist

Reviewer: @DanielCairns

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Code quality is very strong

Reproducibility

Analysis report

Estimated hours spent reviewing: 3

Review Comments:

This project is well put together and it is obvious that the group understands the typical data science workflows. It's worth noting that this group only has 3 members instead of 4, but this has not diminished the quality of the submission. I was specifically impressed by the quality of the code across the board. Clear, modular, and easy to follow, plus error handling and function testing! Excellent work! I also appreciate that the group chose a dataset with the potential to bring good to the world - I think it's important to make time to explore data like this when possible.

I struggled to get the automated scripts to run. Unfortunately, download_data.py, eda_script.py, and rendering the final report all failed to run successfully on my Ubuntu Linux machine, even after installing all the listed packages in a new conda environment. I copy-pasted the commands directly from the usage section but did not troubleshoot further when they failed. I've included the commands I ran and any console output at the end of this comment if you'd like to investigate further.

Regarding your conclusions, I would challenge your assertion that the Decision Tree Classifier is the best model to use here, despite it getting the best cross-validation scores. Taking 6 features to a depth of 29 means we're revisiting the same features a lot, and I suspect the model might suffer from overfitting as a result. There was a large difference between training and validation scores for your decision tree classifier. I was more inclined to choose the SVM model, which did not suffer from the same overfitting, even though it cost you some accuracy. Also, as you discussed, we'd probably want to choose a different scoring function anyway, which might change the equation again.
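
For what it's worth, a quick way to check this gap (a rough sketch under my own assumptions about the csv path and the RiskLevel column name, not your code) is to request training scores alongside the cross-validation scores for both models:

import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data/raw/maternal_risk.csv")  # assumed path
X, y = df.drop(columns=["RiskLevel"]), df["RiskLevel"]  # assumed target column

models = {
    "decision tree (max_depth=29)": DecisionTreeClassifier(max_depth=29, random_state=123),
    "RBF SVM": make_pipeline(StandardScaler(), SVC()),
}

# A large train/validation gap for the tree, and a small one for the SVM,
# would support the overfitting concern above.
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    print(f"{name}: train={scores['train_score'].mean():.3f}, "
          f"cv={scores['test_score'].mean():.3f}")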

Other thoughts I had while reviewing - feel free to take 'em or leave 'em :)

  1. I like the density distribution plots in your EDA a lot. They clearly show that the distribution differs between the classes for almost every feature; you should be able to train a strong classifier as a result.

  2. You should be careful about making claims about the application of your model and findings to women's health in general because the sample is limited to a few years in a specific geographic region (rural Bangladesh).

  3. In the report, you mention "The R programming language was used to perform the analysis", but all of your analysis is done in Python as far as I can see.

  4. Somewhere in your report I would have liked to see some indication of the severity of each risk level. How bad is "high", for example? How much worse is "high" than "mid"?

  5. Because these are health-related outcomes, you might want to choose a scoring metric that penalizes missing high-risk patients more heavily; better to end up with an "overly pessimistic" model than an "optimistic" one.

  6. You have the right idea in the "Future Directions" section, but keep in mind that we can maximize the "recall" of high-risk patients by training a dummy model that always predicts "high risk". This is why f1 score is usually the alternative when we're worried about class imbalance and type 2 errors for a particular class.

  7. Your "standard" classification model doesn't take into account the ordinality of the classes. It is more of a mistake to misclassify a High Risk patient as Low Risk than Medium Risk, and if this happens often (it doesn't on the test set - those are some big red flags for the model). This application might call for a custom scoring function.

  8. Alternatively, you might even consider encoding, say, 'Low Risk' as 0, 'Med Risk' as 0.5, and 'High Risk' as 1, and then training a regression model instead. Obviously, since the data isn't set up this way, tread with caution, but I like the idea of a continuous prediction scale instead of a discrete one.
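
On points 7 and 8, here is a minimal sketch of what a custom, ordinality-aware scorer could look like; the class label strings are my assumption about how RiskLevel is encoded, and the costs are for illustration only.

import numpy as np
import pandas as pd
from sklearn.metrics import make_scorer

# Map the ordered classes to integers so prediction "distance" reflects severity
ORDER = {"low risk": 0, "mid risk": 1, "high risk": 2}  # assumed label strings

def ordinal_cost(y_true, y_pred):
    # Misclassifying high risk as low risk (distance 2) costs twice as much
    # as confusing adjacent classes (distance 1)
    true_levels = pd.Series(y_true).map(ORDER).to_numpy()
    pred_levels = pd.Series(y_pred).map(ORDER).to_numpy()
    return np.mean(np.abs(true_levels - pred_levels))

# Lower cost is better, so tell scikit-learn to negate it when maximizing
ordinal_scorer = make_scorer(ordinal_cost, greater_is_better=False)

A scorer like this can be passed to cross_validate or GridSearchCV through the scoring argument, so the hyperparameter search itself prefers models that avoid the most severe confusions.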

Failed download_data.py script

$ python src/download_data.py --out_type='csv' --url='https://archive.ics.uci.edu/ml/machine-learning-databases/00639/Maternal%20Health%20Risk%20Data%20Set.csv' --out_file='data/raw/maternal_risk.csv'
Usage: src/down_data.py --out_type=<out_type> --url=<url> --out_file=<out_file>
Options:
--out_type=<out_type>    Type of file to write locally (script supports either feather or csv)
--url=<url>              URL from where to download the data (must be in standard csv format)
--out_file=<out_file>    Path (including filename) of where to locally write the file

Failed eda_script.py script

$ python src/eda_script.py --data_location='data/raw/maternal_risk.csv' --output_location='src/maternal_risk_eda_figures/'
Usage: src/eda_script.py --data_location=<data_location> --output_location=<output_location>
Options:
--data_location=<data_location>    Location of the data to be used for eda
output_location=<output_location>  Location to output the visulisations

Failed render script

$ Rscript -e "rmarkdown::render('doc/final_report.Rmd')"

processing file: final_report.Rmd
  |.....                                                                 |   7%
  ordinary text without R code

  |.........                                                             |  13%
label: setup (with options) 
List of 1
 $ include: logi FALSE

  |..............                                                        |  20%
  ordinary text without R code

  |...................                                                   |  27%
label: unnamed-chunk-1 (with options) 
List of 2
 $ fig.align: chr "center"
 $ fig.cap  : chr "Figure 1. Counts of observation for each class in train data set"

  |.......................                                               |  33%
  ordinary text without R code

  |............................                                          |  40%
label: unnamed-chunk-2 (with options) 
List of 2
 $ fig.align: chr "center"
 $ fig.cap  : chr "Figure 2. Distribution of training set predictors for high risk, mid risk and low risk"

  |.................................                                     |  47%
  ordinary text without R code

  |.....................................                                 |  53%
label: unnamed-chunk-3 (with options) 
List of 2
 $ fig.align: chr "center"
 $ fig.cap  : chr "Figure 3. Pairwise relationship between predictors"

  |..........................................                            |  60%
  ordinary text without R code

  |...............................................                       |  67%
label: load data
  |...................................................                   |  73%
  ordinary text without R code

  |........................................................              |  80%
label: unnamed-chunk-4 (with options) 
List of 3
 $ fig.align: chr "center"
 $ fig.cap  : chr "Figure 4. Pairwise relationship between predictors"
 $ out.width: chr "50%"

  |.............................................................         |  87%
  ordinary text without R code

  |.................................................................     |  93%
label: confusion_matrix
Quitting from lines 97-99 (final_report.Rmd) 
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  duplicate 'row.names' are not allowed
Calls: <Anonymous> ... eval_with_user_handlers -> eval -> eval -> read.csv -> read.table

Execution halted

Attribution

Based on MDS data analysis checklist, which was derived from the JOSE review checklist and the ROpenSci review checklist.

lennonay commented 1 year ago

Peer review feedback:

Commit 1

Commit 2

Commit 3

Commit 4

Commit 5