README - Project Proposal

jraza19 commented 3 years ago

Please add your ideas here regarding project question/details to add to the explanation.

rachelywong commented 3 years ago

Question: Given a patient’s demographic and hospital background, can we predict if they will be readmitted to the hospital?

Type: Predictive

Importance:

- If performed improperly, diabetes management in hospitalized patients may lead to detrimental outcomes, including morbidity and mortality. Assessing readmission of diabetes patients can establish a baseline and point to protocol changes that improve patient outcomes and lower the healthcare costs of readmitting diabetes patients.
- Something about higher diabetes prevalence in lower-income groups: readmission is bad because it means higher healthcare costs for them, especially during COVID-19.

Dataset Information: This dataset was collected from 1999-2008 among 130 US hospitals and integrated delivery networks. 101,766 diabetes patient encounters were collected following these specific criteria: (1) It is an inpatient encounter (a hospital admission). (2) It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis. (3) The length of stay was at least 1 day and at most 14 days. (4) Laboratory tests were performed during the encounter. (5) Medications were administered during the encounter.

Features:

- 55 total features describing the diabetic encounters (could talk about dropping some of them, like encounter ID and patient number)
- Important features to highlight:
  - admission type (emergency, urgent, elective, etc.)
  - age
  - weight
  - Diagnosis 1 (primary diagnosis)
  - Diabetes medications (whether any diabetic medication was prescribed)
- There is a mix of categorical, numerical, and binary features.
  - can talk about transformers for each
  - imputer for missing values
  - categorical -> ordinal encoding
  - binary -> one hot encoding
  - numerical -> scaling
- class imbalance?
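
To make the transformer idea concrete, here is a rough sketch of how the per-type preprocessing could be wired together with scikit-learn. The toy rows, the three column names, the median imputation strategy, and the age-bracket ordering are assumptions for illustration only, not decisions the team has made.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy frame; column names and values are illustrative assumptions.
toy = pd.DataFrame({
    "time_in_hospital": [1, np.nan, 3, 14],                   # numerical, one missing value
    "age": ["[40-50)", "[50-60)", "[40-50)", "[60-70)"],      # ordered categories
    "diabetesMed": ["Yes", "No", "Yes", "Yes"],               # binary
})

age_order = ["[40-50)", "[50-60)", "[60-70)"]

preprocessor = make_column_transformer(
    # numerical -> impute missing values, then scale
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["time_in_hospital"]),
    # categorical -> ordinal encoding with an explicit category order
    (OrdinalEncoder(categories=[age_order]), ["age"]),
    # binary -> one hot encoding, keeping a single 0/1 column
    (OneHotEncoder(drop="if_binary"), ["diabetesMed"]),
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(toy)
print(X.shape)
```

Each transformer only sees the columns it is given, so the column lists can be adjusted once the final feature set is settled.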

jraza19 commented 3 years ago

Team decision: we decided to make the target column binary by combining the <30 and >30 values into a single "readmitted" class, so the target becomes readmitted or not.
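
A minimal sketch of that transformation in pandas, assuming the target column is named `readmitted` and holds the labels "NO", ">30", and "<30". The toy rows and the new column name `readmitted_binary` are made up for illustration.

```python
import pandas as pd

# Toy rows standing in for the real data (assumed label values).
df = pd.DataFrame({"readmitted": ["NO", ">30", "<30", "NO", ">30"]})

# Collapse both readmission windows into a single positive class.
to_binary = {"NO": "NO", ">30": "YES", "<30": "YES"}
df["readmitted_binary"] = df["readmitted"].map(to_binary)

print(df["readmitted_binary"].value_counts().to_dict())
```

Using an explicit mapping means any unexpected label turns into a missing value instead of being silently assigned to a class.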

rachelywong commented 3 years ago

Some other thoughts I had that we could add to the proposal:

features we decided to drop and why:

encounter_id
patient_nbr
weight (97% missing)
payer_code (52% missing)
medical_specialty (53% missing)
examide (100% of responses were "NO")
citoglipton (100% of responses were "NO")
race
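
Dropping the columns listed above could look roughly like this. The column names come from the list; the toy frame, the helper name `drop_uninformative`, and the assumption that missing values are encoded as "?" are illustrative only.

```python
import pandas as pd

DROP_COLUMNS = [
    "encounter_id", "patient_nbr", "weight", "payer_code",
    "medical_specialty", "examide", "citoglipton", "race",
]

def drop_uninformative(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns the team decided to drop."""
    # errors="ignore" lets the helper run on frames that lack some of the columns.
    return df.drop(columns=DROP_COLUMNS, errors="ignore")

# Toy frame (values are made up; "?" is assumed to mark a missing value).
toy = pd.DataFrame({
    "encounter_id": [1, 2],
    "weight": ["?", "?"],
    "age": ["[50-60)", "[60-70)"],
    "examide": ["No", "No"],
})

# Share of "?" entries per column, the kind of check behind the percentages above.
missing_share = toy.eq("?").mean()

cleaned = drop_uninformative(toy)
print(list(cleaned.columns))
```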

how we plan to answer our question:

use ML to determine if they will be readmitted or not
classifier
I'm not sure which classifier we should use yet. The paper used a logistic regression (multiclass), so maybe we can use LR for our two classes. I will research more and confirm this.
we can do hyperparameter optimization?
possible feature correlations?
avoiding class imbalance by changing target column to binary, also avoids multi classification?
want to see which features may be more likely to predict readmission so we can know where hospitals need to make changes

extra information about the dataset:

This dataset was taken from https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008# and covers encounters with diabetic patients from 1999-2008 in 130 hospitals across the United States of America. Research on this data was used to assess diabetic care during hospitalization and determine if patients were likely to be readmitted or not.
The paper detailing the data collection and research can be found here: https://www.hindawi.com/journals/bmri/2014/781670/
Feature descriptions about the data can be found here: https://www.hindawi.com/journals/bmri/2014/781670/tab1/

sukh2929 commented 3 years ago

I have tried to consolidate the information for the Readme file. Please review and let me know if any changes are required.

Contributors: Javairia Raza, Rachel Wong, Zhiyong Wang, Sukhdeep Kaur

Introduction/Research Question: For this project we are trying to answer the question: Given a patient’s demographic and hospital background, can we predict if they will be readmitted to the hospital? Answering this question is important because assessments of readmission of diabetes patients can be used to create a baseline and outline potential protocol changes to improve patient outcomes and lower the healthcare costs of readmitting diabetes patients.

Data set Information: The data are submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058, and a recipient of the CERNER data. This dataset was collected from 1999-2008 among 130 US hospitals and integrated delivery networks. The dataset can also be found at https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#

Exploratory Data Analysis: To answer the predictive question above, we plan to build a predictive classification model. Before building our model, we will do exploratory data analysis to check the independence of the data, find rows with NAs, drop columns that are irrelevant for prediction, examine correlations between columns, etc. Then, we will partition the data into a training and test set (split 70%:30%) and assess whether there is a strong class imbalance problem that we might need to address.
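
The split and the class balance check could be sketched as follows. The toy data, the `random_state` value, and the use of `stratify` are assumptions added for illustration; `test_size=0.3` matches the 70%:30% split described above and is the only number that would change for a different ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a made-up feature and a binary target.
X = pd.DataFrame({"num_medications": range(100)})
y = pd.Series(["NO"] * 54 + ["YES"] * 46, name="readmitted")

# stratify=y keeps the class proportions similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

# Inspect class balance on the training portion only.
print(y_train.value_counts(normalize=True).round(2).to_dict())
```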

Selecting the best model: There is a mix of categorical, numerical, and binary features, which we will transform into numerical features. We plan to use SVM and logistic regression since it is a multi-class problem. After selecting our final model, we will re-fit it on the entire training data set and then evaluate its performance on the test data set. At this point, we will look at overall accuracy as well as misclassification errors (from the confusion matrix) to assess prediction performance. We will report these values as a table in the final report.
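
As a small illustration of the evaluation step, accuracy and the confusion matrix can be computed with scikit-learn. The true and predicted labels below are made up; in the project they would come from the test set and the fitted model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only.
y_true = ["NO", "YES", "YES", "NO", "YES", "NO"]
y_pred = ["NO", "YES", "NO", "NO", "YES", "YES"]

labels = ["NO", "YES"]
# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(round(accuracy, 3))
```

The off-diagonal cells are the misclassification counts: `cm[0, 1]` is "NO" predicted as "YES" and `cm[1, 0]` is "YES" predicted as "NO".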

References:
The paper detailing the data collection and research can be found here: https://www.hindawi.com/journals/bmri/2014/781670/
Feature descriptions about the data can be found here: https://www.hindawi.com/journals/bmri/2014/781670/tab1/
This dataset was taken from https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#

rachelywong commented 3 years ago

I think this is awesome! Very well put together.

My suggestions:

Introduction: add type of question "For this project we are trying to answer the predictive question:..."
Introduction: add a sentence at the end with "Analysis with machine learning models will identify features more likely to predict patient readmission. This will be important for hospital management, because it will identify areas where changes can be made to decrease patient readmission and healthcare costs."
EDA: add "... rows having NAs or missing values, drop columns ..."
EDA: I split the data into 80% training and 20% testing
EDA: add a sentence describing one table and one figure: "Pandas Profiling will be used to generate a feature analysis, including any interactions and correlations between the features, to assist in data wrangling. After data wrangling, the EDA will provide a table showing the dropped features that are not informative for answering our question. A repeated histogram will be generated to compare the numerical features against each other, highlighting correlations and potential relationships among the features and the target."
Selecting the best model: change the first sentence to "There is a mix of categorical, numerical, and binary features for which we will apply proper transformations for use in analysis"
Selecting the best model: change wording to "We plan to test RBF SVM and LR, along with hyperparameter optimization using randomized search." We are also changing the target column from multi-class to binary, which will remove class imbalance, so add "There is potential class imbalance based on the target readmitted column having 54864 values of NO, 35545 values of >30, and 11357 values of <30. Class imbalance here can be avoided by changing the readmitted column to binary "YES" or "NO" values indicating whether the patient was readmitted. This would give us 54864 "NO" values and 46902 "YES" values, thus avoiding class imbalance. This will also allow us to perform binary classification rather than multi-class classification."
add a section about "Sharing our results"
Sharing our results: "To share the results of our analysis, we plan to generate figures summarizing our results of model performance (tested against a baseline classifier), and evaluation of features most indicative of patient readmission and features most indicative of no patient readmission. Model performance figures will also show hyperparameter optimization, tested against default hyperparameters. With our analysis and results, we hope that our model will be able to predict patient readmission using deployment data in future analysis, and identify areas for change in hospitalization management."
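
A rough sketch of how the comparison described in these suggestions could be set up: a baseline classifier, LR, and RBF SVM, with randomized search over hyperparameters. The synthetic data, the search ranges, `n_iter`, and the number of folds are all assumptions for illustration and would need to be chosen for the real data.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=123)

# Baseline: always predict the most frequent class.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()

# Candidate models with assumed search distributions.
candidates = {
    "LR": (LogisticRegression(max_iter=1000), {"C": loguniform(1e-3, 1e3)}),
    "RBF SVM": (
        SVC(kernel="rbf"),
        {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    ),
}

results = {"baseline": baseline}
for name, (model, param_distributions) in candidates.items():
    search = RandomizedSearchCV(
        model, param_distributions, n_iter=10, cv=5, random_state=123
    )
    search.fit(X, y)
    results[name] = search.best_score_  # mean cross-validated accuracy

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Comparing the tuned scores against the same models fitted with default hyperparameters would cover the "tested against default hyperparameters" part of the suggestion.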

rachelywong commented 3 years ago

Regarding @jraza19 's combined proposal:

dependencies Python 4.8.3 and Python packages: docopt==0.6.2 urllib3==1.25.11 ChainMap==3.3 os==10.15.6 tarfile==3.3 numpy==1.19.1 pandas==1.1.2 altair==4.1.0 requests==2.24.0 zipfile==3.2.0
for ChainMap and os and tarfile I am not 100% sure
Selecting the best model: I think we should test RBF SVM and LR and pick which model to proceed with based on the best scores. RBF SVM would be interesting to test since we can change the support vectors with expert knowledge. Insight from the paper suggests a logistic regression model may be most useful for our question, especially with information about feature correlation (weights will be assigned proportionally), and it makes it easier to determine informative features. ^ can add this sentence about testing both and reasoning why
EDA discoveries: add link to EDA https://github.com/UBC-MDS/group29/blob/main/reports/EDA/EDA_initial.ipynb
EDA discoveries: add a sentence describing one table and one figure: "Pandas Profiling will be used to generate a feature analysis, including any interactions and correlations between the features, to assist in data wrangling. After data wrangling, the EDA will provide a table showing the dropped features that are not informative for answering our question. A repeated histogram will be generated to compare the numerical features against each other, highlighting correlations and potential relationships among the features and the target."

jraza19 commented 3 years ago

@rachelywong thank you! I have updated the proposal
