UBC-MDS / group29

Project Repo for Group 29 for DSCI 522
MIT License

README - Project Proposal #11

Closed jraza19 closed 3 years ago

jraza19 commented 3 years ago

Please add your ideas here regarding project question/details to add to the explanation.

rachelywong commented 3 years ago

Question: Given a patient’s demographic and hospital background, can we predict if they will be readmitted to the hospital?

Type: Predictive

Importance:
- If performed improperly, diabetes management in hospitalized patients may lead to detrimental outcomes, including morbidity and mortality. Assessments of readmission of diabetes patients can be used to create a baseline and outline aspects of potential protocol changes to improve patient outcomes and lower the healthcare costs of readmitting diabetes patients.
- Something about higher diabetes prevalence in lower-income groups, which is bad because readmission = higher healthcare costs for them, especially during COVID-19.

Dataset Information: This dataset was collected from 1999-2008 among 130 US hospitals and integrated delivery networks. It contains 101,766 diabetes patient encounters, each meeting the following criteria: (1) It is an inpatient encounter (a hospital admission). (2) It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis. (3) The length of stay was at least 1 day and at most 14 days. (4) Laboratory tests were performed during the encounter. (5) Medications were administered during the encounter.

Features:
- 55 total features describing the diabetic encounters (could talk about dropping some of them, like encounter ID and patient number)
- Important features to highlight:
  - admission type (emergency, urgent, elective, etc.)
  - age
  - weight
  - Diagnosis 1 (primary diagnosis)
  - Diabetes medications (whether any diabetic medication was prescribed)
- There is a mix of categorical, numerical, and binary features.
  - can talk about transformers for each
  - imputer for missing values
  - categorical -> ordinal encoding
  - binary -> one hot encoding
  - numerical -> scaling
- class imbalance?
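The transformer plan above could be sketched roughly as follows. This is only an illustration on a toy frame: the column names (`age`, `diabetesMed`, `time_in_hospital`), the toy values, the age ordering, and the imputation strategies are assumptions made for the example, not decisions recorded in this thread.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy stand-in for the encounter data; names and values are illustrative only.
toy = pd.DataFrame({
    "age": ["[40-50)", "[70-80)", np.nan, "[50-60)"],
    "diabetesMed": ["Yes", "No", "Yes", "Yes"],
    "time_in_hospital": [3.0, 7.0, np.nan, 1.0],
})

# Assumed ordering of the age brackets used by the ordinal encoder.
age_levels = ["[40-50)", "[50-60)", "[70-80)"]

preprocessor = ColumnTransformer([
    # categorical -> impute the most frequent level, then ordinal encoding
    ("ordinal",
     make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OrdinalEncoder(categories=[age_levels])),
     ["age"]),
    # binary -> one hot encoding, keeping a single 0/1 column
    ("binary", OneHotEncoder(drop="if_binary"), ["diabetesMed"]),
    # numerical -> impute the median, then scaling
    ("numeric",
     make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     ["time_in_hospital"]),
])

X = preprocessor.fit_transform(toy)
print(X.shape)
```

Each feature group gets its own small pipeline, so the imputer and encoder are fit only on the columns they apply to.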

jraza19 commented 3 years ago

Team decision: we will transform the target column into a binary one by combining the <30 and >30 classes into a single "readmitted" class, giving a readmitted-or-not column.
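A minimal sketch of that transformation, assuming the target column is named `readmitted` and holds the labels `NO`, `<30`, and `>30` (the toy series below is made up for illustration):

```python
import pandas as pd

# Toy target values; the real column name and labels are assumed here.
readmitted = pd.Series(["NO", "<30", ">30", "NO", ">30"], name="readmitted")

# Any readmission, within 30 days or later, becomes the positive class (1).
readmitted_binary = readmitted.isin(["<30", ">30"]).astype(int)

print(readmitted_binary.tolist())  # [0, 1, 1, 0, 1]
```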

rachelywong commented 3 years ago

Some other thoughts I had that we could add to the proposal:

features we decided to drop and why:

how we plan to answer our question:

extra information about the dataset:

sukh2929 commented 3 years ago

I have tried to consolidate the information for the Readme file. Please review and let me know if any changes are required.

Contributors: Javairia Raza, Rachel Wong, Zhiyong Wang, Sukhdeep Kaur

Introduction/Research Question

For this project we are trying to answer the question: given a patient's demographic and hospital background, can we predict if they will be readmitted to the hospital? Answering this question is important because assessments of readmission of diabetes patients can be used to create a baseline and outline aspects of potential protocol changes to improve patient outcomes and lower the healthcare costs of readmitting diabetes patients.

Data set Information

The data are submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058 and a recipient of the CERNER data. The dataset was collected from 1999-2008 among 130 US hospitals and integrated delivery networks. It can also be found at https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#

Exploratory Data Analysis

To answer the predictive question above, we plan to build a predictive classification model. Before building the model, we will carry out exploratory data analysis to check the independence of the data, identify rows with missing values (NAs), drop columns that are irrelevant for prediction, and examine correlations between columns. Then we will partition the data into a training set and a test set (70%:30% split) and assess whether there is a strong class imbalance that we might need to address.
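The split and the class imbalance check might look something like the sketch below. The toy data, the `random_state`, and the use of a stratified split are assumptions for illustration; the proposal itself only fixes the 70%:30% ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 encounters with an already binarized target (illustrative only).
toy = pd.DataFrame({
    "time_in_hospital": range(1, 21),
    "readmitted": [1] * 8 + [0] * 12,
})

# 70%:30% split; stratifying on the target is an assumption, not a team decision.
train_df, test_df = train_test_split(
    toy, test_size=0.3, random_state=522, stratify=toy["readmitted"]
)

# Share of each class in the training set, to judge how imbalanced it is.
class_share = train_df["readmitted"].value_counts(normalize=True)
print(len(train_df), len(test_df))
print(class_share)
```

Looking at the shares on the training set only keeps the test set untouched until the final evaluation.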

Selecting the best model

The data contain a mix of categorical, numerical, and binary features, which we will transform into numerical features. We plan to try SVM and logistic regression, since this is a classification problem (binary, once the <30 and >30 readmission classes are combined). After selecting our final model, we will re-fit it on the entire training data set and then evaluate its performance on the test data set. At that point, we will look at overall accuracy as well as misclassification errors (from the confusion matrix) to assess prediction performance. We will report these values as a table in the final report.
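One way the selection and evaluation steps could be wired together is sketched below on synthetic data. The use of 5-fold cross-validation to pick between the two models, the default hyperparameters, and the synthetic features are all assumptions for the sake of a runnable example, not part of the agreed plan.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the encounter features.
X, y = make_classification(n_samples=300, n_features=8, random_state=522)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=522
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "rbf_svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Mean cross-validated accuracy on the training set for each candidate.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Re-fit the selected model on the whole training set, then score the test set.
final_model = candidates[best_name].fit(X_train, y_train)
y_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
errors = confusion_matrix(y_test, y_pred)

print(best_name, round(test_accuracy, 3))
print(errors)
```

The off-diagonal cells of the confusion matrix are the misclassification counts that would go into the report table.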

References

The paper detailing the data collection and research can be found here: https://www.hindawi.com/journals/bmri/2014/781670/
Feature descriptions for the data can be found here: https://www.hindawi.com/journals/bmri/2014/781670/tab1/
The dataset was taken from https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#

rachelywong commented 3 years ago

I think this is awesome! Very well put together.

My suggestions:

rachelywong commented 3 years ago

Regarding @jraza19's combined proposal:

jraza19 commented 3 years ago

Regarding @jraza19's combined proposal:

  • Dependencies: Python 4.8.3 and the Python packages docopt==0.6.2, urllib3==1.25.11, ChainMap==3.3, os==10.15.6, tarfile==3.3, numpy==1.19.1, pandas==1.1.2, altair==4.1.0, requests==2.24.0, zipfile==3.2.0
  • For ChainMap, os, and tarfile I am not 100% sure.
  • Selecting the best model: I think we should test both an RBF SVM and logistic regression (LR) and proceed with whichever model scores best. An RBF SVM would be interesting to test since we can change the support vectors with expert knowledge. Insights from the paper suggest that a logistic regression model may be most useful for our question, especially given the information about feature correlation (weights will be assigned proportionally), and it makes informative features easier to determine. ^ We can add this sentence about testing both models and the reasoning why.
  • EDA discoveries: add link to EDA https://github.com/UBC-MDS/group29/blob/main/reports/EDA/EDA_initial.ipynb
  • EDA discoveries: add a sentence describing one table and one figure: "Pandas Profiling will be used to generate a feature analysis, including any interactions and correlations between the features, to assist in data wrangling. After data wrangling, the EDA will provide a table showing the dropped features that are not informative for answering our question. A repeated histogram will be generated to compare the numerical features against each other, highlighting correlations and potential relationships among the features and the target."

@rachelywong thank you! I have updated the proposal