Optum / retain-keras

Reimplementation of RETAIN Recurrent Neural Network in Keras
Apache License 2.0

RETAIN-Keras: Keras reimplementation of RETAIN

RETAIN is a neural network architecture originally introduced by Edward Choi that enables the creation of highly interpretable Recurrent Neural Network models for patient diagnosis without any loss in model performance. This repository holds the Keras reimplementation of RETAIN (originally in Theano), which allows for flexible modifications to the original code, introduces multiple new features, and increases the speed of training. RETAIN has been shown to be highly effective for creating predictive models for a multitude of conditions, and we are excited to share this implementation with the broader healthcare data science community.

Improvements and Extra Features

Installing RETAIN-Keras and Building the Environment

To run the scripts in this repository, create a Python 3.7.9 virtual environment and install the dependencies in requirements.txt. We recommend using Anaconda to create your environment with the following commands:

git clone https://github.com/Optum/retain-keras.git
cd retain-keras
conda create --name=retain python=3.7.9
conda activate retain
pip install -r requirements.txt

Running the code

Training Arguments

The retain_train.py script trains the RETAIN model, evaluating and saving it after each epoch. The script accepts several arguments to customize training and the model:

Evaluation Arguments

The retain_evaluation.py script evaluates a specified RETAIN model and creates some sample graphs. Arguments include:

Interpretation Arguments

The retain_interpretations.py script computes probabilities for all patients and then lets the user select patients by ID to see their risk scores and interpret individual visits (displayed as pandas dataframes). We highly recommend extracting this script into a notebook to enable more dynamic interaction. Arguments include:

Data and Target Format

By default the data has to be saved as a pickled pandas dataframe with the following format:

By default the target has to be saved as a pickled pandas dataframe with the following format:
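As an illustration of the pickling step only, here is a minimal sketch that saves and reloads two pandas dataframes. The column names `codes` and `target`, the sample values, and the file names `data.pkl` and `target.pkl` are placeholders assumed for this example; they are not a statement of the required schema.

```python
import os
import tempfile

import pandas as pd

# Hypothetical columns and values, for illustration only:
# one row per patient, with a nested list of per-visit codes.
data = pd.DataFrame({"codes": [[[1, 2], [3]], [[4], [5, 6], [7]]]})
target = pd.DataFrame({"target": [0, 1]})

# Write both frames to a temporary directory (file names are assumptions).
out_dir = tempfile.mkdtemp()
data_path = os.path.join(out_dir, "data.pkl")
target_path = os.path.join(out_dir, "target.pkl")

data.to_pickle(data_path)
target.to_pickle(target_path)

# Reload the pickles to confirm the nested lists survive the round trip.
data_loaded = pd.read_pickle(data_path)
target_loaded = pd.read_pickle(target_path)
```

Pickle preserves Python objects such as nested lists inside dataframe cells, which a flat format like CSV would turn into strings.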

Sample Data Generation Using MIMIC-III

You can quickly test this reimplementation by creating a sample dataset from MIMIC-III data using the process_mimic_modified.py script. To run this script, you will need to request access to MIMIC-III, a de-identified database covering 11 years of clinical care data. If you do not wish to request access to the full data, you can freely download the MIMIC-III sample demo data and use it for exploratory benchmarks. The process_mimic_modified.py script borrows heavily from the original process_mimic.py script created by Edward Choi, but is modified to output data in the format specified above. It writes the necessary files to a user-specified directory and splits them into train and test sets by a user-specified ratio.

Example:

Run from the MIMIC-III directory. This will split data with 70% going to training and 30% to test:

python process_mimic_modified.py ADMISSIONS.csv DIAGNOSES_ICD.csv PATIENTS.csv data .7
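
The effect of the ratio argument can be pictured with a short sketch. This is not the code of process_mimic_modified.py: the function name `split_by_ratio`, the shuffling, the fixed seed, and the rounding rule are all assumptions made for illustration, and the real script may order or round differently.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_ratio(
    records: Sequence[T], train_ratio: float, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle records and split them into (train, test) by train_ratio.

    Hypothetical helper: shuffling and rounding are assumptions,
    not the behavior of the actual script.
    """
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    patient_ids = list(range(10))  # stand-in for patient records
    train, test = split_by_ratio(patient_ids, 0.7)
    print(len(train), len(test))
```

With ten records and a ratio of .7, the sketch places seven records in the training set and three in the test set.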

License

Please review the license, notice, and other documents before using the code in this repository or making a contribution to it.

Contributing

To contribute features, bug fixes, tests, examples, or documentation, please submit a pull request with a description of your proposed changes or additions.

Please make sure that your code follows the PEP 8 style guide. To do this, run pip install black and then black retain-keras to reformat the files in your copy of the code. The black code formatter is a PEP 8 compliant, opinionated formatter that reformats entire files in place. You can also use the autopep8 code formatter within your IDE to ensure PEP 8 compliance.


References

  1. Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, Jimeng Sun, 2016, RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism, In Proc. of Neural Information Processing Systems (NIPS) 2016, pp.3504-3512. https://github.com/mp2893/retain

  2. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220 [Circulation Electronic Pages; http://circ.ahajournals.org/content/101/23/e215.full]; 2000 (June 13).