Join us for our webinar on the CV19 Index on Wednesday, April 8th, 2020 from 2:00 – 3:00pm CDT.
With the 1.1.0 release, the CV19 Index can now make predictions for any adult. It is no longer restricted to Medicare populations.
Install | Data Prep | Running The Model | Interpreting Results | Model Performance | Contributing | Release Notes
The COVID-19 Vulnerability Index (CV19 Index) is a predictive model that identifies people who are likely to have a heightened vulnerability to severe complications from COVID-19 (commonly referred to as “The Coronavirus”). The CV19 Index is intended to help hospitals, federal / state / local public health agencies and other healthcare organizations in their work to identify, plan for, respond to, and reduce the impact of COVID-19 in their communities.
Full information on the CV19 Index, including the links to a full FAQ, User Forums, and information about upcoming Webinars is available at http://cv19index.com
This repository provides information for those interested in running the COVID-19 Vulnerability Index on their own data. We provide the index as a pretrained model implemented in Python. We provide the source code, models, and example usage of the CV19 Index.
The CV19 Index utilizes only a few fields which can be extracted from administrative claims or electronic medical records. The data requirements have intentionally been kept very limited in order to facilitate rapid implementation while still providing good predictive power. ClosedLoop is also offering a free, hosted version of the CV19 Index that uses additional data and provides better accuracy. For more information, see http://cv19index.com
The CV19 Index can be installed from PyPI:
pip install cv19index
Note for Windows users: Some Microsoft Windows users have gotten errors when running pip related to installing the SHAP and XGBoost dependencies. For these users we have provided prebuilt wheel files. To use these, download the wheel for SHAP and/or XGBoost to your machine. Then, from the directory where you downloaded the files, run:
pip install xgboost-1.0.2-py3-none-win_amd64.whl
pip install shap-0.35.0-cp37-cp37m-win_amd64.whl
These wheel files are for Python 3.7. If you have a different Python version and would like prebuilt binaries, try https://www.lfd.uci.edu/~gohlke/pythonlibs/ . If you still have trouble, please create a GitHub issue.
The CV19 Index requires 2 data files: a demographics file and a claims file. They can be comma-separated value (CSV) or Excel files. The first row is a header row and the remaining rows contain the data. In each file, certain columns are used, and any extra columns will be ignored.
The model requires at least 6 months of claims history, so only those members with at least 6 months of prior history should be included. It is not necessary that they have any claims during this period.
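As a rough illustration of the 6-month rule, the sketch below filters a member list by coverage start date. The `coverage_start` field, the `has_enough_history` helper, and the whole-calendar-month counting are assumptions made for this example; they are not part of the CV19 Index package.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


def has_enough_history(coverage_start: date, as_of: date, min_months: int = 6) -> bool:
    """True if the member has at least min_months of history before as_of."""
    return months_between(coverage_start, as_of) >= min_months


# Hypothetical members; coverage_start is an assumed field name.
members = [
    {"personId": "a1", "coverage_start": date(2018, 1, 15)},
    {"personId": "b2", "coverage_start": date(2018, 9, 1)},
]
as_of = date(2018, 12, 31)

# Only members with enough history should go into the demographics file.
eligible = [m["personId"] for m in members if has_enough_history(m["coverage_start"], as_of)]
```

Members without any claims in the window still qualify, since the rule is about available history rather than claim activity.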
Sample input files, demographics.xlsx and claims.xlsx, are in the examples directory.
The demographics file should contain one row for each person on whom you want to run a prediction.
There are 3 required fields in the demographics file:
The claims file contains a summary of medical claims for each patient. There can be multiple rows for each patient, one per claim. Both inpatient and outpatient claims should be included in the one file. If a patient has no claims, that patient should have no corresponding rows in this file.
There are 6 required fields and several optional fields in the claims file:
personId from the demographics table.
Z79.4 or Z794
Note: if a patient first goes to the emergency room and is later admitted, both the erVisit and inpatient flags should be set to true.
If you need to enter more than 15 diagnosis codes for a claim, you can repeat the row, set the erVisit and inpatient flags to false, and then add in the additional diagnosis codes on the new row.
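The row-splitting rule for long diagnosis lists could be sketched as follows. This is only an illustration: the `dx1` to `dx15` column names and the `split_claim` helper are assumptions, the diagnosis codes are placeholders, and the other required claim fields are left out.

```python
from typing import Dict, List

MAX_DX_PER_ROW = 15  # per-row limit stated in the text above


def split_claim(person_id: str, er_visit: bool, inpatient: bool,
                dx_codes: List[str]) -> List[Dict[str, object]]:
    """Spread one claim's diagnosis codes over as many rows as needed."""
    rows = []
    # max(..., 1) ensures a claim with no codes still yields one row.
    for start in range(0, max(len(dx_codes), 1), MAX_DX_PER_ROW):
        first = start == 0
        row: Dict[str, object] = {
            "personId": person_id,
            # Only the first row carries the visit flags; repeats are false.
            "erVisit": er_visit if first else False,
            "inpatient": inpatient if first else False,
        }
        chunk = dx_codes[start:start + MAX_DX_PER_ROW]
        for i, code in enumerate(chunk, start=1):
            row[f"dx{i}"] = code  # assumed column naming
        rows.append(row)
    return rows


# 20 placeholder codes produce one full row and one overflow row.
rows = split_claim("772775338f7ee353", True, True, [f"A{n:02d}" for n in range(20)])
```

Setting the flags to false on the overflow rows keeps a single ER visit or admission from being counted more than once.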
If you have installed the CV19 Index from PyPI, it will create an executable that you can run. The following command, run from the root directory of the GitHub checkout, generates predictions on the example data and writes the results to examples/predictions.csv.
Note: The -a 2018-12-31 argument is only needed because the example data is from 2018. If you are using current data, you can omit this argument.
cv19index -a 2018-12-31 examples/demographics.csv examples/claims.csv examples/predictions.csv
We also provide a run_cv19index.py script you can use to generate predictions from Python directly:
python run_cv19index.py -a 2018-12-31 examples/demographics.csv examples/claims.csv examples/predictions.csv
Help is available which provides full details on all of the available options:
python run_cv19index.py -h
The output file created by the CV19 Index contains the predictions along with explanations of the factors that influenced those predictions.
If you simply want a list of the most vulnerable people, sort the file based on descending prediction. This will give you the population sorted by vulnerability, with the most vulnerable person first.
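A minimal sketch of that sort using only the standard library is shown here. The inline sample stands in for the real output file and assumes the `personId` and `prediction` column names used in the predictions output.

```python
import csv
import io

# Stand-in for the predictions file; with real output, replace this with
# open("examples/predictions.csv", newline="").
sample = io.StringIO(
    "personId,prediction,risk_score\n"
    "d45d10ed2ec861c4,0.008979,98\n"
    "772775338f7ee353,0.017149,100\n"
)

rows = list(csv.DictReader(sample))

# CSV values are strings, so convert to float before sorting.
ranked = sorted(rows, key=lambda r: float(r["prediction"]), reverse=True)
top_ids = [r["personId"] for r in ranked]
```

Converting to float matters: sorting the raw strings would order values lexically rather than numerically.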
If you'd like to do more analysis, the predictions file also contains other information, including explanations of which factors most influenced the risk, both positively and negatively.
Here is a sample of the predictions output:
personId | prediction | risk_score | pos_factors_1 | pos_patient_values_1 | pos_shap_scores_1 | ... |
---|---|---|---|---|---|---|
772775338f7ee353 | 0.017149 | 100 | Diagnosis of Pneumonia | True | 0.358 | |
d45d10ed2ec861c4 | 0.008979 | 98 | Diagnosis of Pneumonia | True | 0.264 | |
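To pull the positive risk factors out of one output row, something like the sketch below may help. It assumes that further factor columns follow the same numbered pattern as `pos_factors_1` and `pos_shap_scores_1` in the sample; the `positive_factors` helper is hypothetical and not part of the package.

```python
import re
from typing import Dict, List, Tuple


def positive_factors(row: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (factor, SHAP score) pairs for one row, largest score first."""
    factors = []
    for column, value in row.items():
        match = re.fullmatch(r"pos_factors_(\d+)", column)
        if match and value:
            # Look up the SHAP score column with the same numeric suffix.
            shap = row.get(f"pos_shap_scores_{match.group(1)}", "")
            if shap:
                factors.append((value, float(shap)))
    return sorted(factors, key=lambda f: f[1], reverse=True)


# First row of the sample output above, as csv.DictReader would return it.
row = {
    "personId": "772775338f7ee353",
    "prediction": "0.017149",
    "risk_score": "100",
    "pos_factors_1": "Diagnosis of Pneumonia",
    "pos_patient_values_1": "True",
    "pos_shap_scores_1": "0.358",
}
factors = positive_factors(row)
```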
In addition to the personId, the output contains:
There are 3 different versions of the CV19 Index. Each is a different predictive model for the CV19 Index. The models represent different tradeoffs between ease of implementation and overall accuracy. A full description of the creation of these models is available in the accompanying MedRxiv paper, "Building a COVID-19 Vulnerability Index" (http://cv19index.com).
The 3 models are:
Simple Linear - A simple linear logistic regression model that uses only 14 variables. An implementation of this model is included in this package. This model had a 0.731 ROC AUC on our test set. A pickle file containing the parameters for this model is available in the lr.p file.
Open Source ML - An XGBoost model, packaged with this repository, that uses Age, Gender, and 500+ features defined from the CCSR categorization of diagnosis codes. This model had a 0.810 ROC AUC on our test set.
Free Full - An XGBoost model that fully utilizes all the data available in Medicare claims, along with geographically linked public and Social Determinants of Health data. This model provides the highest accuracy of the 3 CV19 Indexes but requires additional linked data and transformations that preclude a straightforward open-source implementation. ClosedLoop is making a free, hosted version of this model available to healthcare organizations. For more information, see http://cv19index.com.
We evaluate the model using a full train/test split. The models are tested on 369,865 individuals. We express model performance using the standard ROC curves, as well as the following metrics:
Model | ROC AUC | Sensitivity at 3% Alert Rate | Sensitivity at 5% Alert Rate |
---|---|---|---|
Logistic Regression | .731 | .214 | .314 |
XGBoost, Diagnosis History + Age | .810 | .234 | .324 |
XGBoost, Full Features | .810 | .251 | .336 |
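For readers unfamiliar with the alert-rate metric, the sketch below shows one common way to compute it: flag the highest-scoring fraction of the population and measure what share of the true positive cases falls in that group. This is our reading of the metric, illustrated on made-up toy data; it is not the evaluation code used to produce the table.

```python
from typing import Sequence


def sensitivity_at_alert_rate(scores: Sequence[float], labels: Sequence[int],
                              alert_rate: float) -> float:
    """Share of positive cases captured when flagging the top alert_rate of scores."""
    total_positive = sum(labels)
    if total_positive == 0:
        raise ValueError("sensitivity is undefined without positive cases")
    n_alerts = max(1, round(len(scores) * alert_rate))
    # Indices of the population ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    captured = sum(labels[i] for i in order[:n_alerts])
    return captured / total_positive


# Toy data: 100 people with increasing scores and 4 positive cases.
scores = [i / 100 for i in range(100)]
labels = [1 if i in (10, 50, 98, 99) else 0 for i in range(100)]

at_3_percent = sensitivity_at_alert_rate(scores, labels, 0.03)
at_50_percent = sensitivity_at_alert_rate(scores, labels, 0.50)
```

In the toy data, a 3% alert rate flags the 3 highest-scoring people, who include 2 of the 4 positive cases.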
We are not allowed to share the data used to train the models with our collaborators, but there are tons of ways you can help. If you are interested in participating, just pick up one of the issues marked with the GitHub "help wanted" tag or contact us at covid19-info@closedloop.ai
A few examples are: