This Python package is for building and evaluating large general pre-trained models on data in the OMOP Common Data Model (CDM) format. The models are fitted on the structured data (concepts) in the CDM, not on natural language text. We aim to evaluate these models on various tasks, such as patient-level prediction (either zero-shot or fine-tuned).
This package assumes the GeneralPretrainedModelTools R package has been executed to retrieve (a sample of) the CDM data to local Parquet files. After this, a 'cdm_processor' must be run to convert the data to sequence data suitable for a large language model. TODO: how to go from here.
The project is built with Python 3.10, and the project dependencies need to be installed.
Create a new Python virtual environment
python -m venv venv;
source venv/bin/activate;
Install the packages in requirements.txt
pip install -r requirements.txt
In real-world applications, the CDM data can be retrieved from a database using the GeneralPretrainedModelTools R package. For testing purposes, we can simulate CDM data using a built-in simulator:
Edit simulator.ini so the root_folder argument points to a folder on the local file system.
Run:
PYTHONPATH=./: python simulating/simulator.py simulator.ini
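Before running the simulator, it can help to confirm that the configuration points at a usable folder. The following is a minimal sketch and not part of the package. It assumes simulator.ini is a standard INI file with section headers that Python's configparser can read, and it searches every section for a root_folder key because the section name is not stated here. The helper name find_root_folder is hypothetical.

```python
import configparser
from pathlib import Path
from typing import Optional


def find_root_folder(ini_path: str) -> Optional[Path]:
    """Return the first 'root_folder' value found in any section, or None.

    Hypothetical helper: the section that holds root_folder is not known,
    so every section is checked in file order.
    """
    # interpolation=None keeps '%' characters in paths from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(ini_path)  # a missing file is skipped without raising
    for section in parser.sections():
        if parser.has_option(section, "root_folder"):
            return Path(parser.get(section, "root_folder")).expanduser()
    return None


if __name__ == "__main__":
    root = find_root_folder("simulator.ini")
    if root is None:
        print("No root_folder setting found in simulator.ini")
    elif not root.is_dir():
        print(f"root_folder does not exist yet: {root}")
    else:
        print(f"root_folder is ready: {root}")
```

Whether the simulator creates a missing folder on its own is not covered here, so the "does not exist yet" message is informational only.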
By default, the simulation script will generate pretraining data in a subfolder called 'pretraining'.
In addition, data will be generated for a patient-level prediction task, where patient data up to an index date is used to predict whether a patient will have a certain condition in the prediction window (default = 365 days) after the index date. Training data, for fine-tuning the pretrained model, will be generated in a subfolder called 'train'. Test data, for evaluating the fine-tuned model, will be generated in a subfolder called 'test'. In both 'train' and 'test' folders, subfolders will be generated for a subset of simulated concept IDs with labels indicating whether the concept was observed in the prediction window.
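The prediction task described above can be illustrated with a small sketch. This is not the simulator's code: the function names are hypothetical, and the boundary conventions (history includes the index date, the prediction window starts the day after the index date and includes its last day) are assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple

# An event is (date, concept_id); this layout is an assumption for the sketch.
Event = Tuple[date, int]


def history_up_to_index(events: Iterable[Event], index_date: date) -> List[Event]:
    """Keep only events on or before the index date (the model input)."""
    return [e for e in events if e[0] <= index_date]


def label_in_window(
    events: Iterable[Event],
    index_date: date,
    concept_id: int,
    prediction_window_days: int = 365,
) -> bool:
    """True if concept_id is observed in the window after the index date."""
    window_end = index_date + timedelta(days=prediction_window_days)
    return any(
        cid == concept_id and index_date < d <= window_end
        for d, cid in events
    )


if __name__ == "__main__":
    patient = [
        (date(2019, 5, 1), 101),
        (date(2020, 6, 1), 202),  # falls inside the 365-day window
        (date(2021, 6, 1), 303),  # falls after the window
    ]
    index = date(2020, 1, 1)
    print(history_up_to_index(patient, index))
    print(label_in_window(patient, index, concept_id=202))
    print(label_in_window(patient, index, concept_id=303))
```

One label per concept ID matches the layout described above, where each subfolder corresponds to one simulated concept used as the prediction target.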
Edit cdm_processor.ini to point to folders on the local file system, e.g. the 'pretraining' folder generated by the simulation script.
Run:
PYTHONPATH=./: python cdm_processing/cdm_processor.py cdm_processor.ini
Edit model_trainer.ini to point to folders on the local file system, e.g. the 'patient_sequence' folder generated by the CDM processing script.
Run:
PYTHONPATH=./: python training/train_model.py model_trainer.ini
On macOS, you may need to set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to avoid an error.
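The variable can also be set from Python instead of the shell. The sketch below is an illustration rather than part of the package: the helper name enable_mps_fallback is hypothetical, and it assumes the variable has to be in place before PyTorch is imported, so it should run at the very top of the entry script.

```python
import os
import platform
from typing import Optional


def enable_mps_fallback(system_name: Optional[str] = None) -> bool:
    """Set PYTORCH_ENABLE_MPS_FALLBACK=1 on macOS unless it is already set.

    Returns True when running on macOS ("Darwin"), False otherwise.
    system_name exists only so the behavior can be checked on other platforms.
    """
    if system_name is None:
        system_name = platform.system()
    if system_name != "Darwin":
        return False
    # setdefault leaves a value already chosen by the user untouched.
    os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
    return True


if __name__ == "__main__":
    if enable_mps_fallback():
        print("MPS fallback enabled for this process")
    # Import torch only after this point so the setting is visible to it.
```

Setting the variable this way affects only the current process and its children, which is equivalent to prefixing the training command with the variable in the shell.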
Apollo is licensed under Apache License 2.0.
Under development. Do not use.