DrugCell is an interpretable neural network-based model that predicts cell response to a wide range of drugs. Unlike fully connected neural networks, the connectivity of neurons in DrugCell mirrors a biological hierarchy (e.g. Gene Ontology), so that during model training information travels only between subsystems (or pathways) with a known hierarchical relationship. This design makes it possible to identify the subsystems in the hierarchy that are important to the model's predictions and that warrant further investigation into the underlying biological mechanisms of cell response to treatment.
conda>=23.5
The IMPROVE project requires standardized interfaces for data preprocessing, training, and inference. The DrugCell code in this repository follows those interfaces.
The IMPROVE project is currently using the develop branch.
Create environment
conda env create -f drugcell_conda.yml
Activate the environment
conda activate drugcell_python
Download DrugCell
git clone -b develop https://github.com/JDACS4C-IMPROVE/DrugCell.git
cd DrugCell
Install PyTorch (CUDA build) and the CANDLE package
python3 -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 torchmetrics==0.11.1 --extra-index-url https://download.pytorch.org/whl/cu113
python3 -m pip install git+https://github.com/ECP-CANDLE/candle_lib@develop
**Example usage without container (running DrugCell)**
Preprocess (optional)
bash preprocess.sh $CUDA_VISIBLE_DEVICES $CANDLE_DATA_DIR
Training
bash train.sh $CUDA_VISIBLE_DEVICES $CANDLE_DATA_DIR
Testing
bash infer.sh $CUDA_VISIBLE_DEVICES $CANDLE_DATA_DIR
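The three stage scripts above take the same two positional arguments, so an unset `CUDA_VISIBLE_DEVICES` or `CANDLE_DATA_DIR` silently shifts or drops them. The following is a minimal sketch of a pre-flight check in Python; the helper name `build_stage_command`, the `STAGE_SCRIPTS` mapping, and the example values are hypothetical and not part of the repository.

```python
import os

# Stage name -> script, as listed in the usage section above.
STAGE_SCRIPTS = {
    "preprocess": "preprocess.sh",
    "train": "train.sh",
    "infer": "infer.sh",
}

REQUIRED_VARS = ("CUDA_VISIBLE_DEVICES", "CANDLE_DATA_DIR")


def build_stage_command(stage, env=None):
    """Return the argv list for one stage, failing early on unset variables.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("unset environment variables: " + ", ".join(missing))
    return [
        "bash",
        STAGE_SCRIPTS[stage],
        env["CUDA_VISIBLE_DEVICES"],
        env["CANDLE_DATA_DIR"],
    ]


# Example values are placeholders, not recommended settings.
example_env = {"CUDA_VISIBLE_DEVICES": "0", "CANDLE_DATA_DIR": "/tmp/candle_data"}
for stage in STAGE_SCRIPTS:
    print(" ".join(build_stage_command(stage, example_env)))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from inside the DrugCell directory if you prefer to drive the stages from Python.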
pip install --upgrade pip
python3 -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 torchmetrics==0.11.1 --extra-index-url https://download.pytorch.org/whl/cu113
python3 -m pip install networkx
python3 -m pip install git+https://github.com/ECP-CANDLE/candle_lib@develop
git clone -b develop https://github.com/JDACS4C-IMPROVE/DrugCell.git
cd DrugCell
python3 -m pip install -r requirements.txt
chmod a+x *.sh
chmod a+x *.py
sh train.sh 1 data
The model definition file 'DrugCell.def' is located in the Singularity repository:
git clone -b develop https://github.com/JDACS4C-IMPROVE/Singularity.git
cd Singularity
Build the Singularity image
singularity build --fakeroot DrugCell.sif definitions/DrugCell.def
Execute with container
singularity exec --nv DrugCell.sif train.sh $CUDA_VISIBLE_DEVICES $CANDLE_DATA_DIR
DrugCell v1.0 was trained using (cell line, drug) pairs, but it can be generalized to estimate the response of any cell to any drug if:
The pre-trained DrugCell v1.0 model and the drug response data for 509,294 (cell line, drug) pairs used to train it are available at http://drugcell.ucsd.edu/downloads.
Required input files:
To load the pre-trained model used for the analyses in our manuscript and make predictions for the (cell, drug) pairs of interest, execute the following:
Make sure you have gene2ind.txt, cell2ind.txt, cell2mutation.txt, drug2ind.txt, drug2fingerprint.txt, and your file containing test data in the proper format (examples are provided in the data and sample folders).
Cell feature files: gene2ind.txt, cell2ind.txt, cell2mutation.txt
Drug feature files: drug2ind.txt, drug2fingerprint.txt
Training data file: drugcell_train.txt
Validation data file: drugcell_val.txt
Ontology (hierarchy) file: drugcell_ont.txt
A tab-delimited file containing the ontology (hierarchy) that defines the structure of the genotype-encoding branch of a DrugCell model. The first column is always a term (subsystem or pathway), and the second column is either a term or a gene. The third column should be set to "default" when the line represents a link between two terms, and to "gene" when the line represents an annotation link between a term and a gene. The following is an example describing a sample hierarchy.
GO:0045834 GO:0045923 default
GO:0045834 GO:0043552 default
GO:0045923 AKT2 gene
GO:0045923 IL1B gene
GO:0043552 PIK3R4 gene
GO:0043552 SRC gene
GO:0043552 FLT1 gene
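As a rough illustration of how the three columns can be interpreted, the sketch below splits ontology rows into term-to-term links and term-to-gene annotations. It is not DrugCell's own loader; the function name `parse_ontology` is hypothetical, and the rows are assumed to be tab-delimited as described above.

```python
from collections import defaultdict


def parse_ontology(lines):
    """Split ontology rows into (term -> child terms, term -> genes).

    Hypothetical helper for illustration; expects tab-delimited rows of
    the form: parent_term <TAB> child_term_or_gene <TAB> default|gene
    """
    term_children = defaultdict(list)
    term_genes = defaultdict(list)
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        fields = raw.rstrip("\n").split("\t")
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {raw!r}")
        parent, child, link_type = fields
        if link_type == "default":
            term_children[parent].append(child)  # term-to-term link
        elif link_type == "gene":
            term_genes[parent].append(child)  # term-to-gene annotation
        else:
            raise ValueError(f"unknown link type {link_type!r}")
    return dict(term_children), dict(term_genes)


# The sample hierarchy from the text, written with explicit tabs.
SAMPLE_LINES = [
    "GO:0045834\tGO:0045923\tdefault",
    "GO:0045834\tGO:0043552\tdefault",
    "GO:0045923\tAKT2\tgene",
    "GO:0045923\tIL1B\tgene",
    "GO:0043552\tPIK3R4\tgene",
    "GO:0043552\tSRC\tgene",
    "GO:0043552\tFLT1\tgene",
]

term_children, term_genes = parse_ontology(SAMPLE_LINES)
print(term_children)
print(term_genes)
```

In this sample, GO:0045834 is the root term with two child subsystems, and each child carries its own gene annotations.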
There are a few optional parameters that you can provide in addition to the input files:
-model: the name of the directory where you want to store the trained models. The default is "MODEL" in the current working directory.
-genotype_hiddens: the number of neurons to assign to each subsystem in the hierarchy. The default is 6.
-drug_hiddens: a string listing the number of neurons in each layer of the drug-encoding branch of DrugCell, delimited by commas. The default value is "100,50,6"; with this default, the drug branch of the resulting DrugCell model is a fully connected neural network with 3 layers of 100, 50, and 6 neurons.
-final_hiddens: the number of neurons in the top layer of DrugCell that combines the genotype-encoding and the drug-encoding branches. The default is 6.
-epoch: the number of epochs to run during the training phase. The default is 300.
-batchsize: the number of samples to process in each batch. The default is 5000. You may increase this number to speed up training, within the memory capacity of your GPU server.
-cuda: the ID of the GPU that you want to use for model training. The default is GPU 0.
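To make the defaults above concrete, here is a minimal `argparse` sketch that mirrors the listed options. It is an illustration, not the training script's actual parser: the function names `build_parser` and `parse_layer_sizes` are hypothetical, and the underscored spellings `-genotype_hiddens`, `-drug_hiddens`, and `-final_hiddens` are assumptions about the exact flag names.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the optional parameters listed above."""
    parser = argparse.ArgumentParser(description="DrugCell-style training options (sketch)")
    parser.add_argument("-model", default="MODEL", help="directory for trained models")
    parser.add_argument("-genotype_hiddens", type=int, default=6,
                        help="neurons per subsystem in the hierarchy")
    parser.add_argument("-drug_hiddens", default="100,50,6",
                        help="comma-delimited layer sizes of the drug branch")
    parser.add_argument("-final_hiddens", type=int, default=6,
                        help="neurons in the layer combining both branches")
    parser.add_argument("-epoch", type=int, default=300, help="number of training epochs")
    parser.add_argument("-batchsize", type=int, default=5000, help="samples per batch")
    parser.add_argument("-cuda", type=int, default=0, help="ID of the GPU to use")
    return parser


def parse_layer_sizes(spec):
    """Turn a string such as "100,50,6" into a list of layer sizes."""
    return [int(size) for size in spec.split(",")]


# With no arguments, every option falls back to the documented default.
args = build_parser().parse_args([])
print(args.model, args.epoch, args.batchsize, args.cuda)
print(parse_layer_sizes(args.drug_hiddens))

# Overriding the drug branch yields a different layer layout.
custom = build_parser().parse_args(["-drug_hiddens", "200,100,6"])
print(parse_layer_sizes(custom.drug_hiddens))
```

Keeping `-drug_hiddens` as a single comma-delimited string matches the description above, and converting it to a list only when the layers are built keeps the command line short.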
Finally, to train a DrugCell model, execute a command line similar to the example provided in sample/commandline_cuda.sh:
Three subsets of our training data are provided as toy examples: drugcell_train.txt, drugcell_test.txt, and drugcell_val.txt contain 10,000, 1,000, and 1,000 (cell line, drug) pairs, respectively, along with the corresponding drug response (area under the dose-response curve).
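For a quick sanity check of such a file, the sketch below reads (cell line, drug, response) rows into Python tuples. The column order, the tab delimiter, and the identifiers in the toy input are assumptions made for illustration, so check them against the files in the data and sample folders before relying on this.

```python
import csv
import io


def read_response_pairs(handle):
    """Read rows of (cell line, drug, response) from a tab-delimited stream.

    Hypothetical helper; assumes exactly three columns per row with the
    response (area under the dose-response curve) in the last column.
    """
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        cell_line, drug, response = row
        pairs.append((cell_line, drug, float(response)))
    return pairs


# Placeholder identifiers and values, not taken from the real data files.
toy = io.StringIO("CELL_A\tDRUG_X\t0.75\nCELL_B\tDRUG_Y\t1.02\n")
pairs = read_response_pairs(toy)
print(len(pairs), pairs[0])
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("drugcell_train.txt", newline="")`.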