If you make use of this code please cite the Zenodo DOI: https://doi.org/10.5281/zenodo.7062992
This is a set of scripts to get data from rcellminer, convert the data for DrugCell, run DrugCell, and explain the results.
Get data from rcellminer
Replace Cell Line name from nci60 to CCLE
Train, Val, Test data extraction
Data extraction for running DrugCell
Run DrugCell
Get Correlation score for GO terms
Get Correlation score for genes
Evaluate each drug response
(Optional) PCA-based visualization with cytoscape
(Optional) Comaparison with autokeras
(Optional) Counting top 20 correlation of genes and GO terms for all drugs.
(Optional) Classification for missing label's data
(Optional) Hyperparameter tuning
git clone git@github.com:inoue0426/DrugCell.git
unzip ./DrugCell/MODEL/model.pt.zip
mkdir Hidden
mkdir Result
python code/predict_drugcell.py -gene2id ./DrugCell/data_rcellminer/gene2ind.txt \
-cell2id ./DrugCell/data_rcellminer/cell2ind.txt \
-drug2id ./DrugCell/data_rcellminer/drug2ind.txt \
-genotype ./DrugCell/data_rcellminer/cell2mut.txt \
-fingerprint ./DrugCell/data_rcellminer/drug2fingerprint.csv \
-predict ./DrugCell/data_rcellminer/test_DNA.txt \
-hidden ./Hidden \
-result ./Result \
-load ./DrugCell/pretrained_model_rcellminer/model.pt
This is implemented based on RLIPP, the evaluation function for DrugCell. github
Based on the correlation, this code interprets which GO is effective for each drug. The Hidden Layer's value is first obtained for each drug's GO. A ridge regression is performed using this as the feature value and DrugCell's predicted value as y. Afterwards, the predicted value is compared with the predicted value in DrugCell. The correlation between this predicted value and the predicted value of DrugCell helps us determine how well the hidden layer of this GO is performing.
This is the implementation to get the correlation for genes.
Firstly from the graph structure, you can get the path from the gene to GO:0008150. There are so many PATHS to get there, and this code utilizes average to define each gene's correlation.
<!-- get PCA result -->
python code/get_pca_result.py DrugCell/data/rcellminer_test.txt /PATH/To/Hidden/
<!-- get graph structure -->
python code/get_graph_structure.py SAMPLE_INDEX /PATH/To/Hidden/ DrugCell/data/drugcell_ont.txt
<!-- get GO importance using PCA -->
python code/get_GO_importance.py /PATH/To/Hidden/ ./DrugCell/data/test_rcell_over50_not_equal.txt ./DrugCell/data/SMILES_from_PubchemID.txt PUBCHEM_ID
This is the script to get visualizatian and graph structure. This requires these files:
When you run the get_pca_result.py, this concatenates all GO:XXXXXX.hidden for each sample and then runs PCA. So if you run it, you can get a figure as below. Coloring is based on drug response.
When you run the get_graph_structure.py, you can get graph.csv and weight.csv. This requires sample indexes like "100,200,300". This means that from the test data, this code picks up sample indexes 100, 200, and 300 and then calculates the weight of each sample. Concretely, we first construct a graph structure based on Gene Ontology. The data contains GO and gene information and is saved as graph.csv. Next, this obtains weights for each sample from the hidden layer. For example, for three samples of 100, 200, and 300, we obtain weights for 2068 nodes (GOs). Here, since the weights of the genes are not included in the hidden layer, we need to calculate them somehow. To do this, this code first enumerates all the ways to GO:0008150 from each gene. Then add up all the weights of each Gene Ontology we pass when going to GO:0008150, which we defined as the weight for each gene.
The following is a visualization using Cytoscape.
When you run get_GO_importance, you can get importance_for_pos_DRUG_INDEX.csv and importance_for_neg_DRUG_INDEX.csv table. In addition this return below figures, first two pics are figure of top 10 PC1 score GO and bottom10 for positive drug responce score's Cell Line and others are for negative drug responce score's Cell Line.
This script's first step is applying PCA to all Hidden Layer values with n_compounds = 1. There are 2000 GOs and 100,000 samples, so a matrix of 100,000*2000 is generated.
Next step is determining which cell line contains the drug based on the Pubchem ID. Since this data has a Drug Response value, it is divided into Positive and Negative data, depending on whether it is greater or equal to 0.
The PCA values for each GO are then averaged for Positives and Negatives. The Top10 and Bottom10 are sorted and saved for each Positive and Negative.