Chaoyu Zhu, Xiaoqiong Xia, Nan Li, Fan Zhong, Zhihao Yang, Lei Liu. RDKG-115: Assisting drug repurposing and discovery for rare diseases by trimodal knowledge graph embedding. Computers in Biology and Medicine, 2023, 164: 107262.
Rare diseases (RDs) each affect only a small number of individuals, but collectively they have a significant impact on a global scale. Accurate diagnosis of RDs is challenging, and there is a severe lack of drugs available for treatment. Because RD drug development involves high investment, high risk, and long cycles, pharmaceutical companies prefer to repurpose existing drugs developed for other diseases. Compared with traditional approaches, knowledge graph embedding (KGE) based methods are more efficient and convenient, as they treat drug repurposing as a link prediction task. KGE models also allow existing knowledge to be enriched with multimodal information from various sources. In this study, we constructed RDKG-115, a rare disease knowledge graph covering 115 RDs, composed of 35,643 entities, 25 relations, and 5,539,839 refined triplets, based on 372,384 high-quality publications and 4 biomedical datasets: DRKG, Pathway Commons, PharmKG, and PMapp. We then developed a trimodal KGE model combining structure, category, and description embeddings via reverse-hyperplane projection, and used it to infer 4,199 reliable new triplets from RDKG-115. Finally, we predicted potential drugs and small molecules for each of the 115 RDs, taking multiple sclerosis as a case study. This study provides a paradigm for large-scale screening in drug repurposing and discovery for RDs, which will speed up drug development and ultimately benefit patients with RDs.
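To make the link prediction framing concrete, here is a minimal illustrative sketch in PyTorch (toy embeddings and a TransE-style scorer; none of these names come from the repository): candidate drugs are ranked by how plausibly they link to a disease through a treats-like relation.

import torch

def transe_score(h, r, t):
    # TransE-style plausibility: smaller ||h + r - t|| means more plausible
    return -torch.norm(h + r - t, p=1, dim=-1)

entities = torch.nn.Embedding(5, 16)                 # toy entity embeddings
relations = torch.nn.Embedding(1, 16)                # toy "treats" relation
disease = entities(torch.tensor(0))
drugs = entities(torch.tensor([1, 2, 3, 4]))         # candidate drugs
treats = relations(torch.tensor(0))

scores = transe_score(drugs, treats, disease.expand_as(drugs))
print(torch.argsort(scores, descending=True))        # best candidates first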
E_dict.json : JSON file of the standard entity set
Models.py : Model structures of TransE, TransH, ConvKB, and RotatE
run_kge.py : Entry script and configuration for knowledge graph embedding
train_kge.py : Processing and utility classes/functions for knowledge graph embedding
C_dict.data : Dictionary of entity category annotations
get_c_dict.py : Run this to generate C_dict.data
BERT.py : Model structure of BERT
DTModel.py : Model structure for training the description table (D_table)
optimization.py : Training optimization for BERT
run_d_Table.py : Run this to generate D_table.data
tokenization.py : Tokenization functions for BERT
train_d_Table.py : Processing and utility classes/functions for D_table
entity.csv : Entities of RDKG-115
relation.csv : Relations of RDKG-115
train.csv : Train set of RDKG-115 (98%)
dev.csv : Dev set of RDKG-115 (1%)
test.csv : Test set of RDKG-115 (1%)
entity_name_map.xlsx : Entity key-name mapping file
RDKG-115.csv : Raw RDKG-115
RDKG-115 (plus inferred).csv : RDKG-115 merged with the reliable newly inferred triplets
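A minimal sketch for loading the splits (the column layout is an assumption; check the CSV headers in the release):

import pandas as pd

train = pd.read_csv("train.csv")    # assumed: one (head, relation, tail) triplet per row
dev = pd.read_csv("dev.csv")
test = pd.read_csv("test.csv")
print(len(train), len(dev), len(test))   # should reflect the 98% / 1% / 1% split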
(1) TransE: Translating Embeddings for Modeling Multi-relational Data
(2) TransH: Knowledge Graph Embedding by Translating on Hyperplanes
(3) ConvKB: A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network
(4) RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space
(5) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (code: https://github.com/google-research/bert)
(6) BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining
(7) PubMedBERT: Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
(8) SciBERT: A Pretrained Language Model for Scientific Text
(1) Python 3.6.2
(2) torch 1.9.1+cu111
(3) numpy 1.19.2
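One possible way to set up this environment with pip (the CUDA wheel index below is an assumption; adjust it to your CUDA setup):

pip install numpy==1.19.2
pip install torch==1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html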
(1) Run get_c_dict.py to get C_dict.data in Model/C/ (already provided in the folder, so running it is optional)
python get_c_dict.py
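C_dict.data maps entities to their category annotations. A minimal inspection sketch, assuming the file was serialized with pickle (the serialization format is an assumption):

import pickle

with open("Model/C/C_dict.data", "rb") as f:
    c_dict = pickle.load(f)
print(len(c_dict))                   # number of annotated entities
print(next(iter(c_dict.items())))    # one (entity, category) pair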
(2) Run run_d_Table.py to get D_table.data in Model/D/
python run_d_Table.py --model pubmedbert --len_d 150 --l_r 1e-5 --batch_size 8 --epoches 5 --earlystop 1
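The repository ships its own BERT code (BERT.py, tokenization.py, optimization.py); for orientation only, the sketch below computes a description embedding with the Hugging Face transformers library instead (the checkpoint name and mean pooling are assumptions, and transformers is not in the requirements above):

import torch
from transformers import AutoModel, AutoTokenizer

name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

desc = "Multiple sclerosis is a chronic demyelinating disease of the central nervous system."
batch = tokenizer(desc, truncation=True, max_length=150, return_tensors="pt")  # len_d = 150, as above
with torch.no_grad():
    hidden = model(**batch).last_hidden_state      # (1, seq_len, 768)
embedding = hidden.mean(dim=1).squeeze(0)          # mean-pooled description vector
print(embedding.shape)                             # torch.Size([768])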
(3) Run run_kge.py to train TransE, TransH, ConvKB, and RotatE in Model/. The lanta_c and lanta_d weights select which embedding modalities are fused (S = structure, C = category, D = description); see the sketch after the list below:
lanta_c == 0 and lanta_d == 0 : S (structure only)
lanta_c != 0 and lanta_d == 0 : S + C (structure + category)
lanta_c == 0 and lanta_d != 0 : S + D (structure + description)
lanta_c != 0 and lanta_d != 0 : S + C + D (all three modalities)
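As a rough sketch of how these weights could enter a fused score (the paper's actual fusion uses reverse-hyperplane projection; the additive form and the names below are simplifications for illustration):

def fused_score(s_score, c_score, d_score, lanta_c=0.0, lanta_d=0.0):
    # The structure score S is always used; the category (C) and description (D)
    # terms are switched on by non-zero lanta_c / lanta_d, as in the table above
    return s_score + lanta_c * c_score + lanta_d * d_score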
TransE
python run_kge.py --model TransE --margin 1.0 --lanta_c 0.0 --lanta_d 0.0 --l_r 5e-3 --epoches 800
python run_kge.py --model TransE --margin 1.0 --lanta_c 0.5 --lanta_d 0.0 --l_r 5e-4 --epoches 200
python run_kge.py --model TransE --margin 1.0 --lanta_c 0.0 --lanta_d 0.5 --l_r 5e-4 --epoches 200
python run_kge.py --model TransE --margin 1.0 --lanta_c 0.1 --lanta_d 0.5 --l_r 5e-4 --epoches 200
TransH
python run_kge.py --model TransH --margin 1.0 --lanta_c 0.0 --lanta_d 0.0 --l_r 5e-3 --epoches 800
python run_kge.py --model TransH --margin 1.0 --lanta_c 0.3 --lanta_d 0.0 --l_r 5e-4 --epoches 200
python run_kge.py --model TransH --margin 1.0 --lanta_c 0.0 --lanta_d 0.5 --l_r 5e-4 --epoches 200
python run_kge.py --model TransH --margin 1.0 --lanta_c 0.3 --lanta_d 0.5 --l_r 5e-4 --epoches 200
ConvKB
python run_kge.py --model ConvKB --margin 2.0 --lanta_c 0.0 --lanta_d 0.0 --l_r 5e-3 --epoches 800
python run_kge.py --model ConvKB --margin 2.0 --lanta_c 0.3 --lanta_d 0.0 --l_r 5e-4 --epoches 200
python run_kge.py --model ConvKB --margin 2.0 --lanta_c 0.0 --lanta_d 0.1 --l_r 5e-4 --epoches 200
python run_kge.py --model ConvKB --margin 2.0 --lanta_c 0.3 --lanta_d 0.5 --l_r 5e-4 --epoches 200
RotatE
python run_kge.py --model RotatE --margin 3.0 --lanta_c 0.0 --lanta_d 0.0 --l_r 5e-3 --epoches 800
python run_kge.py --model RotatE --margin 3.0 --lanta_c 0.3 --lanta_d 0.0 --l_r 5e-4 --epoches 200
python run_kge.py --model RotatE --margin 3.0 --lanta_c 0.0 --lanta_d 0.3 --l_r 5e-4 --epoches 200
python run_kge.py --model RotatE --margin 3.0 --lanta_c 0.1 --lanta_d 0.5 --l_r 5e-4 --epoches 200
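For reference, minimal sketches of two of the scoring functions trained above (illustrative only, not the repository's Models.py):

import torch

def transe(h, r, t, margin=1.0):
    # TransE: h + r should land near t; higher score = more plausible
    return margin - torch.norm(h + r - t, p=1, dim=-1)

def rotate(h, r, t, margin=3.0):
    # RotatE: relations are rotations in complex space
    # h, t: (..., 2d) with real/imaginary halves concatenated; r: (..., d) phases
    re_h, im_h = h.chunk(2, dim=-1)
    re_t, im_t = t.chunk(2, dim=-1)
    re_r, im_r = torch.cos(r), torch.sin(r)        # unit-modulus rotation
    re_rot = re_h * re_r - im_h * im_r
    im_rot = re_h * im_r + im_h * re_r
    dist = torch.cat([re_rot - re_t, im_rot - im_t], dim=-1)
    return margin - torch.norm(dist, p=2, dim=-1)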