Easonhuu / PDI

Predict protein-drug interaction

PDI

This repository accompanies a paper accepted at the 2020 IEEE BIBM conference, which aims at high-precision protein-drug interaction prediction using a GAT (graph attention network) and a Transformer.
It contains the source code and dataset for "Structure Enhanced Protein-Drug Interaction Prediction using Transformer and Graph Embedding".

Requirements:

Location: ./data/total_modified_enzy_seqs-v2.csv
Format: 1.9G, 4168590 lines
  | Accession_Code | Recommended Name | EC Number | Organism | Source | No of amino acids | Sequence
1 | P15807 | NaN | 1 | Saccharomyces cerevisiae (strain ATCC 204508 / S288c) | Swiss-Prot | 274 | MVKSLQLAHQLKDKKILLIGGGEVGLTRLYKLI...
2 | Q5SRE7 | NaN | 1 | Homo sapiens | Swiss-Prot | 291 | MACLSPSQLQKFQQDGFLVLEGFLSAEECVAMQQRI...
Only the last column (the protein sequences) is needed. ./pre-training/get_sequences.py extracts the sequences and saves them to ./data/corpus-enz416w.big
Format: 3.0G, 4168590 lines
Protein sequences, one per line, with residues separated by spaces:
M V K S L Q L A H Q L K D K K I L L I G G G E V G L T R L Y K L I P T G C K L T L V S...
M A C L S P S Q L Q K F Q Q D G F L V L E G F L S A E E C V A M Q Q R I G E I V A E M D V...
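
The extraction step above might look roughly like the following. This is only a sketch and not the contents of get_sequences.py, which is not shown here. It assumes the CSV is comma-delimited with a header row containing a `Sequence` column; the helper names `to_spaced` and `extract_sequences` are hypothetical.

```python
import csv
import io


def to_spaced(sequence: str) -> str:
    """Turn 'MVKS' into 'M V K S' so that each residue becomes one token."""
    return " ".join(sequence.strip())


def extract_sequences(rows, column="Sequence"):
    """Yield the space-separated sequence of every row that has one."""
    for row in rows:
        seq = (row.get(column) or "").strip()
        if seq:
            yield to_spaced(seq)


# Tiny in-memory stand-in for the real CSV (assumed: comma-delimited, with header).
demo_csv = io.StringIO(
    "Accession_Code,Sequence\n"
    "P15807,MVKSLQ\n"
    "Q5SRE7,MACLSP\n"
)
demo_lines = list(extract_sequences(csv.DictReader(demo_csv)))
print(demo_lines)  # ['M V K S L Q', 'M A C L S P']
```

For the real file, the same generator could be fed from `csv.DictReader(open(path, newline=""))` and written out line by line, which avoids holding all 4168590 sequences in memory.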

The data is split into a training set and a dev set by ./pre-training/train_dev_split.py, which saves them to ./data/corpus-enz416w-train.tsv and ./data/corpus-enz416w-dev.tsv.
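
A split of this kind can be sketched as below. The split ratio, the shuffling and the seed are assumptions, since the behaviour of train_dev_split.py is not described here; `dev_fraction` and `seed` are hypothetical parameters.

```python
import random


def train_dev_split(lines, dev_fraction=0.1, seed=0):
    """Shuffle a copy of `lines` and split it into (train, dev) lists."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(lines)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


corpus = [f"seq{i}" for i in range(20)]
train, dev = train_dev_split(corpus, dev_fraction=0.1)
print(len(train), len(dev))
```

This version loads every line into memory, which may be impractical for a 3.0G corpus; a streaming variant could instead assign each line to train or dev as it is read.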
Next, ./pre-training/generate_vocab.py generates the vocabulary of the protein sequences and saves it to ./pre-training/protein_vocab.txt
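
As an illustration of vocabulary generation from the space-separated corpus, here is a minimal sketch. It is not the actual generate_vocab.py: the BERT-style special tokens, the frequency ordering and the function name `build_vocab` are all assumptions.

```python
from collections import Counter

# Assumed BERT-style special tokens; the real vocabulary file may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(lines, specials=SPECIAL_TOKENS):
    """Return the special tokens followed by every residue token seen in `lines`."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Most frequent first; ties are broken alphabetically for a stable order.
    residues = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return list(specials) + residues


vocab = build_vocab(["M V K S L", "M A C L S"])
print(vocab[len(SPECIAL_TOKENS):])  # ['L', 'M', 'S', 'A', 'C', 'K', 'V']
```

Writing `"\n".join(vocab)` to a text file would give one token per line, a common layout for vocabulary files.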

Fine-tuning phase:

  | PDB-ID | Affinity-Value | seq | rdkit_smiles | set | contact_map
0 | 3zzf | 0.4 | NGFSATRSTV... | CC(=O)NC@@HC(=O)O | train | [array([[ True, True, True, ..., False, False, False], ..., [False, False, False, ..., True, True, True]])]
1 | 11gs | 5.82 | PYTVVY... | CC[C@H](C(=O)...,CC[C@@H]... | train | [array([[ True, True, True, ..., False, False, False], ..., [False, False, False, ..., True, True, True]])]
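
The contact_map column holds a square boolean matrix per protein. How it was computed is not stated here; one common construction, shown purely as an assumption, marks two residues as in contact when their representative atoms lie within a distance cutoff. The 8 Å cutoff, the function name `contact_map` and the toy coordinates below are hypothetical.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return an L x L boolean matrix: True where two residues lie within `cutoff`."""
    return [[math.dist(a, b) <= cutoff for b in coords] for a in coords]


# Four made-up residue positions on a line, 5 Å apart (not real PDB data).
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
for row in toy_map:
    print(row)
```

Each residue is in contact with itself and its immediate neighbours in this toy chain, which produces the same band of True values along the diagonal that is visible in the sample rows above.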

Model architecture

(Figure: model architecture)

Cite:

If you use the code, please cite this paper:

Hu, Fan, et al. "Structure Enhanced Protein-Drug Interaction Prediction using Transformer and Graph Embedding." 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2020.