This work, accepted at the 2020 IEEE BIBM conference, aims to achieve high precision in protein-drug interaction prediction using a graph attention network (GAT) and a Transformer.
Source code and dataset for "Structure Enhanced Protein-Drug Interaction Prediction using Transformer and Graph Embedding"
To check the installed CUDA version, run `cat /usr/local/cuda/version.txt`.
Location: ./data/total_modified_enzy_seqs-v2.csv

Format: 1.9G, 4168590 lines

| | Accession_Code | Recommended Name | EC Number | Organism | Source | No of amino acids | Sequence |
|---|---|---|---|---|---|---|---|
| 1 | P15807 | NaN | 1 | Saccharomyces cerevisiae (strain ATCC 204508 / S288c) | Swiss-Prot | 274 | MVKSLQLAHQLKDKKILLIGGGEVGLTRLYKLI... |
| 2 | Q5SRE7 | NaN | 1 | Homo sapiens | Swiss-Prot | 291 | MACLSPSQLQKFQQDGFLVLEGFLSAEECVAMQQRI... |

Only the last column (the protein sequences) is needed. Extract the sequences with ./pre-training/get_sequences.py, which saves them to ./data/corpus-enz416w.big

Format: 3.0G, 4168590 lines

| Protein sequences |
|---|
| M V K S L Q L A H Q L K D K K I L L I G G G E V G L T R L Y K L I P T G C K L T L V S... |
| M A C L S P S Q L Q K F Q Q D G F L V L E G F L S A E E C V A M Q Q R I G E I V A E M D V... |

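The extraction step can be sketched roughly as follows. This is not the code in get_sequences.py; it is a minimal illustration that assumes the CSV is comma-delimited with a header row and that the sequence is the last field of each row. The function names `space_separate` and `extract_sequences` are hypothetical.

```python
import csv
import io


def space_separate(sequence):
    """Put one space between residues: 'MVK' -> 'M V K'."""
    return " ".join(sequence.strip())


def extract_sequences(csv_file, out_file):
    """Write the last column of every data row, space-separated, one per line."""
    reader = csv.reader(csv_file)
    next(reader, None)  # assumption: the first row is a header
    count = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        out_file.write(space_separate(row[-1]) + "\n")
        count += 1
    return count


# Tiny in-memory demo; real use would pass open file handles instead.
demo_in = io.StringIO("Accession_Code,Sequence\nP15807,MVKSLQ\nQ5SRE7,MACLSP\n")
demo_out = io.StringIO()
extract_sequences(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Reading row by row keeps memory use flat, which matters for a file with about 4.17 million lines.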
The data is split into a training set and a dev set by ./pre-training/train_dev_split.py and saved to ./data/corpus-enz416w-train.tsv and ./data/corpus-enz416w-dev.tsv.
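One way such a split could work is sketched below. The actual ratio and method used by train_dev_split.py are not stated here, so the 10% dev fraction, the fixed seed, and the per-line random assignment are all assumptions made for illustration, and the function name mirrors the script only for readability.

```python
import io
import random


def train_dev_split(lines, train_out, dev_out, dev_fraction=0.1, seed=0):
    """Stream lines and send each one to the dev or train output at random.

    dev_fraction and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)  # local generator so the split is reproducible
    n_train = n_dev = 0
    for line in lines:
        if rng.random() < dev_fraction:
            dev_out.write(line)
            n_dev += 1
        else:
            train_out.write(line)
            n_train += 1
    return n_train, n_dev


# Tiny in-memory demo with 1000 placeholder lines.
corpus = io.StringIO("".join(f"M V K {i}\n" for i in range(1000)))
train_buf, dev_buf = io.StringIO(), io.StringIO()
counts = train_dev_split(corpus, train_buf, dev_buf)
print(counts)
```

Assigning lines one at a time avoids loading the 3.0G corpus into memory, at the cost of the dev set size being only approximately `dev_fraction` of the total.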
Next, generate the vocabulary of the protein sequences with ./pre-training/generate_vocab.py, which saves it to ./pre-training/protein_vocab.txt.
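A vocabulary for space-separated sequences can be built by counting the distinct tokens, as in the sketch below. The special tokens listed are an assumption borrowed from common BERT-style setups and may not match what generate_vocab.py actually writes; `build_vocab` is a hypothetical name.

```python
from collections import Counter

# Assumption: BERT-style special tokens; the real vocab file may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(lines, special_tokens=SPECIAL_TOKENS):
    """Return special tokens followed by residues, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # tokens are separated by whitespace
    # Sort by descending count, breaking ties alphabetically for stable output.
    residues = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return list(special_tokens) + residues


vocab = build_vocab(["M V K", "M A C"])
print("\n".join(vocab))
```

Writing one token per line, as the `print` call does, is the usual layout for a vocab text file, though the layout of protein_vocab.txt is not confirmed here.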
Location: ./data/pdbbind2016.pkl

| | PDB-ID | Affinity-Value | seq | rdkit_smiles | set | contact_map |
|---|---|---|---|---|---|---|
| 0 | 3zzf | 0.4 | NGFSATRSTV... | CC(=O)NC@@HC(=O)O | train | [array([[ True, True, True, ..., False, False, False], ..., [False, False, False, ..., True, True, True]])] |
| 1 | 11gs | 5.82 | PYTVVY... | CC[C@H](C(=O)...,CC[C@@H]... | train | [array([[ True, True, True, ..., False, False, False], ..., [False, False, False, ..., True, True, True]])] |

If you use the code, please cite this paper:
Hu, Fan, et al. "Structure Enhanced Protein-Drug Interaction Prediction using Transformer and Graph Embedding." 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2020.