# UniKP

# Demo-Preview

PreKcat
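- [Demo-Preview](#demo-preview)
- [Table of contents](#table-of-contents)
- [Installation](#installation)
- [Usage](#usage)
- [Development](#development)
- [Contribute](#contribute)
    - [Sponsor](#sponsor)
    - [Adding new features or fixing bugs](#adding-new-features-or-fixing-bugs)
- [License](#license)
- [Footer](#footer)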

# Installation
[(Back to top)](#table-of-contents)

Notice:

- You need to install the pretrained protein language model ProtT5-XL-UniRef50, which generates the enzyme representation; the link is provided at ProtT5-XL-U50.
- You need to install the pretrained molecular language model SMILES Transformer, which generates the substrate representation; the link is provided at SMILES Transformer.
- You also need to install the model PreKcat_model, which makes the prediction; the link is provided at PreKcat_model.

Other packages:

- Python v3.6.9 (Anaconda installation recommended)
- PyTorch v1.10.1+cu113
- pandas v1.1.5
- NumPy v1.19.5
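
Before running the usage example, it can help to confirm that the downloaded assets sit where the script expects them. The following is a minimal sketch, assuming the file names that appear in the usage code (`vocab.pkl`, `trfm_12_23000.pkl`, `prot_t5_xl_uniref50`, `PreKcat_new/PreKcat_model.pkl`) are resolved relative to the working directory. `find_missing_assets` is a hypothetical helper, not part of UniKP.

```python
import os

# Paths taken from the usage example; adjust them if your files live elsewhere.
REQUIRED_ASSETS = [
    "vocab.pkl",                                        # vocabulary loaded by WordVocab
    "trfm_12_23000.pkl",                                # weights loaded into TrfmSeq2seq
    "prot_t5_xl_uniref50",                              # ProtT5 model directory
    os.path.join("PreKcat_new", "PreKcat_model.pkl"),  # pickled predictor
]


def find_missing_assets(root=".", required=REQUIRED_ASSETS):
    """Return the required paths that do not exist under `root` (hypothetical helper)."""
    return [p for p in required if not os.path.exists(os.path.join(root, p))]


if __name__ == "__main__":
    missing = find_missing_assets()
    if missing:
        print("Missing assets:", ", ".join(missing))
    else:
        print("All assets found.")
```

Running the check from the directory that holds the usage script lists any path that still needs to be downloaded.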

# Usage
[(Back to top)](#table-of-contents)

The following example predicts the enzyme turnover number from enzyme sequences and substrate structures with UniKP, which builds on pretrained language models:


```python
import gc
import math
import pickle
import re

import numpy as np
import pandas as pd
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Local modules (build_vocab.py, pretrain_trfm.py and utils.py must be importable)
from build_vocab import WordVocab
from pretrain_trfm import TrfmSeq2seq
from utils import split


def smiles_to_vec(Smiles):
    """Encode a list of SMILES strings with the pretrained SMILES Transformer."""
    pad_index = 0
    unk_index = 1
    eos_index = 2
    sos_index = 3
    mask_index = 4
    vocab = WordVocab.load_vocab('vocab.pkl')

    def get_inputs(sm):
        seq_len = 220
        sm = sm.split()
        if len(sm) > 218:
            # Keep the first and last 109 tokens so SOS/EOS still fit in 220 positions
            print('SMILES is too long ({:d})'.format(len(sm)))
            sm = sm[:109] + sm[-109:]
        ids = [vocab.stoi.get(token, unk_index) for token in sm]
        ids = [sos_index] + ids + [eos_index]
        seg = [1] * len(ids)
        padding = [pad_index] * (seq_len - len(ids))
        ids.extend(padding)
        seg.extend(padding)
        return ids, seg

    def get_array(smiles):
        x_id, x_seg = [], []
        for sm in smiles:
            a, b = get_inputs(sm)
            x_id.append(a)
            x_seg.append(b)
        return torch.tensor(x_id), torch.tensor(x_seg)

    trfm = TrfmSeq2seq(len(vocab), 256, len(vocab), 4)
    trfm.load_state_dict(torch.load('trfm_12_23000.pkl'))
    trfm.eval()
    x_split = [split(sm) for sm in Smiles]
    xid, xseg = get_array(x_split)
    X = trfm.encode(torch.t(xid))
    return X


def Seq_to_vec(Sequence):
    """Encode a list of protein sequences with ProtT5 and mean-pool over residues."""
    # The tokenizer expects residues separated by single spaces
    sequences_Example = []
    for i in range(len(Sequence)):
        zj = ''
        for j in range(len(Sequence[i]) - 1):
            zj += Sequence[i][j] + ' '
        zj += Sequence[i][-1]
        sequences_Example.append(zj)
    tokenizer = T5Tokenizer.from_pretrained("prot_t5_xl_uniref50", do_lower_case=False)
    model = T5EncoderModel.from_pretrained("prot_t5_xl_uniref50")
    gc.collect()
    print(torch.cuda.is_available())
    # Use the first GPU when available, otherwise fall back to the CPU
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    model = model.eval()
    features = []
    for i in range(len(sequences_Example)):
        print('For sequence ', str(i+1))
        sequences_Example_i = sequences_Example[i]
        # Replace the residue codes U, Z, O and B with X
        sequences_Example_i = [re.sub(r"[UZOB]", "X", sequences_Example_i)]
        ids = tokenizer.batch_encode_plus(sequences_Example_i, add_special_tokens=True, padding=True)
        input_ids = torch.tensor(ids['input_ids']).to(device)
        attention_mask = torch.tensor(ids['attention_mask']).to(device)
        with torch.no_grad():
            embedding = model(input_ids=input_ids, attention_mask=attention_mask)
        embedding = embedding.last_hidden_state.cpu().numpy()
        for seq_num in range(len(embedding)):
            seq_len = (attention_mask[seq_num] == 1).sum()
            # Drop the final (end-of-sequence) position
            seq_emd = embedding[seq_num][:seq_len - 1]
            features.append(seq_emd)
    # Average the per-residue embeddings into one vector per sequence
    features_normalize = np.zeros([len(features), len(features[0][0])], dtype=float)
    for i in range(len(features)):
        for k in range(len(features[0][0])):
            for j in range(len(features[i])):
                features_normalize[i][k] += features[i][j][k]
            features_normalize[i][k] /= len(features[i])
    return features_normalize


if __name__ == '__main__':
    # One enzyme sequence, written as adjacent string literals that Python joins
    sequences = ['MEDIPDTSRPPLKYVKGIPLIKYFAEALESLQDFQAQPDDLLISTYPKSGTTWVSEILDMIYQDGDVEKCRRAPVFIRVPFLEFKA'
                 'PGIPTGLEVLKDTPAPRLIKTHLPLALLPQTLLDQKVKVVYVARNAKDVAVSYYHFYRMAKVHPDPDTWDSFLEKFMAGEVSYGSW'
                 'YQHVQEWWELSHTHPVLYLFYEDMKENPKREIQKILKFVGRSLPEETVDLIVQHTSFKEMKNNSMANYTTLSPDIMDHSISAFMRK'
                 'GISGDWKTTFTVAQNERFDADYAKKMEGCGLSFRTQL']
    # SMILES kept exactly as given in the source; bracket atoms such as [C@@H]
    # appear to have been lost, so verify the string before use.
    Smiles = ['OC1=CC=C(CC@@HN)C=C1']
    seq_vec = Seq_to_vec(sequences)
    smiles_vec = smiles_to_vec(Smiles)
    # Substrate and enzyme representations are concatenated feature-wise
    fused_vector = np.concatenate((smiles_vec, seq_vec), axis=1)
    with open('PreKcat_new/PreKcat_model.pkl', "rb") as f:
        model = pickle.load(f)
    Pre_label = model.predict(fused_vector)
    # 10 ** prediction; computed here but not written to the output file
    Pre_label_pow = [math.pow(10, Pre_label[i]) for i in range(len(Pre_label))]
    print(len(Pre_label))
    res = pd.DataFrame({'sequences': sequences, 'Smiles': Smiles, 'Pre_label': Pre_label})
    # Writing .xlsx needs an Excel writer engine such as openpyxl
    res.to_excel('PreKcat_predicted_label.xlsx')
```
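
The `get_inputs` step inside `smiles_to_vec` turns a tokenized SMILES string into a fixed-length list of 220 IDs: at most 218 tokens (the first and last 109 when the input is longer), wrapped in start and end markers and padded with zeros. Below is a minimal sketch of just that logic, assuming a toy dictionary `toy_stoi` in place of the real `WordVocab`; the function name `encode_tokens` is hypothetical.

```python
PAD, UNK, EOS, SOS = 0, 1, 2, 3   # special-token indices used in the usage code
SEQ_LEN = 220                     # fixed input length, including SOS and EOS


def encode_tokens(tokens, stoi, seq_len=SEQ_LEN):
    """Map tokens to IDs, truncate long inputs, add SOS/EOS, pad to seq_len."""
    max_tokens = seq_len - 2                      # room left after SOS and EOS
    if len(tokens) > max_tokens:
        half = max_tokens // 2
        tokens = tokens[:half] + tokens[-half:]   # keep both ends, drop the middle
    ids = [SOS] + [stoi.get(t, UNK) for t in tokens] + [EOS]
    seg = [1] * len(ids)                          # 1 marks real (non-padding) positions
    pad = [PAD] * (seq_len - len(ids))
    return ids + pad, seg + pad


# Toy vocabulary for illustration only; the real one comes from vocab.pkl.
toy_stoi = {"C": 5, "O": 6, "=": 7}
ids, seg = encode_tokens(["C", "C", "O", "N"], toy_stoi)
print(ids[:8])   # [3, 5, 5, 6, 1, 2, 0, 0]
print(sum(seg))  # 6
```

The token `N` is missing from the toy vocabulary, so it maps to the unknown index 1, which mirrors the `vocab.stoi.get(token, unk_index)` lookup in the usage code.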

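
`Seq_to_vec` prepares each protein sequence for the ProtT5 tokenizer by separating residues with spaces and replacing the residue codes U, Z, O and B with X. A compact sketch of that preprocessing, using a hypothetical helper name `prepare_sequence`:

```python
import re


def prepare_sequence(sequence):
    """Space-separate residues and map U, Z, O, B to X, as in the usage code."""
    spaced = " ".join(sequence)            # "MKU" -> "M K U"
    return re.sub(r"[UZOB]", "X", spaced)


print(prepare_sequence("MEDUB"))  # M E D X X
```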

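
The nested loops at the end of `Seq_to_vec` average the per-residue embeddings of each sequence into one fixed-size vector (mean pooling). As a sketch, the same values can be obtained with NumPy's `mean` along the residue axis. `mean_pool` is a hypothetical name, and the random arrays below merely stand in for real ProtT5 embeddings.

```python
import numpy as np


def mean_pool(features):
    """Average each (length_i, dim) array over its residues -> (n_sequences, dim)."""
    return np.stack([np.asarray(f, dtype=float).mean(axis=0) for f in features])


# Stand-in embeddings: two "sequences" of different length, 4 dimensions each.
rng = np.random.default_rng(0)
features = [rng.normal(size=(5, 4)), rng.normal(size=(9, 4))]

pooled = mean_pool(features)
print(pooled.shape)  # (2, 4)
```

Because every sequence collapses to a vector of the same width, sequences of different lengths can be stacked into a single matrix and concatenated with the substrate representation.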

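
The usage script computes `Pre_label_pow` as 10 raised to each prediction, which suggests that the model outputs log10-scaled values. That reading is an assumption drawn from the code, not something this README states. Under that assumption, a vectorized conversion could look like the sketch below, where `to_linear_scale` is a hypothetical name:

```python
import numpy as np


def to_linear_scale(log10_predictions):
    """Convert predictions to linear scale, assuming they are log10-scaled."""
    return np.power(10.0, np.asarray(log10_predictions, dtype=float))


# Example with made-up log10 values, not real model output.
print(to_linear_scale([0.0, 1.0, 2.0]))
```

Note that the usage script writes the raw `Pre_label` values to the spreadsheet, so the converted values would need to be added to the `DataFrame` explicitly if they are wanted in the output file.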

# Development
[(Back to top)](#table-of-contents)


# Contribute
[(Back to top)](#table-of-contents)


### Sponsor
[(Back to top)](#table-of-contents)


### Adding new features or fixing bugs
[(Back to top)](#table-of-contents)


# License
[(Back to top)](#table-of-contents)


[GNU General Public License version 3](https://opensource.org/licenses/GPL-3.0)

# Footer
[(Back to top)](#table-of-contents)


HanselYu / UniKP
