Closed nasserhashemi closed 3 years ago
Hi Nasser,
thanks for asking! For fine-tuning on PPB you can use the sequence classification script: https://github.com/PaccMann/paccmann_proteomics/blob/master/scripts/run_seq_clf_script.sh, since we provide sequence pairs and classify them.
I would anyway wait for @mfilipav's opinion, since he worked on the model training/fine-tuning.
Best, Matteo
Hi Nasser!
Sorry for the delayed reply. Indeed, as Matteo pointed out, for protein-protein binding prediction (referred to as PPB in the paper) we used the sequence classification script (see https://github.com/PaccMann/paccmann_proteomics/blob/master/paccmann_proteomics/run_sequence_classification.py), which was launched with the bash script https://github.com/PaccMann/paccmann_proteomics/blob/master/scripts/run_seq_clf_script.sh, specifying TASK_NAME=pairwise-string.
In terms of fine-tuning parameters, the model is the same as in pre-training, but we decreased the learning rate and batch size (try the 4-32 range) and fine-tuned for 3-15 epochs.
I should perhaps clarify that we did not run any "binding site prediction" task per se, but rather protein-protein binding prediction, aka PPB: a binary classification task predicting whether two proteins "bind" as specified in STRING DB (see the STRING DB FAQ for how we defined "binding" interactions: http://version10.string-db.org/help/faq/#i-want-to-differentiate-physical-interactions-from-functional-ones-within-string). The dataset for the PPB task is constructed from STRING DB. In the future we should probably rename TASK_NAME=pairwise-string to TASK_NAME=protein-protein-binding; sorry for the confusion!
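For reference, a launch along these lines might look as follows. This is only a sketch: the flag names are assumptions modeled on the HuggingFace run_glue-style scripts; check scripts/run_seq_clf_script.sh in the repo for the interface actually used.

```shell
# Hypothetical fine-tuning launch; flag names are assumptions, see
# scripts/run_seq_clf_script.sh for the real interface.
TASK_NAME=pairwise-string    # to be renamed protein-protein-binding
LEARNING_RATE=2e-5           # lower than in pre-training
BATCH_SIZE=16                # the 4-32 range is worth trying
EPOCHS=10                    # 3-15 epochs were tried

echo "python paccmann_proteomics/run_sequence_classification.py \
  --task_name $TASK_NAME --learning_rate $LEARNING_RATE \
  --per_device_train_batch_size $BATCH_SIZE --num_train_epochs $EPOCHS"
```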
I hope this clarifies your question!
Cheers, Modestas
Thanks so much, Modestas and Matteo; this is so helpful. Thanks again for sharing your great work!
Thanks a lot for your interest; closing for the time being, feel free to reopen if needed.
Hi there, I hope you are well. I have another question; I would really appreciate your help with it.
I have two sequences and I want to see whether they bind or not. To do this, I want to use your fine-tuned model:
- first, I downloaded the saved fine-tuned model into the "public/models/finetuned_string/string2Seq" directory;
- afterwards, using Hugging Face, I load the model and tokenizer as below:

```python
from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer(
    vocab_file='./string2Seq/checkpoint-166420/vocab.json',
    merges_file='./string2Seq/checkpoint-166420/merges.txt')
model = RobertaModel.from_pretrained(
    pretrained_model_name_or_path='./string2Seq/checkpoint-166420/pytorch_model.bin',
    config='./string2Seq/checkpoint-166420/config.json')
```
Now I am having difficulty finding the right format for the input. I do not know to which format I have to convert my two sequences before tokenizing them and feeding them to the model, nor how to get a score for whether they bind as output.
Thanks in advance
Hi Nasser,
You can simply pass the pair of sequences to the RobertaTokenizer using the dedicated special tokens, like here: https://github.com/huggingface/transformers/blob/34e1bec649112415039f2afe22e38225e88bc453/src/transformers/models/roberta/tokenization_roberta.py#L182 .
An example input, given two dummy proteins, "MKL" and "MKD", would be:
`<s>MKL</s></s>MKD</s>`.
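For illustration, here is a sketch of the pair layout that RoBERTa's `build_inputs_with_special_tokens` (linked above) produces; the helper name is ours, and character-level tokens stand in for the real BPE tokens:

```python
def roberta_pair_tokens(tokens_a, tokens_b):
    """RoBERTa formats a sequence pair as <s> A </s></s> B </s>:
    a cls token, sequence 1, two separators, sequence 2, a separator."""
    return ["<s>"] + tokens_a + ["</s>", "</s>"] + tokens_b + ["</s>"]

# The dummy proteins from above, tokenized per character for illustration:
print(roberta_pair_tokens(list("MKL"), list("MKD")))
# ['<s>', 'M', 'K', 'L', '</s>', '</s>', 'M', 'K', 'D', '</s>']
```

In practice you can simply call `tokenizer(seq_a, seq_b)` with the two sequences as separate arguments, and the tokenizer inserts these special tokens for you.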
Best, Matteo
Thanks so much, Matteo, for your prompt reply. Let's take the dummy example you gave me; when I tokenized it, I get this from the tokenizer:
{'input_ids': [0, 32, 87, 34, 2594, 225, 32, 19, 87, 34, 32, 19, 87, 34, 9415, 225, 32, 19, 87, 34, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
However, I still have difficulties feeding it to the model; could you please let me know how I should get the binding score for those two dummy sequences? Thanks so much again.
Hi Nasser, no problem at all. If I understood the problem, you should pad the sequence to the padding length defined by the model. Modestas can help more with this.
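A pure-Python sketch of what padding does to the tokenizer output (in practice `tokenizer(..., padding='max_length', max_length=N)` handles this; the pad id of 1 assumes the standard RoBERTa special-token ids, which the 0 and 2 in your output suggest):

```python
def pad_to_length(input_ids, attention_mask, max_length, pad_id=1):
    """Right-pad ids with <pad> (id 1 in standard RoBERTa vocabularies)
    and extend the attention mask with zeros so padding is ignored."""
    n_pad = max_length - len(input_ids)
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

ids, mask = pad_to_length([0, 32, 87, 34, 2], [1, 1, 1, 1, 1], 8)
print(ids)   # [0, 32, 87, 34, 2, 1, 1, 1]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```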
For more info, please see this example from the transformers library: https://huggingface.co/transformers/preprocessing.html
Thanks so much again, Matteo, I really appreciate your time. My goal is to use your fine-tuned model as a function that takes two sequences as input and gives me the probability of whether they bind or not:
```python
def binding_from_paccman(seq1, seq2):
    # your model
    return score
```
If I knew the part of your code that you used for testing on the binding set in the Box directory "public/data/fine_tuning/string/test.tsv", that would answer my question. Sorry, I am something of a beginner in the Transformer area, since I only started a few months ago. Thanks so much again.
Hi Nasser, don't be sorry. To create a method like the one you described, you have to refactor a bit the logic we use here: https://github.com/PaccMann/paccmann_proteomics/blob/b376883996641a07da77fbbb6dbd34c2c04fdddb/paccmann_proteomics/data/datasets/seq_clf.py#L95, where we convert the paired sequence data into features to be fed to the model, which is loaded in this way: https://github.com/PaccMann/paccmann_proteomics/blob/b376883996641a07da77fbbb6dbd34c2c04fdddb/paccmann_proteomics/run_sequence_classification.py#L179
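Putting the pieces together, a sketch of such a method could look like the following. Note the assumptions: the checkpoint directory from your earlier message, that the fine-tuned head loads with `RobertaForSequenceClassification` (rather than the bare `RobertaModel`, which has no classification head), and that label index 1 means "binds" (check the label mapping used during fine-tuning):

```python
import math

def logits_to_binding_prob(logits):
    """Numerically stable softmax over the two class logits; the
    probability at index 1 is taken as the 'binds' class (an assumption,
    verify against the label mapping used during fine-tuning)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

def binding_from_paccmann(seq1, seq2, ckpt='./string2Seq/checkpoint-166420'):
    # Imports kept local so the softmax helper above works without torch.
    import torch
    from transformers import RobertaTokenizer, RobertaForSequenceClassification
    tokenizer = RobertaTokenizer.from_pretrained(ckpt)
    # RobertaForSequenceClassification, not RobertaModel, so that the
    # fine-tuned classification head is loaded along with the encoder.
    model = RobertaForSequenceClassification.from_pretrained(ckpt)
    model.eval()
    # Passing the pair as two arguments makes the tokenizer insert the
    # <s> ... </s></s> ... </s> special tokens and padding itself.
    inputs = tokenizer(seq1, seq2, return_tensors='pt',
                       truncation=True, padding='max_length', max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return logits_to_binding_prob(logits)
```

With that in place, `binding_from_paccmann('MKL', 'MKD')` would return a probability in [0, 1].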
Thanks so much Matteo.
Hi, I have a question: in the paper, I could not find which method or algorithm you used for binding site prediction in the fine-tuning section. Could you please give me more info about it?
Thanks so much
Nasser