PaccMann / paccmann_proteomics

PaccMann models for protein language modeling
MIT License

What method did you use for fine-tuning the PPB? #3

Closed nasserhashemi closed 3 years ago

nasserhashemi commented 3 years ago

Hi, I have a question: in the paper I could not find which method or algorithm you used for binding site prediction in the fine-tuning section. Could you please give me more information about it?

Thanks so much

Nasser

drugilsberg commented 3 years ago

Hi Nasser,

thanks for asking. For fine-tuning on the PPB task you can use the sequence classification script, https://github.com/PaccMann/paccmann_proteomics/blob/master/scripts/run_seq_clf_script.sh, since we provide sequence pairs and classify them.

I would anyway wait for @mfilipav's opinion, since he worked on the model training/fine-tuning.

Best, Matteo

mfilipav commented 3 years ago

Hi Nasser!

Sorry for the delayed reply. Indeed, as Matteo pointed out, for protein-protein binding prediction (referred to as PPB in the paper) we used the sequence classification script (see https://github.com/PaccMann/paccmann_proteomics/blob/master/paccmann_proteomics/run_sequence_classification.py), which was launched with the bash script https://github.com/PaccMann/paccmann_proteomics/blob/master/scripts/run_seq_clf_script.sh, specifying TASK_NAME=pairwise-string.

In terms of fine-tuning parameters, the model is the same as in pre-training, but of course we decreased the learning rate and batch size (try the 4-32 range) and fine-tuned for 3-15 epochs.
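For concreteness, here is a minimal sketch of such a configuration with the Hugging Face TrainingArguments API; the specific values are illustrative picks within the ranges above, not the exact settings used in the paper:

```python
from transformers import TrainingArguments

# Illustrative values only: learning rate lower than in pre-training,
# batch size in the suggested 4-32 range, epochs in the 3-15 range.
training_args = TrainingArguments(
    output_dir='./ppb_finetuning',
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
)
```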

I should perhaps clarify that we did not run any "binding site prediction" task per se, but rather protein-protein binding prediction, aka PPB: a binary classification task to predict whether two proteins "bind" as specified in STRING DB (see the FAQ section of STRING DB for how we defined "binding" interactions: http://version10.string-db.org/help/faq/#i-want-to-differentiate-physical-interactions-from-functional-ones-within-string). The dataset for the PPB task is constructed from STRING DB. In the future, we should probably rename TASK_NAME=pairwise-string to TASK_NAME=protein-protein-binding; sorry for the confusion!

I hope this answers your question!

Cheers, Modestas

nasserhashemi commented 3 years ago

Thanks so much, Modestas and Matteo; this is very helpful. Thanks again for sharing your great work!

drugilsberg commented 3 years ago

Thanks a lot for your interest. Closing for the time being; feel free to reopen if needed.

nasserhashemi commented 3 years ago

Hi there; I hope you are well. I have another question, and I would really appreciate your help with it.

I have two sequences and I want to see whether they bind or not. For this, I want to use your fine-tuned model. To do that:

  1. first I downloaded the saved fine-tuned model in the "public/models/finetuned_string/string2Seq" directory;
  2. afterwards, using Hugging Face, I load the model and tokenizer as below:

    from transformers import RobertaTokenizer, RobertaModel
    import torch

    tokenizer = RobertaTokenizer(
        vocab_file='./string2Seq/checkpoint-166420/vocab.json',
        merges_file='./string2Seq/checkpoint-166420/merges.txt',
    )
    model = RobertaModel.from_pretrained(
        pretrained_model_name_or_path='./string2Seq/checkpoint-166420/pytorch_model.bin',
        config='./string2Seq/checkpoint-166420/config.json',
    )

Now I have difficulty finding the right format for the input. I do not know into which format I have to convert my two sequences to tokenize them and feed them to the model, nor how to get a score for whether they bind or not as output.

Thanks in advance

drugilsberg commented 3 years ago

Hi Nasser,

You can simply pass the pair of sequences to the RobertaTokenizer using the dedicated special tokens, as here: https://github.com/huggingface/transformers/blob/34e1bec649112415039f2afe22e38225e88bc453/src/transformers/models/roberta/tokenization_roberta.py#L182 .

An example input, given two dummy proteins "MKL" and "MKD", would be: `<s>MKL</s></s>MKD</s>`.
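In code, you can let the tokenizer insert those special tokens by passing the two sequences as a pair; a minimal sketch using the standard transformers pair-encoding API (the checkpoint paths are taken from Nasser's snippet above):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer(
    vocab_file='./string2Seq/checkpoint-166420/vocab.json',
    merges_file='./string2Seq/checkpoint-166420/merges.txt',
)

# Passing two sequences makes the tokenizer build
# <s> seq1 </s></s> seq2 </s> automatically.
encoding = tokenizer('MKL', 'MKD', return_tensors='pt')
print(encoding['input_ids'])
```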

Best, Matteo


nasserhashemi commented 3 years ago

Thanks so much, Matteo, for your prompt reply. Let's take the dummy example you gave me: when I tokenize it, I get this from the tokenizer:

{'input_ids': [0, 32, 87, 34, 2594, 225, 32, 19, 87, 34, 32, 19, 87, 34, 9415, 225, 32, 19, 87, 34, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

However, I still have difficulty feeding it to the model. Could you please let me know how I should get the binding score between those two dummy sequences? Thanks so much again.

drugilsberg commented 3 years ago

Hi Nasser, no problem at all. If I understood the problem correctly, you should pad the sequences to the padding length defined by the model. Modestas can help more with this.
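For illustration, padding can be delegated to the tokenizer itself; a sketch using standard transformers arguments (the 512-token maximum is an assumption typical for RoBERTa checkpoints, so adjust it to this model's config):

```python
# Pad (and truncate) the pair to a fixed length so the input matches
# what the fine-tuned model expects.
encoding = tokenizer(
    'MKL', 'MKD',
    padding='max_length',
    truncation=True,
    max_length=512,
    return_tensors='pt',
)
```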


drugilsberg commented 3 years ago

For more info, please see this example from the transformers library: https://huggingface.co/transformers/preprocessing.html


nasserhashemi commented 3 years ago

Thanks so much again, Matteo; I really appreciate your time. What I am trying to do is use your fine-tuned model as a function that takes two sequences as input and gives me the probability that they bind:

    def binding_from_paccman(seq1, seq2):
        # your model
        return score

If I knew which part of your code you used for testing on the binding set in the Box directory "public/data/fine_tuning/string/test.tsv", that would answer my question. Sorry, I am somewhat of a beginner in the Transformer area, since I only started a few months ago. Thanks so much again.

drugilsberg commented 3 years ago

Hi Nasser, don't be sorry. To create a method like the one you described, you have to refactor a bit the logic we use here: https://github.com/PaccMann/paccmann_proteomics/blob/b376883996641a07da77fbbb6dbd34c2c04fdddb/paccmann_proteomics/data/datasets/seq_clf.py#L95, where we convert the paired sequence data into features to be fed to the model, which is loaded in this way: https://github.com/PaccMann/paccmann_proteomics/blob/b376883996641a07da77fbbb6dbd34c2c04fdddb/paccmann_proteomics/run_sequence_classification.py#L179
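Putting the pieces of this thread together, a minimal sketch of the function Nasser asked for might look as follows. Note the assumptions: the checkpoint is loaded with RobertaForSequenceClassification (a classification head, not the bare RobertaModel), the directory layout is the one from Nasser's snippet, and label index 1 is taken to be the "binding" class; none of these is confirmed by the repository.

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer(
    vocab_file='./string2Seq/checkpoint-166420/vocab.json',
    merges_file='./string2Seq/checkpoint-166420/merges.txt',
)
# from_pretrained() accepts a checkpoint directory containing
# pytorch_model.bin and config.json.
model = RobertaForSequenceClassification.from_pretrained(
    './string2Seq/checkpoint-166420'
)
model.eval()

def binding_from_paccman(seq1: str, seq2: str) -> float:
    """Return the predicted probability that seq1 and seq2 bind."""
    inputs = tokenizer(seq1, seq2, truncation=True, max_length=512,
                       return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    return probs[0, 1].item()  # assumes index 1 == "binding"

print(binding_from_paccman('MKL', 'MKD'))
```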

nasserhashemi commented 3 years ago

Thanks so much Matteo.