agemagician / ProtTrans

ProtTrans provides state-of-the-art pre-trained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

Multi-Mask Inference in ProtBert via HF Pipeline Function #116

Closed gundalav closed 11 months ago

gundalav commented 1 year ago

Hi Michael,

I hope this message finds you well.

I've been experimenting with a slight modification of your original code, which can be found on the Hugging Face Model Hub under "Rostlab/prot_bert". Here is the variant I have been working with:

from transformers import BertForMaskedLM, BertTokenizer, pipeline

# Load the ProtBert tokenizer and masked-language model from the Hugging Face Hub.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")

# Wrap them in a fill-mask pipeline and run it on a sequence containing two masks.
unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)

unmasker('D [MASK] I [MASK] T S')

I took the liberty of introducing multiple masks into the input sequence 'D [MASK] I [MASK] T S'. However, upon running the model, I noticed that the masks were seemingly inferred independently of one another.

The results provided were as follows:

[...
{'score': 0.08988600969314575, 'token': 10, 'token_str': 'S', 'sequence': '[CLS] D S I [MASK] T S [SEP]'},
...
{'score': 0.05809782072901726, 'token': 8, 'token_str': 'V', 'sequence': '[CLS] D [MASK] I V T S [SEP]'}
...]

From the results, it appears that the prediction 'V' for the second [MASK] (in the sequence '[CLS] D [MASK] I V T S [SEP]') did not incorporate the prediction 'S' made for the first [MASK] (in the sequence '[CLS] D S I [MASK] T S [SEP]').

I was wondering whether the inference of each mask is indeed independent of the others. If that is the case, I am keen to learn whether there is an approach that would allow all the masks to be predicted simultaneously, rather than handling them one at a time. Any insights or pointers would be greatly appreciated.

Thank you for your time and consideration. I look forward to your response.

Best regards,

G.V.

mheinzinger commented 1 year ago

Hi Gundalav,

Sorry, somehow your issue slipped through. First of all: I have only little experience with the automated pipelines of HF, so I cannot rule out that there are some undesired side effects attached to them.

From a plain model perspective, ignoring the pipeline, you are right: all [MASK] tokens introduced in a sequence should be inferred simultaneously. So if you feed in a sequence of length 6 (as you did), you should get 6 output logits, one per token (even for those that were not masked). From those logits you can then derive the probability of the most likely token at each position. For non-masked tokens this is usually just a copy of the input (though not always: due to (Prot)BERT's pre-training, there are cases where a token is not replaced by [MASK] but rather by another randomly chosen token, or simply kept as-is), but for masked tokens you get the probability of the most likely amino acid at that position given the non-masked context.
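
For reference, here is a minimal sketch (not taken from the thread) of what this looks like without the pipeline: run the masked sequence through BertForMaskedLM in a single forward pass and read off the per-position probabilities at the [MASK] positions. The model and tokenizer names are the ones used above; the variable names and the reporting loop are illustrative.

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "D [MASK] I [MASK] T S"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    # One forward pass yields logits for every position, masked or not:
    # shape (1, seq_len, vocab_size).
    logits = model(**inputs).logits

probs = torch.softmax(logits[0], dim=-1)

# Report the most likely token at each [MASK] position.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    top_prob, top_id = probs[pos].max(dim=-1)
    print(pos.item(), tokenizer.convert_ids_to_tokens(top_id.item()), round(top_prob.item(), 4))

Note that both [MASK] positions are filled from the same forward pass, so each prediction is made while the other position is still masked; the model does not condition on the amino acid it infers for the other mask.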