agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

Embedding to sequence? #72

Closed fcharih closed 2 years ago

fcharih commented 2 years ago

Hello,

I apologize if this has been answered before. I was wondering whether it is possible to use your model to convert an embedding obtained with the encoder back into an amino acid sequence? If not, I assume it would be necessary to train a decoder separately to achieve this.

Cheers, Francois

mheinzinger commented 2 years ago

Hello Francois,

the problem is that there is no such thing as a universal decoder that can map any protein embedding back to its original sequence; such a decoder is always model/encoder-specific.

Still, depending on your task, you might want to check our ProtT5, which is an encoder-decoder model: the encoder first maps your protein sequence to embedding space, and the decoder then reconstructs the original amino acid sequence from the encoder's embedding.

Besides this, if you already have some embeddings and would like to map them back to an amino acid sequence: most of our language models are trained via masked language modeling (e.g. ProtBERT). This means you can use the last classification layer of those models to map from embedding space back to amino acid space (the models themselves already contain a classification layer that maps from embeddings to amino acids).
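
A minimal sketch of that last point, assuming the public `Rostlab/prot_bert` checkpoint on the Hugging Face hub (ProtBERT expects space-separated amino acids): run the encoder to get per-residue embeddings, pass them through the model's masked-LM head, and take the argmax over the vocabulary to recover amino acid tokens.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Assumes the public ProtBERT checkpoint on the Hugging Face hub.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
model.eval()

# ProtBERT expects amino acids separated by spaces.
sequence = "M K T A Y I A K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    # Per-residue embeddings from the encoder (last hidden states).
    embeddings = model.bert(**inputs).last_hidden_state
    # The masked-LM classification head maps embeddings to amino-acid logits.
    logits = model.cls(embeddings)

# Argmax over the vocabulary recovers amino-acid tokens (plus special tokens).
predicted_ids = logits.argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids[0].tolist()))
```

Note that this only works for embeddings produced by the same ProtBERT encoder; as said above, it will not reconstruct sequences from embeddings of a different model.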

fcharih commented 2 years ago

Thanks for the insight. Very much appreciated!