agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

Sequence-level embeddings using ProtTrans models #99

Closed: josephyu12 closed this issue 1 year ago

josephyu12 commented 1 year ago

Hi,

So far, all the embeddings I've seen from ProtTrans are residue-level (one for each amino acid). For reference, I am looking at the ipynb examples in the "ProtTrans/Embeddings" folder. Do the models have an option to output a sequence-level embedding, i.e. only one embedding for the entire sequence? Thanks everyone for your help!

mheinzinger commented 1 year ago

Hi,

there is no direct option for this, but we have had surprisingly positive experiences on various sequence-level tasks, such as subcellular localization, by simply averaging over the length dimension. For example, if you have an embedding of shape L x 1024, you simply compute the mean over the L-dimension to end up with a single vector of shape 1024, irrespective of the protein's length.
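
In case it helps, here is a minimal sketch of that mean-pooling approach. It assumes the ProtT5-XL-U50 encoder checkpoint on Hugging Face (`Rostlab/prot_t5_xl_half_uniref50-enc`) and the usual ProtTrans preprocessing (space-separated residues, rare amino acids mapped to X); the helper name `embed_sequence_mean` is just illustrative:

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed checkpoint; any ProtT5 encoder checkpoint should work the same way.
model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).to(device).eval()

def embed_sequence_mean(sequence: str) -> torch.Tensor:
    """Return a single 1024-d vector for one protein by mean-pooling residue embeddings."""
    # ProtT5 expects space-separated residues; map rare amino acids to X.
    seq = " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))
    inputs = tokenizer(seq, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"])
    residue_emb = out.last_hidden_state[0]      # shape: (L + 1, 1024), incl. trailing </s>
    residue_emb = residue_emb[: len(sequence)]  # keep only the L residue positions
    return residue_emb.mean(dim=0)              # shape: (1024,)

protein_vector = embed_sequence_mean("MSEQWENCE")
print(protein_vector.shape)  # torch.Size([1024])
```

The resulting fixed-size vector can then be fed directly into a simple downstream classifier or regressor for sequence-level tasks.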

josephyu12 commented 1 year ago

Ok, I will try averaging! I'm also wondering, though: could the embedding for the first token, "[CLS]", be used to represent the entire sequence?

mheinzinger commented 1 year ago

Sure, that's also something you can try. In particular, we did not experiment much with ProtT5's special token that gets appended to the end of each sequence (</s>, I think). Definitely something worth comparing.
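
If you want to try the special-token route, here is a minimal sketch using ProtBert (`Rostlab/prot_bert`), which prepends a [CLS] token to every sequence; the helper name `embed_sequence_cls` is illustrative, and whether this beats mean-pooling is exactly what is worth comparing:

```python
import re
import torch
from transformers import BertModel, BertTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed checkpoint; ProtBert prepends [CLS] and appends [SEP] to every sequence.
model_name = "Rostlab/prot_bert"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name).to(device).eval()

def embed_sequence_cls(sequence: str) -> torch.Tensor:
    """Return the [CLS] token embedding as a single per-protein vector."""
    seq = " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))
    inputs = tokenizer(seq, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**inputs)
    # Position 0 of the hidden states is the [CLS] token.
    return out.last_hidden_state[0, 0]  # shape: (1024,)

# For ProtT5 the analogue would be the trailing </s> token, i.e. the embedding
# at position len(sequence) in the encoder output (ProtT5 has no [CLS] token).
```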