agemagician/ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

Can ProtT5 be used to generate embeddings for sequence fragments rather than full sequences? #101

Closed · zhuzihan728 closed this issue 1 year ago

zhuzihan728 commented 1 year ago

Will the resulting embeddings make sense if the model only sees fragments of a protein sequence instead of the whole sequence?

mheinzinger commented 1 year ago

Short answer: yes, that's technically possible. Long answer: the quality of the embedding will depend heavily on the length of your fragment. For example, some PDB sequences are, length-wise, closer to fragments than to full proteins, but those are usually still around 50 residues long. For fragments of only 5 residues, I am skeptical how much information you would still get from the embeddings.

But I guess it would be best to just try it, as the numbers I give above (5 and 50) are only examples and we have no precise idea how small a fragment can be while still giving reasonable embeddings. If you get some results on this, it would be great if you could share them here at some point :)
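
For reference, here is a minimal sketch of how one might try this: a fragment is embedded exactly like a full sequence, just with fewer residues. It assumes the half-precision ProtT5 encoder checkpoint (Rostlab/prot_t5_xl_half_uniref50-enc) and the Hugging Face transformers API; the fragment itself is a hypothetical example.

```python
# Sketch: per-residue and per-fragment ProtT5 embeddings for a short fragment.
# Assumes the Rostlab/prot_t5_xl_half_uniref50-enc encoder checkpoint.
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False
)
model = T5EncoderModel.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc"
).to(device).eval()

fragment = "MKTAYIAKQR"  # hypothetical 10-residue fragment

# Preprocess as for full sequences: map rare/ambiguous residues to X
# and insert a space between residues.
prepared = " ".join(list(re.sub(r"[UZOB]", "X", fragment)))

ids = tokenizer(prepared, add_special_tokens=True, return_tensors="pt").to(device)

with torch.no_grad():
    out = model(input_ids=ids.input_ids, attention_mask=ids.attention_mask)

# Per-residue embeddings: drop the trailing </s> special token.
per_residue = out.last_hidden_state[0, : len(fragment)]  # shape (10, 1024)

# Per-fragment embedding: mean-pool over residues.
per_fragment = per_residue.mean(dim=0)  # shape (1024,)

print(per_residue.shape, per_fragment.shape)
```

Mean-pooling over the residue embeddings gives a single fixed-size vector for the fragment; whether that vector is still informative for very short fragments (e.g. around 5 residues) is exactly the open question discussed above.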