agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0
1.13k stars 153 forks

Generating embeddings in T5 #96

Closed BrahamVictor closed 2 years ago

BrahamVictor commented 2 years ago

Thanks for this magnificent work! Just wondering: are there any solutions for generating sequence embeddings in real time? I found that the pretrained model takes a lot of memory and time for long protein sequences (length longer than ~6000).
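For context, the ProtTrans examples feed sequences to the Hugging Face `transformers` tokenizer with residues whitespace-separated and rare amino acids (U, Z, O, B) mapped to X. A minimal sketch of that preprocessing step is below; the actual model call is shown only in comments, since loading the multi-gigabyte ProtT5 checkpoint is exactly the cost being discussed here (the checkpoint name is taken from the ProtTrans model hub; treat the rest as an illustrative assumption):

```python
import re

def preprocess(seq: str) -> str:
    """Prepare a raw amino-acid sequence for the ProtT5 tokenizer:
    map rare residues (U, Z, O, B) to X and insert spaces between residues."""
    seq = re.sub(r"[UZOB]", "X", seq.upper())
    return " ".join(seq)

# Hypothetical usage with the actual encoder (not executed here):
#   from transformers import T5Tokenizer, T5EncoderModel
#   tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")
#   model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")
#   ids = tokenizer(preprocess("MKTAYIAKQR"), return_tensors="pt")
#   embeddings = model(**ids).last_hidden_state  # one vector per residue

print(preprocess("MKTUayiB"))  # → "M K T X A Y I X"
```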

mheinzinger commented 2 years ago

While we would like to provide such an online embedding service, it unfortunately requires resources we currently cannot commit, sorry. We think running our pLMs is already reasonably fast for most tasks, but we understand that some problems benefit from even more speed. For those, you could try approaches that make existing models faster during inference/embedding generation; an overview is given, for example, here:

https://nlpcloud.com/how-to-speed-up-deep-learning-nlp-transformers-inference.html
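Besides the generic speed-ups covered there (half precision, ONNX export, quantization, batching), one ad-hoc workaround for the memory blow-up on very long sequences is to embed overlapping windows and stitch the per-residue embeddings back together, averaging positions covered by more than one window. This is a sketch of the general technique, not an official ProtTrans recipe; the `embed` callable stands in for any per-residue embedder:

```python
from typing import Callable, List

def embed_chunked(seq: str,
                  embed: Callable[[str], List[List[float]]],
                  window: int = 1024,
                  overlap: int = 128) -> List[List[float]]:
    """Per-residue embeddings for a long sequence, computed window by window.
    `embed` maps a subsequence to one vector per residue; positions covered
    by several windows are averaged so the windows blend smoothly."""
    assert 0 <= overlap < window
    n = len(seq)
    if n <= window:
        return embed(seq)
    step = window - overlap
    sums: List[List[float]] = [None] * n  # type: ignore[list-item]
    counts = [0] * n
    start = 0
    while start < n:
        for i, vec in enumerate(embed(seq[start:start + window])):
            pos = start + i
            if sums[pos] is None:
                sums[pos] = list(vec)
            else:
                sums[pos] = [a + b for a, b in zip(sums[pos], vec)]
            counts[pos] += 1
        if start + window >= n:
            break
        start += step
    return [[x / c for x in vec] for vec, c in zip(sums, counts)]
```

Capping the window length bounds the quadratic attention cost, at the price of losing context that spans window boundaries, which the overlap only partially recovers.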

Alternatively, you can also download pre-computed ProtT5 embeddings of selected organisms: https://www.uniprot.org/help/embeddings

Or use our web-service to generate predictions: https://embed.predictprotein.org/

BrahamVictor commented 2 years ago

Thank you for your reply and the provided resources. They will help me a lot.