agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. The models were trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer architectures.
Academic Free License v3.0

Availability of Seq2Seq Autoencoder Models for Protein Sequences #122

Closed: khokao closed this issue 1 year ago

khokao commented 1 year ago

Hi, thanks for your great work!

Are there any models provided in this repository that can be used as a sequence-to-sequence autoencoder?

More specifically, my interest lies in extracting features from protein sequences and then reconstructing the protein sequences from those extracted features.

mheinzinger commented 1 year ago

We have no model that does this out of the box. What you could do is: extract features using, e.g., the encoder side of ProtT5; get a compressed version of them (either average pooling over the length dimension or a learnt auto-encoder compression, e.g., via attention pooling); and then reconstruct the original sequence from that representation using a decoder model (maybe recycle the existing ProtGPT2: https://huggingface.co/nferruz/ProtGPT2). A rough sketch of this pipeline is below.
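For concreteness, here is a minimal sketch in PyTorch/transformers. The encoder extraction and mean pooling follow the usual ProtTrans usage; everything from `bridge` onward is hypothetical and untrained (a made-up linear projection used as a soft prompt for ProtGPT2), so treat it as a starting point for your own training, not a working autoencoder:

```python
import re
import torch
import torch.nn as nn
from transformers import T5Tokenizer, T5EncoderModel, GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Step 1: per-residue features from the ProtT5 encoder ---
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False
)
encoder = T5EncoderModel.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc"
).to(device).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy example
# ProtT5 expects space-separated residues, with rare amino acids mapped to X
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
ids = tokenizer(prepared, return_tensors="pt").to(device)

with torch.no_grad():
    residue_emb = encoder(**ids).last_hidden_state  # (1, L+1, 1024)

# --- Step 2: compress to a fixed-size representation ---
# Keep only the L residue positions (drops the trailing </s> token),
# then average-pool over the length dimension.
per_protein = residue_emb[0, : len(sequence)].mean(dim=0)  # (1024,)

# --- Step 3 (hypothetical): condition a decoder on the pooled vector ---
decoder = GPT2LMHeadModel.from_pretrained("nferruz/ProtGPT2").to(device)

# "bridge" is an untrained, made-up module: it projects the 1024-d ProtT5
# vector into ProtGPT2's embedding space so it can act as a single
# soft-prompt token. You would need to train it (and likely fine-tune the
# decoder) with a reconstruction loss before outputs resemble the input.
bridge = nn.Linear(per_protein.shape[-1], decoder.config.n_embd).to(device)
prefix = bridge(per_protein.float()).view(1, 1, -1)  # (1, 1, n_embd)

# Training sketch (not run here): embed the target tokens, prepend the
# prefix, and compute a language-modeling loss on the sequence tokens:
#   tok_emb = decoder.transformer.wte(target_ids)   # (1, T, n_embd)
#   inputs = torch.cat([prefix, tok_emb], dim=1)    # (1, T+1, n_embd)
#   loss = decoder(inputs_embeds=inputs, labels=...).loss
```

Replacing the mean pooling with a learnt attention-pooling module, and training the bridge plus decoder end-to-end on reconstruction, would give you the auto-encoder behaviour you are after.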

khokao commented 1 year ago

I see. Thank you very much for the swift reply!