agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

1D amino acid sequence generation from encoder output #105

Closed bg-uni closed 1 year ago

bg-uni commented 1 year ago

Hello, I have a question.

Is there a way to generate a 1D amino acid sequence from the embedding obtained from last_hidden_state of T5EncoderModel?

Thanks for your help!
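For reference, a minimal sketch of the setup the question assumes: extracting per-residue embeddings from the ProtT5 encoder via `last_hidden_state`, following the preprocessing used in the ProtTrans examples (spaces between residues, rare amino acids mapped to X).

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model.eval()

sequence = "MKTAYIAKQR"
# ProtT5 expects space-separated residues; map rare amino acids (U, Z, O, B) to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch, sequence_length, hidden_size) -- one vector per residue
# (plus the trailing </s> token).
embedding = outputs.last_hidden_state
```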

mheinzinger commented 1 year ago

No, there is no direct way to do so. However, it should be relatively easy to add such a module. I think you should be able to recycle most of the code given in the huggingface examples and replace the BERT/encoder models of the examples with the T5EncoderModel. You might also need to adjust the mask token: T5 only has the span/sentinel tokens, e.g. `<extra_id_0>` etc., not the traditional `[MASK]` token, but you could probably simply recycle one of the span tokens. This is something that we wanted to look into as well but did not find time for; it would be great if you could share whether your experiments are successful if you proceed in this direction :)
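A hypothetical sketch of the module suggested above: a small LM head on top of T5EncoderModel that maps each position's embedding back to a token, analogous to the masked-LM heads in the huggingface examples. The `ProtT5EncoderLMHead` class and its linear head are assumptions, not part of ProtTrans; the head is untrained and would need fine-tuning (e.g. on a denoising objective that recycles one of T5's `<extra_id_*>` sentinel tokens as the mask) before its predictions mean anything.

```python
import torch
import torch.nn as nn
from transformers import T5Tokenizer, T5EncoderModel

class ProtT5EncoderLMHead(nn.Module):
    """Hypothetical encoder + token-prediction head; the head must be trained."""

    def __init__(self, model_name="Rostlab/prot_t5_xl_uniref50"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        hidden = self.encoder.config.d_model
        vocab = self.encoder.config.vocab_size
        # Linear projection from per-residue hidden states to token logits.
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.lm_head(hidden_states)

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = ProtT5EncoderLMHead()
model.eval()

inputs = tokenizer("M K T A Y I A K Q R", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)  # (batch, seq_len, vocab_size)

# Greedy decoding of logits back to tokens; only meaningful after training.
predicted_ids = logits.argmax(dim=-1)
print(tokenizer.batch_decode(predicted_ids, skip_special_tokens=True))
```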