agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

Generating protein sequences #98

Closed. gitwangxf closed this issue 1 year ago.

gitwangxf commented 1 year ago

Hi! I'm trying to use the encoder and decoder of ProtT5 separately, but I failed to get correct protein sequences when passing only the encoder_outputs to the T5ForConditionalGeneration model as well as to model.generate(). I was wondering if there is a way to generate protein sequences from the encoder outputs only? Thanks for your help!

mheinzinger commented 1 year ago

Hi; hm, I see multiple options:

gitwangxf commented 1 year ago

Thanks for your detailed reply! That'd be very helpful~ I just realized that model.generate() in T5ForConditionalGeneration can accept encoder_outputs in PyTorch but not in TensorFlow, and the protein sequences generated this way are of low quality if there are "B"s in the original input protein sequences.
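
For anyone landing here, a minimal sketch of the PyTorch route described above: run the encoder once, wrap its hidden states, and hand them to model.generate(), which skips re-encoding when encoder_outputs is already provided. The checkpoint name Rostlab/prot_t5_xl_uniref50 and the example sequence are assumptions on my part, and the exact generate() keyword handling can vary across transformers releases:

```python
import re
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model.eval()

seq = "MTEYKLVVVGB"                    # hypothetical example sequence containing a rare "B"
seq = re.sub(r"[UZOB]", "X", seq)      # map rare/ambiguous amino acids to X (see below)
inputs = tokenizer(" ".join(seq), return_tensors="pt")  # ProtT5 expects space-separated residues

with torch.no_grad():
    # Run the encoder on its own.
    enc = model.get_encoder()(input_ids=inputs.input_ids,
                              attention_mask=inputs.attention_mask)
    # generate() skips the encoder pass when encoder_outputs is supplied;
    # it must be a ModelOutput, hence the BaseModelOutput wrapper.
    out = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=enc.last_hidden_state),
        attention_mask=inputs.attention_mask,
        max_length=inputs.input_ids.shape[1] + 1,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Note that ProtT5 was trained as a denoising model rather than an autoregressive sequence generator, so even with this setup the decoded sequences may be of limited quality, consistent with what is reported above.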

mheinzinger commented 1 year ago

Ah, interesting; thanks for sharing. I usually use only the PyTorch version, so I probably never encountered this issue. Re. "B" in the input: this is one of the rare/ambiguous amino acids, which means it is heavily under-represented in our training corpus. We do not expect the model to learn any meaningful representation of this amino acid. In our experiments, we usually map all non-standard tokens to "X" (unknown).
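
In code, that preprocessing is a one-liner; this mirrors the pattern shown in the ProtTrans examples (the helper name is just for illustration):

```python
import re

def clean_sequence(seq: str) -> str:
    """Map the rare/ambiguous amino acids U, Z, O, and B to X (unknown)."""
    return re.sub(r"[UZOB]", "X", seq.upper())
```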

gitwangxf commented 1 year ago

Oh I see, then I suppose it would be the same for U, Z, and O~