agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

Generating protein sequences #98

Closed. gitwangxf closed this issue 1 year ago.

gitwangxf commented 1 year ago

Hi! I'm trying to use the encoder and decoder of ProtT5 separately, but I failed to get correct protein sequences when passing only the encoder_outputs to the T5ForConditionalGeneration model as well as to model.generate(). I was wondering if there is a way to generate protein sequences from the encoder outputs only? Thanks for your help!

mheinzinger commented 1 year ago

Hi; hm, I see multiple options:

gitwangxf commented 1 year ago

Thanks for your detailed reply! That'd be very helpful~ I just realized that model.generate() in T5ForConditionalGeneration can accept encoder_outputs in PyTorch but not in TensorFlow, and the protein sequences generated this way are of low quality if there are "B"s in the original input protein sequences.
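
For anyone landing here, a minimal sketch of the PyTorch route described above: run the encoder once, wrap its hidden states, and hand them to model.generate(), which skips re-encoding when encoder_outputs is already provided. The checkpoint name Rostlab/prot_t5_xl_uniref50 and the example sequence are assumptions on my part, and the exact generate() keyword handling can vary across transformers releases:

```python
import re
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model.eval()

seq = "MTEYKLVVVGB"                    # hypothetical example sequence containing a rare "B"
seq = re.sub(r"[UZOB]", "X", seq)      # map rare/ambiguous amino acids to X (see below)
inputs = tokenizer(" ".join(seq), return_tensors="pt")  # ProtT5 expects space-separated residues

with torch.no_grad():
    # Run the encoder on its own.
    enc = model.get_encoder()(input_ids=inputs.input_ids,
                              attention_mask=inputs.attention_mask)
    # generate() skips the encoder pass when encoder_outputs is supplied;
    # it must be a ModelOutput, hence the BaseModelOutput wrapper.
    out = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=enc.last_hidden_state),
        attention_mask=inputs.attention_mask,
        max_length=inputs.input_ids.shape[1] + 1,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Note that ProtT5 was trained as a denoising model rather than an autoregressive sequence generator, so even with this setup the decoded sequences may be of limited quality, consistent with what is reported above.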

mheinzinger commented 1 year ago

Ah, interesting; thanks for sharing. I usually use only the PyTorch version, so I probably never encountered this issue. Re. "B" in the input: this is one of the rare/ambiguous amino acids, which means it is heavily under-represented in our training corpus. We do not expect the model to learn any meaningful representation of this amino acid. In our experiments, we usually map all non-standard tokens to "X" (unknown).
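
In code, that preprocessing is a one-liner; this mirrors the pattern shown in the ProtTrans examples (the helper name is just for illustration):

```python
import re

def clean_sequence(seq: str) -> str:
    """Map the rare/ambiguous amino acids U, Z, O, and B to X (unknown)."""
    return re.sub(r"[UZOB]", "X", seq.upper())
```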

gitwangxf commented 1 year ago

Oh I see, then I suppose it would be the same for U, Z, and O~