Closed gitwangxf closed 1 year ago
Hi; hm, I see multiple options:
Thanks for your detailed reply! That'd be very helpful~ I just realized that model.generate() in T5ForConditionalGeneration accepts "encoder_outputs" in PT but not in TF. Also, the protein sequences generated this way are of low quality when the original input sequences contain "B"s.
Ah, interesting; thanks for sharing. I usually use only the PT output so I probably never encountered this issue. Re. "B" in the input: this is one of the rare/ambiguous amino acids which means that it is heavily under-represented in our training corpus. We do not expect the model to learn any meaningful representation of this amino acid. In our experiments, we usually map all non-standard tokens to "X" (unknown).
Oh I see, then I suppose it would be the same for U, Z, and O~
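The preprocessing described above (mapping the rare/ambiguous amino acids B, U, Z, and O to "X", plus the space-separated residue format that ProtT5 tokenizers expect) can be sketched as a small helper; the function name is illustrative:

```python
import re

def preprocess_sequence(seq: str) -> str:
    """Map rare/ambiguous amino acids (B, U, Z, O) to 'X' and
    space-separate the residues for the ProtT5 tokenizer."""
    seq = re.sub(r"[BUZO]", "X", seq.upper())
    return " ".join(seq)

print(preprocess_sequence("MKBTUZO"))  # → "M K X T X X X"
```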
Hi! I'm trying to use the encoder and decoder of ProtT5 separately, but I failed to get correct protein sequences when passing only the encoder_outputs to the T5ForConditionalGeneration model as well as to model.generate(). I was wondering if there is a way to generate protein sequences from the encoder outputs only? Thanks for your help!