Closed josephyu12 closed 1 year ago
Hi,
there is no direct option that would allow you this, but we have had surprisingly positive experiences on various sequence-level tasks, such as subcellular localization, by simply averaging over the length dimension. For example, if you have an embedding of shape L x 1024, you simply compute the mean over the L-dimension to end up with a single vector of shape 1024, which is independent of a protein's length.
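A minimal sketch of this mean-pooling step (the array here is random dummy data standing in for a real per-residue embedding of shape L x 1024, as produced by e.g. ProtT5):

```python
import numpy as np

# Dummy per-residue embedding for a protein of length L = 7.
# In practice this would come from one of the ProtTrans models.
L, d = 7, 1024
per_residue = np.random.rand(L, d)

# Mean-pool over the length dimension to get one fixed-size
# per-protein vector, regardless of sequence length.
per_protein = per_residue.mean(axis=0)
print(per_protein.shape)  # (1024,)
```

The same one-liner works on a PyTorch tensor via `embedding.mean(dim=0)`.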
Ok, I will try averaging! I'm also wondering though, could the embedding for the first token "[CLS]" be used to represent the entire sequence?
Sure, that's also something you can try. We did not experiment much with ProtT5's special token that gets appended to the end of each sequence (</s>, I think), but it is definitely something worth comparing.
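For comparison, extracting a single-token representation is just an index into the per-residue output. A sketch with dummy data (the token positions are assumptions: BERT-style models prepend [CLS] at position 0, while ProtT5 appends its special token at the end):

```python
import numpy as np

# Dummy output including special-token positions
# (L residues plus special tokens; real shape depends on the model).
embeddings = np.random.rand(9, 1024)

# First-token representation ([CLS]-style, BERT-like models).
first_token = embeddings[0]

# Last-token representation (</s>-style, ProtT5).
last_token = embeddings[-1]
print(first_token.shape, last_token.shape)  # (1024,) (1024,)
```

Either vector is already length-independent, so it can be compared directly against the mean-pooled embedding on a downstream task.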
Hi,
So far, the embeddings I've seen using ProtTrans are all residue-level (one embedding per amino acid). For reference, I am looking at the ipynb examples in the "ProtTrans/Embeddings" folder. Do the models have an option to output a sequence-level embedding (that is, a single embedding for the entire sequence)? Thanks everyone for your help!