Rostlab / SeqVec

Modelling the Language of Life - Deep Learning Protein Sequences
http://embed.protein.properties
MIT License

About padding proteins of different lengths to the same length #23

Closed (viko-3 closed this issue 2 years ago)

viko-3 commented 2 years ago

Hi author, good job on this code! But I have a problem. Suppose I have some proteins, such as ['VRWFPFDVQHCKLK', 'PFDVQHC', ...]. As you can see, they have different lengths, but I want to bring them to the same length. Can I use the padding method from NLP? If so, which token should I pick as the pad character?

mheinzinger commented 2 years ago

Hi; thanks for your interest in SeqVec :) Inputs of different lengths are handled internally by SeqVec/ELMo, so you don't have to worry about this. If you are wondering how to retrieve fixed-length embeddings from SeqVec irrespective of the protein's length, you can simply average the per-residue embeddings over the length dimension, which yields one fixed-size vector per protein.
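
For reference, a minimal sketch of that averaging step, following the embedding pattern from the SeqVec README. The directory name `uniref50_v2` and the file names `options.json`/`weights.hdf5` are assumptions based on the pretrained-model download; adjust them to your local paths:

```python
from pathlib import Path

import torch
from allennlp.commands.elmo import ElmoEmbedder

# Paths are placeholders; point them at your unpacked SeqVec model files.
model_dir = Path('uniref50_v2')
options = model_dir / 'options.json'
weights = model_dir / 'weights.hdf5'
embedder = ElmoEmbedder(options, weights, cuda_device=-1)  # 0 for GPU

seq = 'VRWFPFDVQHCKLK'
emb = torch.tensor(embedder.embed_sentence(list(seq)))  # (3, L, 1024): 3 ELMo layers

per_residue = emb.sum(dim=0)           # (L, 1024): combine the three layers
per_protein = per_residue.mean(dim=0)  # (1024,): fixed length, independent of L
```

The final vector is 1024-dimensional for every protein, no matter how long the sequence is, so no explicit padding is needed on your side.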

xinformatics commented 2 years ago

Hi @mheinzinger. Could you please tell me how this model handles inputs of different lengths?

mheinzinger commented 2 years ago

Hi @xinformatics, we used the bilm-tf implementation of ELMo to train SeqVec. The Batcher class handles the padding: https://github.com/allenai/bilm-tf/blob/master/bilm/data.py#L193 From a quick glance (I have not checked this in a while), the length of the longest sequence within a batch is determined, and all sequences shorter than this are padded with a special token: https://github.com/allenai/bilm-tf/blob/master/bilm/data.py#L126
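
To illustrate the idea, here is a toy sketch of per-batch padding. The helper name `pad_batch` and the `pad_id` value are hypothetical, not part of bilm-tf; the real Batcher (linked above) operates on character ids and, if I recall correctly, also inserts begin/end-of-sentence markers:

```python
import numpy as np

def pad_batch(token_id_seqs, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in token_id_seqs)
    batch = np.full((len(token_id_seqs), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(token_id_seqs):
        batch[i, :len(s)] = s  # copy the real tokens; the tail stays pad_id
    return batch

# Two sequences of different lengths end up in one (2, 5) array:
print(pad_batch([[4, 8, 15, 16, 23], [42, 7]]))
```

Because the pad length is recomputed per batch, short sequences are only padded up to the longest sequence they happen to be batched with, not to a global maximum.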