facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

ValueError: Sequence length 1042 above maximum sequence length of 1024. #166

Closed · liudan111 closed this issue 2 years ago

liudan111 commented 2 years ago

Code:

python extract.py esm1b_t33_650M_UR50S AB024414.fasta esm1b/AB024414 --repr_layers 0 32 33 --include mean

Bug description:

Transferred model to GPU
Read /home1/……/AB024414.fasta with 65 sequences
Processing 1 of 11 batches (16 sequences)
Processing 2 of 11 batches (11 sequences)
Processing 3 of 11 batches (9 sequences)
Processing 4 of 11 batches (7 sequences)
Processing 5 of 11 batches (5 sequences)
Processing 6 of 11 batches (5 sequences)
Processing 7 of 11 batches (4 sequences)
Processing 8 of 11 batches (3 sequences)
Traceback (most recent call last):
  File "extract.py", line 136, in <module>
    main(args)
  File "extract.py", line 95, in main
    out = model(toks, repr_layers=repr_layers, return_contacts=return_contacts)
  File "/home1/……/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home1/……/tool/esm/esm/model.py", line 136, in forward
    x = x + self.embed_positions(tokens)
  File "/home1/……/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home1/……/tool/esm/esm/modules.py", line 242, in forward
    f"Sequence length {input.size(1)} above maximum "
ValueError: Sequence length 1042 above maximum sequence length of 1024.

Do I need to split my protein sequences into lengths of at most 1024? Why is there an issue like this? I would appreciate it if you could help me.

tomsercu commented 2 years ago

Thanks for your question! During training we cropped sequences longer than 1024 tokens, so the model (specifically the learned positional embeddings) cannot handle longer sequences. See #21 and #76 for prior discussion on this topic.
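For anyone hitting the same error: one common workaround (not an official ESM utility) is to crop the input sequences before calling extract.py. The sketch below is a minimal, hypothetical pre-processing step; the 1022-residue cutoff is an assumption that leaves room for the begin/end tokens the ESM-1b alphabet adds, so the total stays within the model's 1024-position limit, and the file names are placeholders taken from this issue.

```python
# Minimal sketch (not part of the ESM codebase): crop FASTA entries to a
# fixed maximum length before running extract.py.
# MAX_RESIDUES = 1022 is an assumption: 1024 positions minus the
# begin/end tokens added by the ESM-1b alphabet.

MAX_RESIDUES = 1022


def truncate_fasta(in_path: str, out_path: str, max_len: int = MAX_RESIDUES) -> None:
    """Copy a FASTA file, cropping any sequence longer than max_len residues."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        header, chunks = None, []

        def flush():
            # Write the record accumulated so far, truncated to max_len residues.
            if header is not None:
                fout.write(header + "\n")
                fout.write("".join(chunks)[:max_len] + "\n")

        for line in fin:
            line = line.rstrip()
            if line.startswith(">"):
                flush()
                header, chunks = line, []
            else:
                chunks.append(line)
        flush()


if __name__ == "__main__":
    # Hypothetical file names matching the command in this issue.
    truncate_fasta("AB024414.fasta", "AB024414_trunc.fasta")
```

The truncated file can then be passed to the same command as in the original post, e.g. `python extract.py esm1b_t33_650M_UR50S AB024414_trunc.fasta esm1b/AB024414 --repr_layers 0 32 33 --include mean`. Note that cropping discards the trailing residues; see the issues referenced above for other approaches that have been discussed for long sequences.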

liudan111 commented 2 years ago

> Thanks for your question! During training we cropped sequences longer than 1024 tokens, so the model (specifically the learned positional embeddings) cannot handle longer sequences. See #21 and #76 for prior discussion on this topic.

Thank you for your reply!