facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License
2.97k stars, 586 forks

Sequence length limits for the different ESM-2 model sizes (8M, 35M, 150M, ...)? #628

Open liudan111 opened 8 months ago

liudan111 commented 8 months ago

Question: What is the maximum protein sequence length for each size of the ESM-2 model (8M, 35M, 150M, ...)?

There is a --truncation_seq_length parameter with a default value of 1024. Does this default apply to all sizes of the ESM-2 model? If most of my protein sequences are longer than 2000 residues, which ESM-2 model is the better choice?

wangleiofficial commented 8 months ago

ESM-2 models use rotary position embeddings (RoPE), so in principle they can be run on sequences longer than the 1024 limit. In practice, memory requirements grow rapidly with sequence length, because standard self-attention scales quadratically with the number of tokens. You could consider our recently developed lightweight language model ProtFlash (https://github.com/ISYSLAB-HUST/ProtFlash), which is extremely memory-friendly.
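
As a rough illustration of how memory grows with sequence length, the sketch below estimates the size of the attention score matrix for one layer and one sequence. It is a back-of-the-envelope calculation, not a measurement of ESM-2. The function name `attention_scores_bytes` is hypothetical. The defaults are assumptions: 20 attention heads, float32 values, 2 extra special tokens per sequence, and an implementation that materializes the full score matrix.

```python
def attention_scores_bytes(seq_len, num_heads=20, bytes_per_value=4, extra_tokens=2):
    """Estimate bytes for one layer's attention score matrix, single sequence.

    All defaults are illustrative assumptions (see the text above), not values
    read from an ESM-2 checkpoint.
    """
    tokens = seq_len + extra_tokens  # residues plus assumed special tokens
    # One (tokens x tokens) score matrix per attention head.
    return num_heads * tokens * tokens * bytes_per_value


if __name__ == "__main__":
    for length in (1022, 2046, 4094):
        mib = attention_scores_bytes(length) / 2**20
        print(f"{length:>5} residues: ~{mib:,.0f} MiB per layer")
```

Under these assumptions, doubling the number of tokens quadruples this term. Activations for the other layers and for larger batches add to it, so real peak memory will be higher than this estimate.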