facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Query on Alphabet Consistency Across Different Scales of ESM2 Models [8M, 35M, 150M, 650M, 3B, 15B] #668

Open CNwangbin opened 3 months ago

CNwangbin commented 3 months ago

I am exploring the ESM2 series of models for a project involving protein sequence analysis. Given the range of model scales available, I have a specific question I hope you can clarify: do all ESM2 variants use an identical alphabet for encoding protein sequences into tokens? In other words, would the same protein sequence produce identical token sequences across the different model scales? I ask because I want our preprocessing pipeline to remain consistent and compatible when using multiple ESM2 models for comparative analysis.
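
For reference, here is how I was planning to verify this on my end: a minimal sketch, assuming the fair-esm package is installed and that the `esm.pretrained` loaders below return a `(model, alphabet)` pair as in the README. It only compares the two smallest ESM2 checkpoints (each loader downloads its weights), and the example sequence is arbitrary.

```python
import esm

# Pretrained loaders to compare; the larger checkpoints could be added the same way,
# at the cost of downloading their weights.
loaders = {
    "esm2_t6_8M_UR50D": esm.pretrained.esm2_t6_8M_UR50D,
    "esm2_t12_35M_UR50D": esm.pretrained.esm2_t12_35M_UR50D,
}

# Arbitrary example protein sequence used only to compare tokenization output.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

alphabets = {}
for name, loader in loaders.items():
    _, alphabet = loader()  # each loader returns (model, alphabet)
    alphabets[name] = alphabet

# Compare every alphabet against the first one loaded.
reference_name, reference = next(iter(alphabets.items()))
for name, alphabet in alphabets.items():
    same_vocab = alphabet.all_toks == reference.all_toks
    same_ids = alphabet.encode(sequence) == reference.encode(sequence)
    print(f"{name}: vocab matches {reference_name}: {same_vocab}, "
          f"token ids match: {same_ids}")
```

Comparing `all_toks` checks the full vocabulary (including special tokens), while comparing the `encode` output checks that an actual sequence maps to the same token ids end to end.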