facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Query on Alphabet Consistency Across Different Scales of ESM2 Models [8M, 35M, 150M, 650M, 3B, 15B] #668

Open CNwangbin opened 3 months ago

CNwangbin commented 3 months ago

I am exploring the ESM2 series of models for a project involving protein sequence analysis. Given the range of model scales available, I have a specific question I hope you can clarify: do all ESM2 variants use an identical alphabet for encoding protein sequences into tokens? In other words, would the same protein sequence produce identical token sequences across the different model scales? I ask because I want our preprocessing pipeline to remain consistent and compatible when using multiple ESM2 models for comparative analysis.
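
For reference, here is how I was planning to verify this on my end: a minimal sketch, assuming the fair-esm package is installed and that the `esm.pretrained` loaders below return a `(model, alphabet)` pair as in the README. It only compares the two smallest ESM2 checkpoints (each loader downloads its weights), and the example sequence is arbitrary.

```python
import esm

# Pretrained loaders to compare; the larger checkpoints could be added the same way,
# at the cost of downloading their weights.
loaders = {
    "esm2_t6_8M_UR50D": esm.pretrained.esm2_t6_8M_UR50D,
    "esm2_t12_35M_UR50D": esm.pretrained.esm2_t12_35M_UR50D,
}

# Arbitrary example protein sequence used only to compare tokenization output.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

alphabets = {}
for name, loader in loaders.items():
    _, alphabet = loader()  # each loader returns (model, alphabet)
    alphabets[name] = alphabet

# Compare every alphabet against the first one loaded.
reference_name, reference = next(iter(alphabets.items()))
for name, alphabet in alphabets.items():
    same_vocab = alphabet.all_toks == reference.all_toks
    same_ids = alphabet.encode(sequence) == reference.encode(sequence)
    print(f"{name}: vocab matches {reference_name}: {same_vocab}, "
          f"token ids match: {same_ids}")
```

Comparing `all_toks` checks the full vocabulary (including special tokens), while comparing the `encode` output checks that an actual sequence maps to the same token ids end to end.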