facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

missing EOS token in encoded sequences for T6 and T12 #241

Closed: adrienchaton closed this issue 2 years ago

adrienchaton commented 2 years ago

Hi, I observe a different behaviour for the T6 and T12 checkpoints compared to the 1b T33 checkpoint, which surprised me and may not be intended.

Bug description

After getting the pretrained alphabet and batch converter, the encoded sequences are missing the EOS token at the end of the sequence (i.e. before padding). This is unusual; ESM-1b shows the expected behaviour.

Reproduction steps

import esm

model, alphabet = getattr(esm.pretrained, "esm1_t6_43M_UR50S")()  # same with esm1_t12_85M_UR50S
batch_converter = alphabet.get_batch_converter()
_, _, batch_tokens = batch_converter([("0", "".join(["A"] * 10))])
print(batch_tokens)  # [32, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
_, _, batch_tokens = batch_converter([("0", "".join(["A"] * 10)), ("1", "".join(["A"] * 12))])
print(batch_tokens)  # [[32, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1], [32, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]]
print(alphabet.cls_idx, alphabet.eos_idx, alphabet.padding_idx)  # 32 2 1

model, alphabet = getattr(esm.pretrained, "esm1b_t33_650M_UR50S")()
batch_converter = alphabet.get_batch_converter()
_, _, batch_tokens = batch_converter([("0", "".join(["A"] * 10))])
print(batch_tokens)  # [0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2]
print(alphabet.cls_idx, alphabet.eos_idx, alphabet.padding_idx)  # 0 2 1

Is it alright to use these pretrained models without the EOS token appended?

tomsercu commented 2 years ago

Thanks for raising this issue. This is expected; here's the responsible code: https://github.com/facebookresearch/esm/blob/main/esm/data.py#L149
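For reference, each alphabet carries prepend_bos / append_eos flags that the batch converter follows; a quick way to compare the two (a sketch, assuming your esm version exposes these attributes on esm.data.Alphabet as recent releases do):

import esm

# The batch converter only appends EOS when alphabet.append_eos is True.
for name in ("esm1_t6_43M_UR50S", "esm1b_t33_650M_UR50S"):
    _, alphabet = getattr(esm.pretrained, name)()
    print(name, "prepend_bos:", alphabet.prepend_bos, "append_eos:", alphabet.append_eos)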

ESM-1 was trained this way, a slight difference from ESM-1b.
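So when extracting per-residue embeddings you only need to skip the tokens the alphabet actually adds. Something along these lines should work for both model families (a sketch using the repr_layers API with toy example sequences, reading the offsets from the alphabet flags rather than hard-coding them):

import torch
import esm

model, alphabet = esm.pretrained.esm1_t6_43M_UR50S()  # for esm1b_t33_650M_UR50S, use repr layer 33
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("seq0", "MKTVRQERLK"), ("seq1", "KALTARQQEVFDLIRD")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[6])
reps = results["representations"][6]

# Keep only the residue positions, skipping BOS (if any) and leaving out
# EOS/padding; ESM-1 adds no EOS, ESM-1b does.
start = int(alphabet.prepend_bos)
per_residue = [reps[i, start : start + len(seq)] for i, (_, seq) in enumerate(data)]
print([r.shape for r in per_residue])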