Closed: adrienchaton closed this issue 2 years ago
Thanks for raising this issue. This behaviour is expected, and here is the responsible code: https://github.com/facebookresearch/esm/blob/main/esm/data.py#L149
ESM-1 was trained this way; it is a slight difference from ESM-1b.
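As an aside, one way to confirm the difference is to inspect the tokenisation flags on the two alphabets. A minimal sketch, assuming a recent version of the esm package in which Alphabet exposes prepend_bos and append_eos attributes (the expected outputs in the comments are an assumption based on the linked data.py logic):

```python
import esm

# Sketch only: assumes Alphabet exposes prepend_bos / append_eos,
# as set per architecture in esm/data.py.
_, alphabet_esm1 = esm.pretrained.esm1_t6_43M_UR50S()
_, alphabet_esm1b = esm.pretrained.esm1b_t33_650M_UR50S()

# Assumed behaviour: ESM-1 prepends BOS/CLS but does not append EOS,
# while ESM-1b both prepends BOS/CLS and appends EOS.
print(alphabet_esm1.prepend_bos, alphabet_esm1.append_eos)    # True False
print(alphabet_esm1b.prepend_bos, alphabet_esm1b.append_eos)  # True True
```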
Hi, I observe a different behaviour for the T6 and T12 checkpoints compared to the 1b T33 checkpoint, which surprises me, and I suspect it may not be expected.
Bug description
After getting the pretrained alphabet and batch converter, the encoded sequences are missing the EOS token that should be appended at the end of the sequence (before padding). This is unusual; ESM-1b shows the expected behaviour.
Reproduction steps

```python
import esm

model, alphabet = getattr(esm.pretrained, "esm1_t6_43M_UR50S")()  # same with esm1_t12_85M_UR50S
batch_converter = alphabet.get_batch_converter()

_, _, batch_tokens = batch_converter([("0", "".join(["A"] * 10))])
print(batch_tokens)  # [32, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

_, _, batch_tokens = batch_converter([("0", "".join(["A"] * 10)), ("1", "".join(["A"] * 12))])
print(batch_tokens)  # [[32, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1], [32, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]]

print(alphabet.cls_idx, alphabet.eos_idx, alphabet.padding_idx)  # 32 2 1
```
model, alphabet = getattr(esm.pretrained, "esm1b_t33_650M_UR50S")() batch_converter = alphabet.get_batchconverter() , _, batch_tokens = batch_converter([("0","".join(["A"]*10))]) print(batch_tokens) # [0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2] print(alphabet.cls_idx, alphabet.eos_idx, alphabet.padding_idx) # 0 2 1
Is it alright to use these pretrained models without the EOS token appended?
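In case it is useful to others reading this, below is a minimal sketch of pooling per-residue representations from an ESM-1 checkpoint, assuming the usual repr_layers forward API; since no EOS is appended, only the leading BOS/CLS token and any padding are stripped. This is only an illustration of how I would handle it, not an official recommendation.

```python
import torch
import esm

# Sketch only: assumes the standard forward signature
# model(tokens, repr_layers=[...]) -> {"representations": {layer: tensor}}.
model, alphabet = esm.pretrained.esm1_t6_43M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("0", "A" * 10), ("1", "A" * 12)]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])
reps = out["representations"][6]  # (batch, tokens_with_bos_and_padding, dim)

# For ESM-1 there is no EOS token, so per-residue representations are
# recovered by skipping only the BOS/CLS token at position 0 and padding.
for i, (label, seq) in enumerate(data):
    per_residue = reps[i, 1 : 1 + len(seq)]
    print(label, per_residue.mean(dim=0).shape)
```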