facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

missing EOS token in encoded sequences for T6 and T12 #241

Closed: adrienchaton closed this issue 2 years ago

adrienchaton commented 2 years ago

Hi, I observe a different behaviour for the T6 and T12 checkpoints compared to the 1b T33 checkpoint, which surprised me and may not be intended.

Bug description

After getting the pretrained alphabet and batch converter, the encoded sequences are missing the EOS token at the end of the sequence (i.e. before padding). This is unusual; ESM-1b shows the expected behaviour.

Reproduction steps

import esm

model, alphabet = getattr(esm.pretrained, "esm1_t6_43M_UR50S")()  # same with esm1_t12_85M_UR50S
batch_converter = alphabet.get_batch_converter()
_, _, batch_tokens = batch_converter([("0", "".join(["A"] * 10))])
print(batch_tokens)  # [32, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
_, _, batch_tokens = batch_converter([("0", "".join(["A"] * 10)), ("1", "".join(["A"] * 12))])
print(batch_tokens)  # [[32, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1], [32, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]]
print(alphabet.cls_idx, alphabet.eos_idx, alphabet.padding_idx)  # 32 2 1

model, alphabet = getattr(esm.pretrained, "esm1b_t33_650M_UR50S")()
batch_converter = alphabet.get_batch_converter()
_, _, batch_tokens = batch_converter([("0", "".join(["A"] * 10))])
print(batch_tokens)  # [0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2]
print(alphabet.cls_idx, alphabet.eos_idx, alphabet.padding_idx)  # 0 2 1

Is it alright to use these pretrained models without the EOS token appended?

tomsercu commented 2 years ago

Thanks for raising this issue. This is expected; here's the responsible code: https://github.com/facebookresearch/esm/blob/main/esm/data.py#L149
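For reference, each alphabet carries prepend_bos / append_eos flags that the batch converter follows; a quick way to compare the two (a sketch, assuming your esm version exposes these attributes on esm.data.Alphabet as recent releases do):

import esm

# The batch converter only appends EOS when alphabet.append_eos is True.
for name in ("esm1_t6_43M_UR50S", "esm1b_t33_650M_UR50S"):
    _, alphabet = getattr(esm.pretrained, name)()
    print(name, "prepend_bos:", alphabet.prepend_bos, "append_eos:", alphabet.append_eos)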

ESM-1 was trained this way, a slight difference from ESM-1b.
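So when extracting per-residue embeddings you only need to skip the tokens the alphabet actually adds. Something along these lines should work for both model families (a sketch using the repr_layers API with toy example sequences, reading the offsets from the alphabet flags rather than hard-coding them):

import torch
import esm

model, alphabet = esm.pretrained.esm1_t6_43M_UR50S()  # for esm1b_t33_650M_UR50S, use repr layer 33
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("seq0", "MKTVRQERLK"), ("seq1", "KALTARQQEVFDLIRD")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[6])
reps = results["representations"][6]

# Keep only the residue positions, skipping BOS (if any) and leaving out
# EOS/padding; ESM-1 adds no EOS, ESM-1b does.
start = int(alphabet.prepend_bos)
per_residue = [reps[i, start : start + len(seq)] for i, (_, seq) in enumerate(data)]
print([r.shape for r in per_residue])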