facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Error in batch_converter from PyPI release #176

Closed: nickbhat closed this issue 2 years ago

nickbhat commented 2 years ago

This seems related to #161

The README example does not work as intended with the PyPI build: the <mask> string is converted to a series of <unk> tokens rather than a single mask token.

Reproduction steps
Install from PyPI following the README (pip install fair-esm), then run the following example:

import esm  # missing from the original snippet; required for the call below

# Load the pretrained ESM-1b model and its alphabet (tokenizer vocabulary)
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()

data = [
    ("protein1",  "K A <mask> I S Q"),
    ("protein2",  "KA<mask>ISQ"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

print(batch_tokens)

print(f"Mask idx: {alphabet.mask_idx}")
print(f"Unk idx: {alphabet.unk_idx}")

The output I get is

tensor([[ 0, 15,  3,  5,  3,  3,  3,  3,  3,  3,  3,  3, 12,  3,  8,  3, 16,  2],
        [ 0, 15,  5,  3,  3,  3,  3,  3,  3, 12,  8, 16,  2,  1,  1,  1,  1,  1]])
Mask idx: 32
Unk idx: 3

Expected behavior
I assume the intended output for both row 1 and row 2 is tensor([0, 15, 5, 32, 12, 8, 16, 2]). Instead, the <mask> string is converted to a series of <unk> tokens, as are the whitespace characters.
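
For reference, the same expectation can be written as assertions (a minimal sketch using only the attributes printed above; with the broken PyPI build these assertions fail, which is exactly the bug):

import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()

_, _, batch_tokens = batch_converter([("protein1", "KA<mask>ISQ")])

# Expected: exactly one mask token and no <unk> tokens in the sequence
assert (batch_tokens == alphabet.mask_idx).sum().item() == 1
assert (batch_tokens == alphabet.unk_idx).sum().item() == 0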

tomsercu commented 2 years ago

Thanks for calling that out; we'll do the new pip build as soon as some recent updates get merged.

tomsercu commented 2 years ago

This should be resolved now that the current main branch has been released as a new version on pip.

! pip install --upgrade fair-esm  # --upgrade so an existing install picks up the new release

import esm

# A smaller model is used here for a quick check; esm1b works the same way
model, alphabet = esm.pretrained.esm1_t6_43M_UR50S()
batch_converter = alphabet.get_batch_converter()

data = [
    ("protein1",  "K A <mask> I S Q"),
    ("protein2",  "KA<mask>ISQ"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

print(batch_tokens)

print(f"Mask idx: {alphabet.mask_idx}")
print(f"Unk idx: {alphabet.unk_idx}")

gives the desired result.
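
If fair-esm was already installed, it may also be worth confirming that the upgrade actually took effect before re-running (a small check using only the Python standard library; the fixed version number itself isn't stated in this thread):

import importlib.metadata

# Shows the currently installed fair-esm version
print(importlib.metadata.version("fair-esm"))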

Please give it a try!