facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Possible mistake in batch_converter #161

Closed Haasha closed 2 years ago

Haasha commented 2 years ago

Bug description: When I view tok_to_idx from alphabet, there are 33 unique classes, which look something like this:

{
  '<cls>': 0,
  '<pad>': 1,
  ...
  '<null_1>': 31,
  '<mask>': 32
}

Suppose the function obtained from alphabet.get_batch_converter() is stored in batch_converter, just as in the README. When we pass tokens other than amino acids, they are treated as <unk>.
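For illustration, here is a minimal check of that behavior (a sketch on my part, assuming alphabet.get_idx() falls back to the <unk> index for tokens it does not know, which is what the observed output suggests):

import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
# Look up each character of "<pad>" individually; characters outside the
# vocabulary should map to index 3, the <unk> token.
for ch in "<pad>":
    print(ch, alphabet.get_idx(ch))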

Reproduction steps: Use these lines of code to replicate the issue:

import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
data = [
    ("protein1", "A<pad>"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
print(batch_tokens)

Expected behavior: According to my understanding, the output of the code above should be tensor([[0, 5, 1, 2]]), i.e. <cls> (0), A (5), <pad> (1), <eos> (2).

Current behavior: This is the output actually obtained from the code above: tensor([[0, 5, 3, 3, 3, 3, 3, 2]]). It seems that batch_converter is treating each character of <pad> as an individual token, i.e. '<', 'p', 'a', 'd', '>', which is why it appends five 3's to batch_tokens, since 3 is the index of the unknown token.
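To make the mapping explicit, the ids can be decoded back to tokens. A small sketch, continuing from the reproduction snippet above and assuming the alphabet exposes a get_tok() lookup:

ids = [0, 5, 3, 3, 3, 3, 3, 2]
# 3 should decode to '<unk>' for every character of '<pad>'
print([alphabet.get_tok(i) for i in ids])
# expected decoding: ['<cls>', 'A', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<eos>']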

Additional context: If the tokenizer isn't working correctly, I can't rely at all on the embeddings generated by the model. I am also attaching a screenshot of the _tokenizer function from alphabet, which shows that text.split() is being called, so the text is converted into individual characters rather than the classes listed above. (screenshot: image_2022-01-27_104659)
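For illustration, this is roughly the splitting behavior I would expect instead (just a sketch of the idea, not the library's actual code; the special-token list here is my assumption):

import re

SPECIALS = ("<cls>", "<pad>", "<eos>", "<unk>", "<mask>")

def split_keeping_specials(text):
    # Keep special tokens intact, split everything else per character.
    pattern = "(" + "|".join(re.escape(t) for t in SPECIALS) + ")"
    tokens = []
    for part in re.split(pattern, text):
        if part in SPECIALS:
            tokens.append(part)
        else:
            tokens.extend(part)
    return tokens

print(split_keeping_specials("A<pad>"))  # ['A', '<pad>']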

The issue seems to be linked to the changes made in commit 3bb552aff1e3a7debcf3e0828c107b5c6e96938b. I'm looking forward to hearing from you regarding this. Thanks.

Regards, Haasha Bin Atif

liujas000 commented 2 years ago

Hi @Haasha, thanks for the detailed steps! I just checked this exact case (locally added it as a test case in tests/test_alphabet.py) and was able to get tensor([[0, 5, 1, 2]]) from "A<pad>".

I won't rule out the possibility of a bug (the tokenization code is admittedly a bit hacky), but can you first make sure that you are running the latest version of esm?
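For reference, one quick way to check the installed version (a sketch using the standard-library importlib.metadata lookup on the fair-esm distribution):

from importlib.metadata import version
print(version("fair-esm"))  # compare against the latest release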

Edit: I also opened an interpreter and ran the code you posted above:

>>> import esm
>>> model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
>>> batch_converter = alphabet.get_batch_converter()
>>> data = [
...     ("protein1", "A<pad>"),
... ]
>>> batch_labels, batch_strs, batch_tokens = batch_converter(data)
>>> print(batch_tokens)
tensor([[0, 5, 1, 2]])
Haasha commented 2 years ago

I installed esm using !pip install fair-esm, and the latest version on PyPI seems to be 0.4.0, which was released on July 12, 2021. Mine is the same version, but I'll try cloning the current GitHub repo and test it again.
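(For reference, installing straight from the repository can presumably also be done with pip install git+https://github.com/facebookresearch/esm.git instead of cloning manually.)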

Haasha commented 2 years ago

@liujas000 Thanks for your quick response. You were right: cloning the repo from GitHub resolved the issue I highlighted. I would recommend releasing a new build on PyPI as well, so that anyone who installs the package gets a version that is stable and works as intended.