facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Special Symbols In Proteins #13

Closed JinLi711 closed 3 years ago

JinLi711 commented 3 years ago

Does the ESM model deal with special symbols for proteins?

Does it deal with input sequences with gaps? For example, sequence = ---AB----C?

Does it deal with ambiguous residues like BZJX?

Thank you!

joshim5 commented 3 years ago

Hi @JinLi711, great question. Here is a list of tokens that the models recognize. In addition to these, the models also recognize <cls>, <mask>, <pad>, <unk>, etc.

If you use tokens outside this vocabulary, for example gap characters (- or .), the batch converter will convert these to <unk>. Let us know if you have any further questions!
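The intended fallback can be pictured with a minimal sketch. This is not the actual esm implementation: the toy vocabulary, its ordering, and the `encode` helper are assumptions made for illustration only.

```python
# Hypothetical toy vocabulary; the real ESM alphabet is larger and its
# index order may differ. Only the fallback idea is illustrated here.
TOY_TOKENS = ["<cls>", "<pad>", "<unk>", "<mask>", "M", "Y", "L", "K", "I", "N"]
tok_to_idx = {tok: i for i, tok in enumerate(TOY_TOKENS)}
unk_idx = tok_to_idx["<unk>"]


def get_idx(tok):
    """Return the index of tok, falling back to <unk> for unseen symbols."""
    return tok_to_idx.get(tok, unk_idx)


def encode(seq):
    """Map each character of seq to a vocabulary index (hypothetical helper)."""
    return [get_idx(ch) for ch in seq]


# Gap characters are not in the toy vocabulary, so they map to unk_idx.
encoded = encode("MYL--KIKN")
```

Using `dict.get` with a default is what separates this from a plain `tok_to_idx[tok]` lookup, which raises `KeyError` on any symbol outside the vocabulary.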

JinLi711 commented 3 years ago

Hi @joshim5 !

Are you sure that the batch converter converts - to <unk>?

I'm running:

import torch
import esm

# Load 34 layer model
model, alphabet = esm.pretrained.esm1_t34_670M_UR50S()
batch_converter = alphabet.get_batch_converter()

# Prepare data (two protein sequences)
data = [("protein1", "MYL--KIKN"), ("protein2", "MNAKYD")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

But I'm getting the error message:

Traceback (most recent call last):
  File "script.py", line 10, in <module>
    batch_labels, batch_strs, batch_tokens = batch_converter(data)
  File "/home/jinli11/miniconda3/envs/ESM/lib/python3.7/site-packages/esm/data.py", line 150, in __call__
    seq = torch.tensor([self.alphabet.get_idx(s) for s in seq_str], dtype=torch.int64)
  File "/home/jinli11/miniconda3/envs/ESM/lib/python3.7/site-packages/esm/data.py", line 150, in <listcomp>
    seq = torch.tensor([self.alphabet.get_idx(s) for s in seq_str], dtype=torch.int64)
  File "/home/jinli11/miniconda3/envs/ESM/lib/python3.7/site-packages/esm/data.py", line 106, in get_idx
    return self.tok_to_idx[tok]
KeyError: '-'

which suggests that the batch converter cannot handle characters outside the vocabulary.

Is there any way I can get around this?

Thank you!

joshim5 commented 3 years ago

Hi @JinLi711, you're right - the functionality that converts - to <unk> was missing. I just added a fix.

Thanks for reporting and please let us know if you have any outstanding issues.
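For an installation that predates the fix, one caller-side workaround is to replace out-of-vocabulary characters before calling the batch converter. The sketch below is only an illustration: `sanitize` is a hypothetical helper, and the choice of "X" as the placeholder assumes that "X" is in the model's vocabulary, which should be checked against the token list.

```python
def sanitize(seq, known_tokens, placeholder="X"):
    """Replace every character not in known_tokens with placeholder.

    Hypothetical helper, not part of esm. Raises ValueError if the
    placeholder itself is not a known token.
    """
    if placeholder not in known_tokens:
        raise ValueError(f"placeholder {placeholder!r} is not in the vocabulary")
    return "".join(ch if ch in known_tokens else placeholder for ch in seq)


# Toy vocabulary for illustration; with a loaded model one could instead
# build the set from the alphabet's tok_to_idx mapping seen in the traceback:
#   known = set(alphabet.tok_to_idx)
known = set("MYLKINAD") | {"X"}
data = [("protein1", "MYL--KIKN"), ("protein2", "MNAKYD")]
clean = [(label, sanitize(seq, known)) for label, seq in data]
```

Note that this changes the input: the model then sees the placeholder residue rather than <unk>, so embeddings at those positions may differ from what the fixed batch converter produces.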

jzazo commented 3 years ago

This fix has broken things. See my description in the commit.

joshim5 commented 3 years ago

@jzazo thanks for flagging, now resolved in this commit.