Closed JinLi711 closed 3 years ago
Hi @JinLi711, great question. Here is a list of tokens that the models recognize. In addition to these, the models also recognize <cls>, <mask>, <pad>, <unk>, etc.
If you use tokens outside this vocabulary, for example gap characters (- or .), the batch converter will convert these to <unk>. Let us know if you have any further questions!
Hi @joshim5!
Are you sure that the batch converter converts - to <unk>?
I'm running:
import torch
import esm
# Load 34 layer model
model, alphabet = esm.pretrained.esm1_t34_670M_UR50S()
batch_converter = alphabet.get_batch_converter()
# Prepare data (two protein sequences)
data = [("protein1", "MYL--KIKN"), ("protein2", "MNAKYD")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
But I'm getting the error message:
Traceback (most recent call last):
  File "script.py", line 10, in <module>
    batch_labels, batch_strs, batch_tokens = batch_converter(data)
  File "/home/jinli11/miniconda3/envs/ESM/lib/python3.7/site-packages/esm/data.py", line 150, in __call__
    seq = torch.tensor([self.alphabet.get_idx(s) for s in seq_str], dtype=torch.int64)
  File "/home/jinli11/miniconda3/envs/ESM/lib/python3.7/site-packages/esm/data.py", line 150, in <listcomp>
    seq = torch.tensor([self.alphabet.get_idx(s) for s in seq_str], dtype=torch.int64)
  File "/home/jinli11/miniconda3/envs/ESM/lib/python3.7/site-packages/esm/data.py", line 106, in get_idx
    return self.tok_to_idx[tok]
KeyError: '-'
which is telling me that the models are unable to accept unknown characters.
Is there any way I can get around this?
Thank you!
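One possible workaround on an installation that still raises this KeyError is to clean the sequences before they reach the batch converter. The sketch below is only an illustration: sanitize_sequence and TOY_VOCAB are hypothetical names, the toy vocabulary stands in for the model's real one, and using "X" as the placeholder assumes that "X" is itself a token the model recognizes.

```python
def sanitize_sequence(seq, known_tokens, replacement="X"):
    """Replace every character that is not in known_tokens with replacement."""
    return "".join(ch if ch in known_tokens else replacement for ch in seq)


# Toy vocabulary for illustration only (20 standard residues plus "X").
# It is NOT the real ESM vocabulary.
TOY_VOCAB = set("ACDEFGHIKLMNPQRSTVWYX")

data = [("protein1", "MYL--KIKN"), ("protein2", "MNAKYD")]

# Clean each sequence while keeping its label untouched.
clean_data = [(label, sanitize_sequence(seq, TOY_VOCAB)) for label, seq in data]
print(clean_data)
```

With a loaded model, the keys of alphabet.tok_to_idx (the dictionary named in the traceback) could presumably be passed as known_tokens instead of the toy set. Note that replacing gaps keeps the sequence length and alignment positions, while deleting them (replacement="") would not.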
Hi @JinLi711, you're right: the functionality that converts - to <unk> was missing. I just added a fix.
Thanks for reporting and please let us know if you have any outstanding issues.
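For readers curious what such a fallback might look like, here is a minimal sketch. It is not the actual change made to the library: ToyAlphabet is a hypothetical stand-in class, and only the names tok_to_idx and get_idx are taken from the traceback above. The idea is simply to look tokens up with a default instead of indexing the dictionary directly.

```python
class ToyAlphabet:
    """Hypothetical stand-in for a token vocabulary with an <unk> fallback."""

    def __init__(self, tokens, unk_token="<unk>"):
        all_tokens = [unk_token] + list(tokens)
        self.tok_to_idx = {tok: i for i, tok in enumerate(all_tokens)}
        self.unk_idx = self.tok_to_idx[unk_token]

    def get_idx(self, tok):
        # dict.get returns the default instead of raising KeyError
        # when the token is missing from the vocabulary.
        return self.tok_to_idx.get(tok, self.unk_idx)


# Toy vocabulary of the 20 standard residues, for illustration only.
alphabet = ToyAlphabet("ACDEFGHIKLMNPQRSTVWY")

# The two gap characters map to the <unk> index rather than raising.
indices = [alphabet.get_idx(ch) for ch in "MYL--KIKN"]
print(indices)
```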
Does the ESM model deal with special symbols for proteins?
Does it deal with input sequences with gaps? For example, sequence = ---AB----C?
Does it deal with ambiguous residues like BZJX? Thank you!
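One way to check this for a given model is to compare a sequence against the token-to-index dictionary of its alphabet. The helper below is a hypothetical sketch (out_of_vocab_chars is not part of the library), and the toy dictionary holds only the 20 standard residues, so its output says nothing about what the real ESM vocabulary contains.

```python
def out_of_vocab_chars(seq, tok_to_idx):
    """Return the distinct characters of seq that are missing from tok_to_idx,
    in order of first appearance."""
    missing = []
    for ch in seq:
        if ch not in tok_to_idx and ch not in missing:
            missing.append(ch)
    return missing


# Toy vocabulary for illustration only; substitute the alphabet's own
# tok_to_idx dictionary to test against a real model.
toy_tok_to_idx = {tok: i for i, tok in enumerate("ACDEFGHIKLMNPQRSTVWY")}

for seq in ["---AB----C", "BZJX"]:
    print(seq, out_of_vocab_chars(seq, toy_tok_to_idx))
```

An empty list for a sequence would mean every character has its own token; anything reported would be the characters subject to the <unk> conversion discussed above.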