facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

KeyError: 'J' #164

Closed liudan111 closed 2 years ago

liudan111 commented 2 years ago

code:

python extract.py esm1_t34_670M_UR50S run.fasta AB644285 --repr_layers 34 --include mean

Bug description

    Traceback (most recent call last):
      File "extract.py", line 136, in <module>
        main(args)
      File "extract.py", line 83, in main
        for batch_idx, (labels, strs, toks) in enumerate(data_loader):
      File "/home1//anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
        data = self._next_data()
      File "/home1//anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
        data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
      File "/home1//anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
        return self.collate_fn(data)
      File "/home1//tool/esm/esm/data.py", line 258, in __call__
        seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
      File "/home1//tool/esm/esm/data.py", line 258, in <listcomp>
        seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
      File "/home1//tool/esm/esm/data.py", line 243, in encode
        return [self.tok_to_idx[tok] for tok in self.tokenize(text)]
      File "/home1/*/tool/esm/esm/data.py", line 243, in <listcomp>
        return [self.tok_to_idx[tok] for tok in self.tokenize(text)]
    KeyError: 'J'

When I input a protein sequence like:

    BAL03319.1|/gene="P",/product="polymerase"||AB644285|join{2258:3164, 0:1575}
    MPLSYQHFRKLLLLDDEAGPLEEELPRLADEGLNRRVAEDLNLGNPNVSIPWTHKVGNFT
    GLYSSTVPVFNPEWQTPSFPDIHLREDIIDRCQQYVGPLTVNEKRRLKLIMPARFYPNFT
    KYMPLDKGIKPYYPEHAVNHYFKTRHYLHTLWKAGILYKRETTRSASFFGSPYSWEQDLH
    HGAFVDGPSRVGKESFYQQSSGVLSRPPVGSRIPRKFQQSRLGFQSQQGSLASGKSGRSG
    SIRARVHPTTRRSVGVEPASSGHIDNSASSASSCLHQSAVRKTAYSHLSTSKRQSSSGHA
    VELHPCWWLQFRNSKPCSDYCLSHIVNLLEDWEPCIEHGEHNIRIPRTPARVTGGVFLVD
    KNPHNTTESRLVVDFSQFSRGSTRVSWPKFAVPNLQSLTNLLSSNLSWLSLDVSAAFYHL
    PLHPAAMPHLLVGSSGLPRYVARLSSTSRNINHQHGTMQDLHDSCSRNLYVSLMLLYKTF
    GWKLHLYSHPIILGFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDLVLGA
    KSVQHLESLYATITNFLLSLGIHLNPNKTKRWGYSLNFMGYVIGSWGTLPQEHIVLKJKQ
    CFRKLPVNRPIDWKVCQRIVGLLGFAAPFTQCGYPALMPLYACIHAKQAFTFSPTYKAFL
    CKQYLNLYPVARQRSGLCQVFADATPTGWGLAIGHQRMRGTFVAPLPIHTAELLAACFAR
    SRSGAKLIGTDNSVVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGI
    YRPLLRLPFRPTTGRTSLYAVSPSVPSHLPDRVHFASPLHVAWRPP

Why does this error occur? I would appreciate any help.

MesihK commented 2 years ago

J is an ambiguous amino acid code that, I assume, ESM was not trained on.

liudan111 commented 2 years ago

J is an ambiguous amino acid code that, I assume, ESM was not trained on.

Thank you for your reply! You are right: I tried running several protein sequences that contain J, and every one of them fails with "KeyError: 'J'".

tomsercu commented 2 years ago

Thanks @MesihK - Correct, esm is trained on the 20 standard amino acids plus some ambiguous amino acids like X, B, Z, etc, if they appear in uniref-50. See https://github.com/facebookresearch/esm/blob/main/esm/constants.py#L8
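A quick way to find the offending letters before running extract.py is to compare each sequence against the supported token set. The sketch below is only an illustration: the token list is written out from memory of `proteinseq_toks` in the linked constants.py and should be checked against that file, and `ESM_RESIDUE_TOKENS` and `unsupported_residues` are hypothetical names, not part of esm.

```python
# Assumption: this mirrors proteinseq_toks["toks"] in esm/constants.py
# (20 standard residues plus X, B, U, Z, O and the gap symbols "." and "-").
# Verify against the linked file before relying on it.
ESM_RESIDUE_TOKENS = set("LAGVSERTIDPKQNFYMHWCXBUZO") | {".", "-"}


def unsupported_residues(sequence):
    """Return {letter: [0-based positions]} for letters outside the token set."""
    problems = {}
    for position, letter in enumerate(sequence):
        if letter not in ESM_RESIDUE_TOKENS:
            problems.setdefault(letter, []).append(position)
    return problems


# Short fragment taken from the sequence reported in this issue.
print(unsupported_residues("HIVLKJKQ"))  # {'J': [5]}
```

An empty result means every letter is in the assumed token set, so `tok_to_idx` lookups of the kind shown in the traceback should not raise for that sequence.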

zhoubay commented 1 year ago

Same error here, but what I would like to know is why UniRef50 treats J so differently. X, B, J and Z are all ambiguous amino acid codes. According to Wikipedia, B indicates asparagine or aspartic acid, J indicates leucine or isoleucine, X stands for unknown, and Z stands for glutamic acid or glutamine. So every letter of the English alphabet is covered except J. Is there a reason for this?

dariotommasini commented 5 months ago

Hello! Is there a recommended solution for this? Should we substitute an X for every J in a protein record before calling extract.py?

zhoubay commented 5 months ago

Hello! Is there a recommended solution for this? Should we substitute an X for every J in a protein record before calling extract.py?

Yes, I think that is the best solution.
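A minimal sketch of that substitution, assuming a standard FASTA file whose header lines start with ">". The function name `replace_j_with_x` is hypothetical and not part of esm. Header lines are left untouched so that record IDs containing a J are not altered.

```python
def replace_j_with_x(fasta_text):
    """Replace J with X on sequence lines only; header lines are kept as-is."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header line: do not modify IDs or descriptions.
            cleaned.append(line)
        else:
            # Sequence line: J (leucine or isoleucine) becomes X (unknown).
            cleaned.append(line.replace("J", "X"))
    return "\n".join(cleaned) + "\n"


example = ">seqJ example record\nHIVLKJKQ\n"
print(replace_j_with_x(example), end="")
```

To apply it, read the input file, pass its contents through the function, write the result to a new file, and give that file to extract.py. Note that X means "unknown", so the information that the residue is leucine or isoleucine is lost in the embedding input.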