Closed liudan111 closed 2 years ago
J is an ambiguous amino acid code that, I assume, ESM was not trained on.
Thank you for your reply! You are right: I tried running several protein sequences that contain J, and every one failed with "KeyError: 'J'".
Thanks @MesihK. Correct, ESM is trained on the 20 standard amino acids plus some ambiguous amino acid codes such as X, B, Z, etc., if they appear in UniRef50. See https://github.com/facebookresearch/esm/blob/main/esm/constants.py#L8
Same error here, but what I am curious about is why UniRef50 treats J so differently. X, B, J and Z are all ambiguous amino acid codes. According to Wikipedia, B indicates asparagine or aspartic acid, J indicates leucine or isoleucine, X stands for unknown, and Z stands for glutamic acid or glutamine. So every letter of the English alphabet is covered except J. Is there a reason for this?
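The "every letter except J" claim can be checked with a few lines of Python. This is only a sketch: the token list below is hand-copied from the `proteinseq_toks` entry in the esm/constants.py file linked above, so treat it as an assumption and verify it against your installed version.

```python
import string

# Residue tokens as listed under proteinseq_toks in esm/constants.py.
# Hand-copied for illustration; check against the version you have installed.
ESM_RESIDUE_TOKENS = [
    "L", "A", "G", "V", "S", "E", "R", "T", "I", "D", "P", "K", "Q",
    "N", "F", "Y", "M", "H", "W", "C", "X", "B", "U", "Z", "O", ".", "-",
]

# Uppercase letters that have no token in the list above.
missing_letters = sorted(set(string.ascii_uppercase) - set(ESM_RESIDUE_TOKENS))
print(missing_letters)
```

If the list is accurate, J is the only uppercase letter without a token, which matches the KeyError reported in this issue.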
Hello! Is there a recommended solution for this? Should we substitute an X for every J in a protein record before calling extract.py?
Yes, I think that is the best solution.
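A minimal sketch of that workaround, assuming a plain FASTA input in which header lines start with ">". The function name `replace_j_with_x` is hypothetical and not part of the esm repository. Keep in mind that replacing J (leucine or isoleucine) with X (unknown) throws away the information that the residue is one of those two.

```python
def replace_j_with_x(fasta_text: str) -> str:
    """Return FASTA text with every uppercase J in sequence lines replaced by X.

    Header lines (starting with '>') are left untouched, so record IDs
    that happen to contain the letter J are not altered.
    """
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)  # header: keep as-is
        else:
            cleaned.append(line.replace("J", "X"))  # sequence: J -> X
    return "\n".join(cleaned) + "\n"


example = ">seq1 J-containing record\nMKTJAYIAKQR\nJLLG\n"
print(replace_j_with_x(example))
```

The cleaned text can then be written to a new FASTA file and passed to extract.py in place of the original input.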
code:
python extract.py esm1_t34_670M_UR50S run.fasta AB644285 --repr_layers 34 --include mean
Bug description:
Traceback (most recent call last):
  File "extract.py", line 136, in <module>
    main(args)
  File "extract.py", line 83, in main
    for batch_idx, (labels, strs, toks) in enumerate(data_loader):
  File "/home1//anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home1//anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/home1//anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home1//tool/esm/esm/data.py", line 258, in __call__
    seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
  File "/home1//tool/esm/esm/data.py", line 258, in <listcomp>
    seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
  File "/home1/ /tool/esm/esm/data.py", line 243, in encode
    return [self.tok_to_idx[tok] for tok in self.tokenize(text)]
  File "/home1/*/tool/esm/esm/data.py", line 243, in <listcomp>
    return [self.tok_to_idx[tok] for tok in self.tokenize(text)]
KeyError: 'J'
When I input a protein sequence like:
Why does this error occur? I would appreciate it if you could help me.
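For context, the last frames of the traceback show the encoder looking each token up in a dictionary (`self.tok_to_idx[tok]`), so any character missing from the vocabulary raises KeyError. Below is a toy reproduction, assuming a small made-up vocabulary rather than the real ESM alphabet. Both `encode` and `unsupported_residues` are hypothetical stand-ins, not functions from the esm library.

```python
from typing import List

# Toy vocabulary for illustration only; the real ESM alphabet is larger.
tok_to_idx = {tok: idx for idx, tok in enumerate("ACDEFGHIKLMNPQRSTVWYXBZ")}


def encode(sequence: str) -> List[int]:
    """Map each residue to its index with a plain dict lookup."""
    return [tok_to_idx[tok] for tok in sequence]


def unsupported_residues(sequence: str) -> List[str]:
    """Return the distinct characters that would raise KeyError in encode."""
    return sorted(set(sequence) - set(tok_to_idx))


# Check a sequence up front, then show the same failure as in the traceback.
print(unsupported_residues("MKTJAYJ"))
try:
    encode("MKTJAYJ")
except KeyError as err:
    print("KeyError:", err)
```

Scanning the input with something like `unsupported_residues` before running the model gives a readable report of the offending characters instead of a failure deep inside the data loader.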