facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Error while compute embeddings in bulk from FASTA #615

Closed smruti241 closed 1 year ago

smruti241 commented 1 year ago

While computing embeddings in bulk from a FASTA file with

```
esm-extract esm2_t33_650M_UR50D all_spike_protein.fasta outputs/ --repr_layers 0 32 33 --include mean per_tok --nogpu
```

I am facing this error:

```
Traceback (most recent call last):
  File "/raid/home/smrutip/anaconda3/envs/genslm/bin/esm-extract", line 8, in <module>
    sys.exit(main())
  File "/raid/home/smrutip/anaconda3/envs/genslm/lib/python3.9/site-packages/esm/scripts/extract.py", line 137, in main
    run(args)
  File "/raid/home/smrutip/anaconda3/envs/genslm/lib/python3.9/site-packages/esm/scripts/extract.py", line 88, in run
    for batch_idx, (labels, strs, toks) in enumerate(data_loader):
  File "/raid/home/smrutip/anaconda3/envs/genslm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/raid/home/smrutip/anaconda3/envs/genslm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/raid/home/smrutip/anaconda3/envs/genslm/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/raid/home/smrutip/anaconda3/envs/genslm/lib/python3.9/site-packages/esm/data.py", line 266, in __call__
    seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
  File "/raid/home/smrutip/anaconda3/envs/genslm/lib/python3.9/site-packages/esm/data.py", line 266, in <listcomp>
    seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
  File "/raid/home/smrutip/anaconda3/envs/genslm/lib/python3.9/site-packages/esm/data.py", line 250, in encode
    return [self.tok_to_idx[tok] for tok in self.tokenize(text)]
  File "/raid/home/smrutip/anaconda3/envs/genslm/lib/python3.9/site-packages/esm/data.py", line 250, in <listcomp>
    return [self.tok_to_idx[tok] for tok in self.tokenize(text)]
KeyError: 'J'
```

Can you please tell me what this error is about and how to rectify it? @tomsercu , @joshim5, @rmrao , @naailkhan28 , @liujas000 , @nikitos9000 , @ebetica , @chloechsu , @YaoYinYing

mdlakic commented 1 year ago

Could it be that you have a "J" in one of your fasta sequences?
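To locate such sequences, here is a minimal sketch of a pre-flight check. The `ESM_RESIDUES` set approximates the protein tokens in ESM's alphabet (verify against `esm.Alphabet` in your installed version), and `find_bad_residues` / `sanitize` are hypothetical helpers, not part of the esm package. Mapping the ambiguity code `J` (Leu or Ile) to `L` is one common workaround, but whether that substitution is appropriate depends on your data.

```python
# Residues ESM's alphabet can typically encode; note that the
# ambiguity code J is NOT among them, which triggers KeyError: 'J'.
# (Assumption: check esm.Alphabet in your installed esm version.)
ESM_RESIDUES = set("LAGVSERTIDPKQNFYMHWCXBUZO")

def read_fasta(path):
    """Yield (record_id, sequence) pairs from a FASTA file (stdlib only)."""
    record_id, parts = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if record_id is not None:
                    yield record_id, "".join(parts)
                record_id, parts = line[1:].split()[0], []
            elif line:
                parts.append(line)
    if record_id is not None:
        yield record_id, "".join(parts)

def find_bad_residues(path):
    """Yield (record_id, offending_characters) for sequences that
    contain tokens outside ESM_RESIDUES."""
    for rid, seq in read_fasta(path):
        bad = set(seq.upper()) - ESM_RESIDUES
        if bad:
            yield rid, bad

def sanitize(seq):
    """One possible fix: map the Leu/Ile ambiguity code J to L."""
    return seq.upper().replace("J", "L")
```

Running `find_bad_residues` on the input FASTA before `esm-extract` reports exactly which records would fail to tokenize, instead of aborting mid-batch.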

smruti241 commented 1 year ago

Thanks @mdlakic, it worked.