Rostlab / SeqVec

Modelling the Language of Life - Deep Learning Protein Sequences
http://embed.protein.properties
MIT License

IndexError: list index out of range #25

Closed HinamonAmu closed 2 years ago

HinamonAmu commented 2 years ago

Hi, when I run the following command to get the embeddings for the "fallback_test_sequences.fasta" file, an error occurs:

seqvec -i fallback_test_sequences.fasta -o test.npz

IndexError: list index out of range

I can successfully run the command "seqvec -i sequences.fasta -o embeddings.npz", and I don't think there is any difference between "fallback_test_sequences.fasta" and "sequences.fasta". Could you please tell me how to solve this? Thanks and best regards.

mheinzinger commented 2 years ago

Thanks for bringing this up; I was able to reproduce the error on our end. It is caused by how the FASTA headers are interpreted and split. We try to retrieve a unique protein identifier from each FASTA header by splitting it on the character given by the --split-char parameter and picking one of the resulting fields using the --id parameter. The defaults are based on UniProt/SwissProt headers, which can be split on '|', with the element at index 1 used as the identifier (so for the first header in sequences.fasta, ">sp|P00864|...", this results in "P00864"). However, the headers in fallback_test_sequences.fasta contain no '|', so splitting yields a single field, and accessing the field with index 1 raises an IndexError.
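The header handling described above can be illustrated with a minimal Python sketch. The function name extract_id and its arguments are hypothetical; they only mirror the behaviour of the --split-char and --id options as described in this comment and are not taken from SeqVec's code. The header ">SEQ1" is a made-up example of a header without '|'.

```python
def extract_id(header, split_char="|", id_field=1):
    """Return one field of a FASTA header (hypothetical helper, not SeqVec's code)."""
    # Drop the leading '>' and split the rest of the header on the separator.
    fields = header.lstrip(">").strip().split(split_char)
    # Raises IndexError when the header has fewer than id_field + 1 fields.
    return fields[id_field]


uniprot_header = ">sp|P00864|..."  # header prefix as quoted in this thread
plain_header = ">SEQ1"             # made-up header without any '|'

print(extract_id(uniprot_header))            # P00864
print(extract_id(plain_header, id_field=0))  # SEQ1

try:
    extract_id(plain_header)  # default index 1, but only one field exists
except IndexError as err:
    print("IndexError:", err)
```

With the default index 1, the header without '|' has only field 0 available, which is why the lookup fails.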

An easy solution for your problem is to adjust the index of the header field that you would like to use as the unique identifier. For fallback_test_sequences.fasta you can simply set it to 0 to avoid this error:

seqvec -i fallback_test_sequences.fasta -o test.npz --id 0
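To find out in advance which field index is safe for a given file, one option is to scan its headers before running seqvec. The following is a rough sketch under the same assumptions about header splitting as described in this thread; headers_missing_field is a hypothetical helper and not part of SeqVec, and the example records are placeholders.

```python
def headers_missing_field(fasta_lines, split_char="|", id_field=1):
    """List headers with too few fields for id_field (hypothetical helper)."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            # Split the header the same way the identifier lookup would.
            fields = line[1:].strip().split(split_char)
            if id_field >= len(fields):
                bad.append(line.strip())
    return bad


# Placeholder records: one UniProt-style header, one header without '|'.
example = [">sp|P00864|...", "MNEQ", ">SEQ1", "MKT"]

print(headers_missing_field(example))              # [">SEQ1"] would fail with index 1
print(headers_missing_field(example, id_field=0))  # [] so index 0 works for all headers
```

For a real file, an open file handle can be passed instead of the list, since the function only iterates over lines.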

In general, I do not consider this a bug, as the current behaviour gives users the flexibility to choose which field of the FASTA header is used as the unique protein ID.