Closed by HinamonAmu 2 years ago
Thanks for bringing this up; I was able to reproduce the error on our end. It is caused by how the FASTA headers are interpreted and split. We try to retrieve a unique protein identifier from each FASTA header by splitting it on the character given by the --split-char parameter and picking one of the resulting fields via the --id parameter. The defaults are based on UniProt/SwissProt headers, which are split on '|' and use the field at index 1 (so for the first header in sequences.fasta, ">sp|P00864|...", this yields "P00864"). The headers in fallback_test_sequences.fasta, however, contain no '|', so the split produces a single field and accessing index 1 raises an IndexError.
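To make the failure mode concrete, here is a minimal Python sketch of the splitting logic described above. It is only an illustration: the function name extract_id, its arguments, and the second header are made up for this example and are not taken from the seqvec code base.

```python
def extract_id(header, split_char="|", id_field=1):
    """Hypothetical helper: pick one field of a FASTA header as the ID.

    Mirrors the described behaviour of --split-char and --id,
    but is NOT seqvec's actual implementation.
    """
    # Drop the leading '>' and surrounding whitespace, then split.
    fields = header.lstrip(">").strip().split(split_char)
    # Raises IndexError if the header has fewer fields than expected.
    return fields[id_field]


uniprot_header = ">sp|P00864|..."      # header quoted from sequences.fasta
plain_header = ">seq_without_pipes"    # invented header without '|'

print(extract_id(uniprot_header))      # P00864

try:
    extract_id(plain_header)           # only one field, index 1 is missing
except IndexError as err:
    print("IndexError:", err)          # IndexError: list index out of range

# Using field 0 works for headers without the split character.
print(extract_id(plain_header, id_field=0))  # seq_without_pipes
```

With a header that lacks the split character, the split returns a one-element list, so only index 0 is valid.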
An easy solution is to adjust the index of the header field that you would like to use as the unique identifier. For fallback_test_sequences.fasta you can simply set it to 0 to avoid this error: seqvec -i fallback_test_sequences.fasta -o test.npz --id 0
In general, I do not consider this a bug, as it gives users the flexibility to choose which field of the FASTA header is used as the unique protein ID.
Hi, when I try to run the following command to get the embeddings of the "fallback_test_sequences.fasta" file: seqvec -i fallback_test_sequences.fasta -o test.npz I get the following error: IndexError: list index out of range
I can successfully run the command "seqvec -i sequences.fasta -o embeddings.npz", but I don't see any difference between "fallback_test_sequences.fasta" and "sequences.fasta". Could you please tell me how to solve this? Thanks and best regards.