'IndexError: list index out of range' after 'Number of phage genomes detected with mash distance of < 0.2 is:4'

GeoMicroSoares commented 1 year ago

Hi there @amillard,

I'm trying to apply this tool to my dataset of a couple hundred metagenome-recovered viruses (consensus viral sequences according to VIBRANT & VirSorter2 with >70% CheckV completeness, checked for eukaryotic sequences). However, I keep getting an IndexError: list index out of range error. More output below; the genome I tried the tool with is 378,353 bp, as indicated in the name.

$ python tax_myPHAGE/tax_myPHAGE.py -t 10 -i viruses_oneGenomeFasta/viral_scaffolds_mgshot_S7938Nr1_lt70_checkv_noEuks.id_mgshot_S7938Nr1_27_length_378353_cov_15.fasta --Figures F
Found virus.a/tax_myPHAGE.a/tax_myPHAGE/VMR.xlsx as expected
Found virus.a/tax_myPHAGE.a/tax_myPHAGE/ICTV.msh as expected
Found virus.a/tax_myPHAGE.a/tax_myPHAGE/Bacteriophage_genomes.fasta as expected

Warning: Directory 'virus.a/tax_myPHAGE.a/viral_scaffolds_mgshot_S7938Nr1_lt70_checkv_noEuks.id_mgshot_S7938Nr1_27_length_378353_cov_15_taxmyphage_results'already exists. All results will be overwritten.

        Number of phage genomes detected with mash distance of < 0.2 is:4
The mash distances obtained for this query phage
    is a minimum value of nan and maximum value of nan
{}
Found 0 genera associated with this query genome
Traceback (most recent call last):
  File "virus.a/tax_myPHAGE.a/tax_myPHAGE/tax_myPHAGE.py", line 484, in <module>
    keys = [k for k, v in accession_genus_dict.items() if v == unique_genera[0]]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/virus.a/tax_myPHAGE.a/tax_myPHAGE/tax_myPHAGE.py", line 484, in <listcomp>
    keys = [k for k, v in accession_genus_dict.items() if v == unique_genera[0]]
                                                               ~~~~~~~~~~~~~^^^
IndexError: list index out of range

As a recommendation by the way, it would be great to be able to direct the output to a specific directory (maybe via the prefix option), and more explicit information as to how to set up the tool & databases would also be helpful (that all databases should be in the cloned directory, for example). Thank you for making this tool available & in advance for the help - looking forward to checking my data out with it!

amillard commented 1 year ago

Hi @GeoMicroSoares

Thank you for the feedback and for using (or trying to use) it. I have now added more information on how to set it up, and added a test file that will run quickly.

Can you confirm whether you can get it to run with the test.fna file that is now provided, and whether you get the same result we do?

python tax_myPHAGE.py -i test.fna -t 8

It looks like it is failing because the query is not similar enough to anything else in the database at the moment. But it shouldn't just fail; it should give a message that the input is likely a new genus. So clearly something is going wrong that we haven't anticipated.

Can you confirm the above first, please? If that works, can you run one of your sequences with

python tax_myPHAGE/tax_myPHAGE.py -t 10 -i viruses_oneGenomeFasta/viral_scaffolds_mgshot_S7938Nr1_lt70_checkv_noEuks.id_mgshot_S7938Nr1_27_length_378353_cov_15.fasta --Figures F -v

Using -v gives us a bit more output. Does the "mash.txt" you get in your output directory have anything in it?
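The behaviour described above (report a likely new genus instead of crashing) can be sketched as a guard around the line that fails in the traceback. This is a minimal illustration, not the maintainers' actual fix: the function name `genera_for_hits` is hypothetical, and sorting the genera is an assumption made here so the result is deterministic. Only `accession_genus_dict` and `unique_genera` come from the traceback.

```python
def genera_for_hits(accession_genus_dict):
    """Return (genus, accessions) for the first genus, or (None, []) if there is none.

    Hypothetical helper: the real tool builds `unique_genera` elsewhere.
    """
    # Assumption: unique genera are derived from the dict values; sorted for determinism
    unique_genera = sorted(set(accession_genus_dict.values()))

    if not unique_genera:
        # No hit could be linked to a genus, so indexing [0] would raise IndexError.
        # Report the situation instead of crashing.
        print("No genera associated with this query genome; it may represent a new genus.")
        return None, []

    first_genus = unique_genera[0]
    # Same comprehension as in the traceback, now only reached when a genus exists
    keys = [k for k, v in accession_genus_dict.items() if v == first_genus]
    return first_genus, keys
```

Called with the empty dictionary printed in the log above (`{}`), this returns `(None, [])` rather than raising.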

rdenise commented 1 year ago

@GeoMicroSoares Was your problem fixed by the new release?

778055611 commented 1 year ago

@amillard Hi, I encountered a similar situation. I tried to run

python tax_myPHAGE.py -i test.fna -t 8

but I couldn't find your test file (test.fna). Could you please help me solve this problem?

amillard commented 1 year ago

@778055611

Sorry, the file was moved around in the re-organisation. There is a file UP30.fsa in the Uploads folder; you can use that as a test file.

igortru commented 1 year ago

UP30.fsa

Query sequence is:
Class: Caudoviricetes
Family: Drexlerviridae
Subfamily: Tunavirinae
Genus: Tunavirus
Species: Tunavirus new_name

MN478483 is Taipeivirus ICTV exemplar

taxmyphage -i MN478483.fasta -t 8
Number of phage genomes detected with mash distance of < 0.2 is:5
Classifying: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "~/.local/bin/taxmyphage", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "~/.local/lib/python3.11/site-packages/taxmyphage/main.py", line 98, in main
    mash_df, accession_genus_dict = classification_mash(
                                    ^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/taxmyphage/classify.py", line 193, in classification_mash
    keys = [k for k, v in accession_genus_dict.items() if v == unique_genera[0]]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/taxmyphage/classify.py", line 193, in <listcomp>
    keys = [k for k, v in accession_genus_dict.items() if v == unique_genera[0]]
IndexError: list index out of range

igortru commented 1 year ago

It looks like there is a problem with reading the FASTA defline in the query file.

I changed the defline from

gi|1774218871|gb|MN478483.1| Klebsiella phage UPM 2146, complete genome

to

MN478483

and now the report looks fine.

rdenise commented 1 year ago

OK, so it is mostly because the genome identifier has characters that are not allowed in a folder name. I'll try to fix that in the new release.
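As an illustration of the workaround discussed above, a defline can be reduced to a folder-safe identifier before it is used in a path. This is only a sketch under stated assumptions, not what taxmyphage does internally: the function name `safe_genome_id` is hypothetical, the set of allowed characters is a conservative guess, and taking the fourth pipe-delimited field assumes the legacy NCBI `gi|...|gb|ACCESSION|` layout seen in this thread.

```python
import re


def safe_genome_id(defline: str) -> str:
    """Derive a folder-safe identifier from a FASTA defline (hypothetical helper)."""
    tokens = defline.lstrip(">").split()
    if not tokens:
        raise ValueError("empty FASTA defline")
    identifier = tokens[0]  # text before the first whitespace

    # Legacy NCBI deflines look like gi|<number>|gb|<accession>|; keep the accession
    fields = [f for f in identifier.split("|") if f]
    if len(fields) >= 4 and fields[0] == "gi":
        identifier = fields[3]

    # Replace anything other than letters, digits, dot, underscore or hyphen
    return re.sub(r"[^A-Za-z0-9._-]", "_", identifier)
```

For the defline quoted earlier in the thread this keeps the versioned accession (`MN478483.1`); dropping the version suffix, as was done manually above, would be a separate choice.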

amillard / tax_myPHAGE

'IndexError: list index out of range' after 'Number of phage genomes detected with mash distance of < 0.2 is:4' #1