gbouras13 / pharokka

fast phage annotation program
MIT License

ValueError: Duplicate key (while no duplicated ids in input) #361

Open art-egorov opened 4 weeks ago

art-egorov commented 4 weeks ago

Description

Hi!

When running pharokka in meta mode, it fails with an odd error about a duplicate key that looks like a contig ID. It fails only on a subset of sequences (most runs are fine), and the reported duplicate key is not present in any of the input files.

What I Did

Command:

pharokka.py -i FAILED_SEQS.fa  -o pharokka_batches/ALL_SEQS  --meta --split -t 45 --skip_mash --dnaapler  --database pharokka/pharokka_v1.4.0_databases

logs:

2024-10-04 13:21:28.625 | INFO     | external_tools:run:50 - Started running mmseqs createtsv pharokka/pharokka_v1.4.0_databases/vfdb pharokka_batches/ALL_SEQS/VFDB_target_dir/target_seqs pharokka_batches/ALL_SEQS/VFDB/results_mmseqs pharokka_batches/ALL_SEQS/vfdb_results.tsv --full-header --threads 45 ...
2024-10-04 13:21:28.773 | INFO     | external_tools:run:52 - Done running mmseqs createtsv pharokka/pharokka_v1.4.0_databases/vfdb pharokka_batches/ALL_SEQS/VFDB_target_dir/target_seqs pharokka_batches/ALL_SEQS/VFDB/results_mmseqs pharokka_batches/ALL_SEQS/vfdb_results.tsv --full-header --threads 45
2024-10-04 13:21:28.828 | INFO     | __main__:main:364 - Post Processing Output.
Traceback (most recent call last):
  File "/home/aegorov/.conda/envs/pharokka_env/bin/pharokka.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/lunarc/nobackup/projects/lutafold/aegorov/Hotspots/Phages/pharokka/bin/pharokka.py", line 489, in <module>
    main()
  File "/lunarc/nobackup/projects/lutafold/aegorov/Hotspots/Phages/pharokka/bin/pharokka.py", line 403, in main
    pharok.process_results()
  File "/lunarc/nobackup/projects/lutafold/aegorov/Hotspots/Phages/pharokka/bin/post_processing.py", line 204, in process_results
    prot_dict = SeqIO.to_dict(SeqIO.parse(fasta_input_aas_tmp, "fasta"))
  File "/home/aegorov/.conda/envs/pharokka_env/lib/python3.10/site-packages/Bio/SeqIO/__init__.py", line 754, in to_dict
    raise ValueError(f"Duplicate key '{key}'")
ValueError: Duplicate key 'TemPhD_cluster_4683480'
(pharokka_env) [aegorov@cn001 Phages]$ cat FAILED_SEQS.fa  | grep TemPhD_cluster_4683480
(pharokka_env) [aegorov@cn001 Phages]$ 
(pharokka_env) [aegorov@cn001 Phages]$ grep TemPhD_cluster_4683480 PhageScope_annotation_filtered.tsv 
(pharokka_env) [aegorov@cn001 Phages]$ 

It seems that a numeric suffix is appended to the contig ID in the prodigal output, and the result then collides with the IDs built from other contigs?

pharokka_batches/ALL_SEQS/prodigal-gv_aas_tmp.fasta:>TemPhD_cluster_4683480 1299_2654
pharokka_batches/ALL_SEQS/prodigal-gv_aas_tmp.fasta:>TemPhD_cluster_4683480 1_1272
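The two headers above share the same first token, and Biopython's FASTA parser uses that first whitespace-delimited token as the record ID, which is the default key in SeqIO.to_dict. Below is a minimal sketch for listing such duplicates in a FASTA file before building a dictionary. The function name duplicate_fasta_ids and the placeholder sequences are made up for illustration; this is not part of pharokka.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return {id: count} for FASTA record IDs that occur more than once.

    The ID is the first whitespace-delimited token after '>'.
    """
    ids = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    counts = Counter(ids)
    return {rec_id: n for rec_id, n in counts.items() if n > 1}


# Headers copied from the temporary file above; the sequences are placeholders.
example = [
    ">TemPhD_cluster_4683480 1299_2654",
    "MKV",
    ">TemPhD_cluster_4683480 1_1272",
    "MAA",
]
print(duplicate_fasta_ids(example))  # {'TemPhD_cluster_4683480': 2}
```

To check a real file, pass an open file handle instead of the list, for example duplicate_fasta_ids(open(path)).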

For instance, the input FASTA file contains the following IDs:

(pharokka_env) [aegorov@cn001 Phages]$ grep "4683" FAILED_SEQS.fa 
>TemPhD_cluster_46833
>TemPhD_cluster_46834
>TemPhD_cluster_46835
>TemPhD_cluster_46836
>TemPhD_cluster_46837
>TemPhD_cluster_46838
>TemPhD_cluster_46839
>TemPhD_cluster_4683
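If the gene IDs really are built by joining the contig ID and a number without a separator (an assumption here, not confirmed from the pharokka source), then any pair of contig IDs where one equals the other plus trailing digits is at risk. A rough pre-flight check along those lines, with the hypothetical helper name digit_suffix_collisions:

```python
def digit_suffix_collisions(contig_ids):
    """Return (shorter, longer, digits) triples where longer == shorter + digits.

    Such pairs become ambiguous if a numeric suffix is appended to a contig ID
    with no separator (assumed mechanism, not verified against pharokka).
    """
    ids = set(contig_ids)
    hits = []
    for long_id in sorted(ids):
        prefix = long_id
        # Strip trailing ASCII digits one at a time and look for a shorter ID.
        while prefix and prefix[-1] in "0123456789":
            prefix = prefix[:-1]
            if prefix in ids:
                hits.append((prefix, long_id, long_id[len(prefix):]))
    return hits


# Contig IDs taken from the grep output above.
contigs = ["TemPhD_cluster_4683"] + [f"TemPhD_cluster_4683{d}" for d in range(3, 10)]
for shorter, longer, digits in digit_suffix_collisions(contigs):
    print(shorter, "+", digits, "->", longer)

# Hypothetical suffixes, only to show why unseparated joining is ambiguous:
assert "TemPhD_cluster_4683" + "480" == "TemPhD_cluster_46834" + "80"
```

An empty result does not prove the run is safe; it only rules out this particular pattern.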

Anyway, is there anything to do to fix such exceptions? Thanks in advance

Best, Artyom

art-egorov commented 4 weeks ago

Small update: you can avoid the error by adding a non-numeric suffix to each contig ID, but it would still be nice to be able to run on any set of unique IDs.
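A sketch of that workaround, rewriting the FASTA headers before running pharokka. The function name add_id_suffix and the default suffix "_ctg" are arbitrary choices for illustration; any suffix that ends in a non-digit character should serve the same purpose.

```python
def add_id_suffix(lines, suffix="_ctg"):
    """Yield FASTA lines with `suffix` appended to every record ID.

    Only the first token of each header line is changed; any description
    after the ID and all sequence lines are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            if not parts:
                # Header with no ID at all: leave it untouched.
                yield line
                continue
            new_header = ">" + parts[0] + suffix
            if len(parts) > 1:
                new_header += " " + parts[1]
            yield new_header
        else:
            yield line


# Placeholder sequences; IDs taken from the issue.
records = [">TemPhD_cluster_4683", "ACGT", ">TemPhD_cluster_46833 some description", "GGCC"]
for out_line in add_id_suffix(records):
    print(out_line)
```

For a file on disk, iterate over the open input handle and write each yielded line, followed by a newline, to a new file that is then passed to pharokka with -i.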

gbouras13 commented 3 weeks ago

I fully agree @art-egorov. The issue is with phanotate, which makes the output hard to parse, and I didn't know a smarter way back when I originally coded pharokka.

When I have some dev time, I'll try to think of a better solution.

George