gbouras13 / pharokka

fast phage annotation program
MIT License

dropped contigs #316

Closed k6logc closed 8 months ago

k6logc commented 9 months ago

Description

Dear George,

Thank you for pharokka! We have it running well overall, but some contigs are dropping out and I can't tell what's going on. The entire folders for the contigs that error out are automatically deleted. I can see nothing remarkable about the dropped contigs in terms of sequence or name. Any guidance very much appreciated!

Best, Kathryn

What I Did

Example 1 (contig JAAE01000196.1; yes, super short but there are others that are shorter): Errors out at customdb step (runs fine for other contigs):

2023-12-22 13:47:33.531 | INFO     | __main__:main:367 - Running PyHMMER on custom HMM database /projects/academic/kmkauffm/00.shared/00.cenoteTaker2.a/Cenote-Taker2/hmmscan_DBs_K2_concat_and_re-press/cenoteHmmsRepressed.deduped.h3m.
Traceback (most recent call last):
  File "/projects/academic/kmkauffm/kauffman/00.mambaforge/envs/pharokka.v1.5.1/bin/pharokka.py", line 496, in <module>
    main()
  File "/projects/academic/kmkauffm/kauffman/00.mambaforge/envs/pharokka.v1.5.1/bin/pharokka.py", line 368, in main
    best_results_custom_pyhmmer = run_custom_pyhmmer(
  File "/projects/academic/kmkauffm/kauffman/00.mambaforge/envs/pharokka.v1.5.1/bin/custom_db.py", line 58, in run_custom_pyhmmer
    if best_results[result.protein].custom_hmm_id != hit.custom_hmm_id:
AttributeError: 'pyhmmer.plan7.Hit' object has no attribute 'custom_hmm_id'

Example 2 (contig LVER01000014.1): Errors out after CARD AMR Step (runs fine for other contigs):

2023-12-22 13:52:03.197 | INFO     | post_processing:process_card_results:2550 - 0 CARD AMR genes identified.
Traceback (most recent call last):
  File "/projects/academic/kmkauffm/kauffman/00.mambaforge/envs/pharokka.v1.5.1/bin/pharokka.py", line 496, in <module>
    main()
  File "/projects/academic/kmkauffm/kauffman/00.mambaforge/envs/pharokka.v1.5.1/bin/pharokka.py", line 426, in main
    pharok.create_tbl()
  File "/projects/academic/kmkauffm/kauffman/00.mambaforge/envs/pharokka.v1.5.1/bin/post_processing.py", line 1267, in create_tbl
    ""
TypeError: can only concatenate str (not "float") to str

gbouras13 commented 8 months ago

Hi @k6logc,

I take it (based on a Google of the contig names) these are from the Human Oral Microbiome Database, so I can hopefully reproduce this with the exact contigs.

With error 2, did you specify a custom numeric locus tag? It is erroring in the string parsing step with locus tags.

With error 1, my guess is that there is no predicted gene as per prodigal on that contig, which errors out with a custom DB.

Regarding "The entire folders for the contigs that error out are automatically deleted": are you running this with a workflow manager? If Pharokka errors out it shouldn't delete anything, so that is very strange.

I'll try and reproduce the bugs and get back to you.

George

gbouras13 commented 8 months ago

After reproducing error 1 and looking a bit deeper, I can see my earlier guess was wrong.

The error is raised by the line below when a protein has 2 identically scored hits to different custom HMM profiles in your custom database.

if best_results[result.protein].custom_hmm_id != hit.custom_hmm_id:
AttributeError: 'pyhmmer.plan7.Hit' object has no attribute 'custom_hmm_id'

A fix will be implemented in v1.6.
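As a minimal sketch of the tie case described above (not the actual v1.6 change), the best hit per protein can be chosen by comparing the stored record with the incoming record, rather than with a raw `pyhmmer.plan7.Hit`, which has no `custom_hmm_id` attribute. The `Result` record, its field names, and `best_hit_per_protein` are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record type; the field names are assumptions for illustration.
Result = namedtuple("Result", ["protein", "custom_hmm_id", "bitscore", "evalue"])


def best_hit_per_protein(results):
    """Return (best, tied).

    best maps each protein to its highest-scoring Result.
    tied holds proteins whose top score is shared by different HMM profiles.
    """
    best = {}
    tied = set()
    for result in results:
        previous = best.get(result.protein)
        if previous is None or result.bitscore > previous.bitscore:
            # New protein or strictly better score: replace and clear any tie.
            best[result.protein] = result
            tied.discard(result.protein)
        elif (
            result.bitscore == previous.bitscore
            and result.custom_hmm_id != previous.custom_hmm_id
        ):
            # Equal score against a different profile: keep the first hit seen,
            # but record the protein as ambiguous.
            tied.add(result.protein)
    return best, tied
```

The caller can then decide what to do with the proteins in `tied` (keep the first hit, drop them, or report both profiles); that policy is a design choice this sketch leaves open.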

George

k6logc commented 8 months ago

Hi @gbouras13,

Thank you so much for having a look and sorting out what's going on.

Great that you plan to implement a fix for error 1 in v1.6, thank you. Do you have a sense of approximately when v1.6 will be out?

For error 2: no, we are not currently using custom numeric locus tags. You are right that the contigs come from HOMD.org. We are using the PROKKA versions (https://www.homd.org/ftp/genomes/PROKKA/V10.1/fna/), so the headers are prefixed with SEQFxxxxx.x. We pass them to geNomad and currently take the predicted phage regions into pharokka (as, for example, SEQF10001.1_JAAE01000196.1.fna, SEQF10002.1_LVER01000014.1.fna, or SEQF10001.1_JAAE01000004.1_provirus_6680_45243.fna) using a snakemake workflow (looping in @AmrutaIdagunji, who is working on this with me). Ideally we would switch to passing in gbks, but we ran into some issues with this and will revisit.

Thank you!

Best, Kathryn

gbouras13 commented 8 months ago

Hi @k6logc ,

Hopefully v1.6 will be done sometime in the next week; I'm just working through all the issues in the repository that have piled up since November. Your approach seems very reasonable to me!

The Snakemake wrapper would explain the deleted folders: when a job fails, Snakemake deletes the output files that job generated.
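Because the output folder disappears with the failed job, one way to keep the traceback is to write the tool's output to a log file outside that folder. The helper below is a hedged sketch using only the Python standard library; `run_and_keep_log` and the paths passed to it are hypothetical and not part of pharokka or Snakemake.

```python
import subprocess
import sys
from pathlib import Path


def run_and_keep_log(cmd, log_path):
    """Run cmd, writing stdout and stderr to log_path; return the exit code."""
    log_path = Path(log_path)
    # Create the log directory if needed; it should live outside the job's outputs.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w") as log:
        completed = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode
```

For example, `run_and_keep_log([sys.executable, "-c", "raise SystemExit(1)"], "logs/demo.log")` returns 1 and leaves `logs/demo.log` in place regardless of what happens to the job's output directory.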

With error 2, pharokka is assuming the locus tag is a float (not a string). Some bad coding by me to make this ambiguous.
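To illustrate the class of error (not the actual fix), `"" + some_float` raises `TypeError: can only concatenate str (not "float") to str`, which can happen when a value expected to be text arrives as a float, including a NaN for a missing value. A minimal sketch of a defensive conversion, where `as_text` is a hypothetical helper:

```python
import math


def as_text(value, default=""):
    """Return value as a string, treating None and float NaN as missing."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    return str(value)


# Concatenation is now safe whatever type the tag arrived as.
line = "locus_tag\t" + as_text(float("nan"))
```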

While I can't reproduce the error (I've tried a few different ways), I have put in a fix that should hopefully resolve it.

George

k6logc commented 8 months ago

Hi @gbouras13,

Awesome re: the v1.6 update, excited to update, good luck!
And thank you for rolling in a fix for error 2 and for explaining that snakemake accounts for the deletions.

Best wishes, Kathryn