gbouras13 / pharokka

fast phage annotation program
MIT License

Pharokka fails on large dataset #308

Closed nelsonruth11 closed 5 months ago

nelsonruth11 commented 7 months ago

Description

I've successfully used pharokka dozens of times, same version, same script, same HPC node/resource allocation. I tried running it on a large dataset (~490k contigs, N50=50kbp, 18GB size) with this command:

pharokka.py -i $fasta -o $output_dir -d $db -t 14 -m -p img_hq

I know my file paths are correct, but I get this output:

Traceback (most recent call last):
  File "/fake/path/username/miniconda3/envs/pharokka/bin/phanotate.py", line 55, in <module>
    shortest_path = fz.get_path(source=source, target=target)
TypeError: No path to target

I then subsetted the original dataset to 1,000 contigs and it worked just fine. The error typically shows up after an hour or so, and the job hangs. There are folders in the output directory of the failed run: input_split_tmp, single_gbks, single_gffs, as well as the log.

Here are the log contents:

2023-12-06 09:57:45,486 - INFO - Starting pharokka v1.3.2
2023-12-06 09:57:45,486 - INFO - Input args: Namespace(infile='/fake/path/img_vr-20230705/hq_contigs.fa', outdir='/fake/path/analysis/pharokka/pharokka_out/img_hq', database='/fake/path/analysis/pharokka/db', threads='14', force=False, prefix='img_hq', locustag='Default', gene_predictor='phanotate', meta=True, split=False, coding_table='11', evalue='1E-05', terminase=False, terminase_strand='nothing', terminase_start='nothing', citation=False)
2023-12-06 09:57:45,487 - INFO - Checking database installation.
2023-12-06 09:57:45,517 - INFO - All databases have been successfully checked.
2023-12-06 09:57:45,517 - INFO - Checking dependencies.
2023-12-06 09:57:45,722 - INFO - phan_out version found is v1.5.1.
2023-12-06 09:57:45,722 - INFO - Phanotate version is ok.
2023-12-06 09:57:46,148 - INFO - MMseqs2 version found is v13.45111.
2023-12-06 09:57:46,148 - INFO - MMseqs2 version is ok.
2023-12-06 09:57:46,286 - INFO - tRNAscan-SE version found is v2.45111.12.
2023-12-06 09:57:46,286 - INFO - tRNAscan-SE version is ok.
2023-12-06 09:57:47,730 - INFO - MinCED version found is v0.4.2.
2023-12-06 09:57:47,730 - INFO - MinCED version is ok.
2023-12-06 09:57:47,735 - INFO - ARAGORN version found is v1.2.41.
2023-12-06 09:57:47,735 - INFO - ARAGORN version is ok.
2023-12-06 09:57:47,960 - INFO - mash version found is v2.3.
2023-12-06 09:57:47,960 - INFO - mash version is ok.
2023-12-06 09:58:45,185 - INFO - 486691 input contigs detected.
2023-12-06 13:24:58,643 - INFO - Running Phanotate.
2023-12-06 13:24:58,649 - INFO - Applying meta mode.

My first guess is that I'm running out of memory, which can be avoided easily enough, but I was wondering if you've seen this bug before/had some potential fixes.

Cheers and thanks for this great tool! -Nelson

gbouras13 commented 7 months ago

Hi @nelsonruth11 ,

Unreal, this is an awesome use of Pharokka!

I am fairly sure your error is caused by large, likely non-phage (probably bacterial?) contigs larger than 1-2 Mbp in your contig set. I have run Phanotate/Pharokka on large contigs before just to test it out, and it breaks in the way you report.

With such a big dataset, honestly I would recommend you update pharokka to v1.5.1 and specify -g prodigal-gv. This will let you detect stop codon recoding in some contigs (for more see https://github.com/apcamargo/prodigal-gv ).

In 490k contigs, there are almost certain to be some!

George

nelsonruth11 commented 7 months ago

Hmm. I'm skeptical about bacterial genome presence; the dataset consists of only high-confidence and high-quality viral genomes from the IMG-VR DB v4.1 (>= 90% completion, and each genome has at least 3 points from:

- 3 points: clustered with a RefSeq r213 virus genome in a vOTU; three or more geNomad virus hallmark markers
- 2 points: high-confidence CheckV AAI completeness estimate; two geNomad virus hallmark markers
- 1 point: medium-confidence CheckV AAI completeness estimate; one geNomad virus hallmark marker; direct or inverted terminal repeats; two or more matches to CRISPR spacers

They've all been given a fairly deep viral taxonomic classification, but some non-viral genomes could've slipped through the cracks.

Is the issue non-viral genomes or large genomes? The largest genome in this dataset is 1,152,313 bps (!), with ~1,700 Megaviricetes in the dataset. I'd like to upgrade the version, but I am on a time crunch for a paper that uses v1.3.2 in several analyses and I'd rather avoid redoing them with a newer version.
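To check how many contigs sit near that upper end before annotating, a small stdlib-only Python helper (a sketch, not part of Pharokka; the function and file names here are illustrative) can tally contig lengths from the FASTA:

```python
def contig_lengths(lines):
    """Return {contig_id: length} from an iterable of FASTA lines.

    The contig id is taken as the first whitespace-separated token
    after '>'. Blank lines are ignored.
    """
    lengths, header = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].split()[0]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)
    return lengths

# Illustrative usage: print the 20 longest contigs in the input file.
# with open("hq_contigs.fa") as fh:
#     lengths = contig_lengths(fh)
# for name, n in sorted(lengths.items(), key=lambda kv: -kv[1])[:20]:
#     print(name, n)
```

Sorting the result descending makes it easy to spot any contig over ~1 Mbp that might be the one tripping up Phanotate.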

I think my next move is to run it until failure, then dig into those output files to see which genome is causing the error.

Thanks for your help, -Nelson

gbouras13 commented 7 months ago

The issue is with the length, not the bacterialness of the genomes (I perhaps naively assumed a few 2 Mbp+ contigs had snuck through and that they were bacterial - but perhaps we have a new 1.1 Mbp phage?!? Keen to read the paper/preprint when you're done).

I'd split your dataset: take all contigs under 500 kbp and run them - there should be no issue. Then run the ones over 500 kbp separately (perhaps with more memory/time).
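The split suggested above can be sketched with a short stdlib-only Python helper (illustrative only; the 500 kbp cutoff and file names are from this thread, not anything built into Pharokka):

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_by_length(records, cutoff=500_000):
    """Partition (header, seq) pairs into (short, long) lists at `cutoff` bp."""
    short, long_ = [], []
    for header, seq in records:
        (short if len(seq) < cutoff else long_).append((header, seq))
    return short, long_

def write_fasta(path, records, width=60):
    """Write (header, seq) pairs to `path`, wrapping sequences at `width`."""
    with open(path, "w") as fh:
        for header, seq in records:
            fh.write(f">{header}\n")
            for i in range(0, len(seq), width):
                fh.write(seq[i:i + width] + "\n")

# Illustrative usage:
# with open("hq_contigs.fa") as fh:
#     short, long_ = split_by_length(read_fasta(fh))
# write_fasta("contigs_under_500kb.fa", short)
# write_fasta("contigs_over_500kb.fa", long_)
```

The two output files can then be fed to separate Pharokka runs, with more memory/time requested for the long-contig batch.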

One comment: for 490k contigs, Phanotate is going to take a very long time, as it is just a bit slow compared to Prodigal - you might be better off with -g prodigal for this use case even while sticking with v1.3.2 (and this problem will go away!).

George

gbouras13 commented 5 months ago

Closing this as I assume you have gotten around this issue @nelsonruth11