AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0
204 stars 25 forks source link

hmmsearch issue with large target sequence #50

Closed gavinmdouglas closed 2 years ago

gavinmdouglas commented 2 years ago

Hi there,

Thanks for making this extremely useful and well-documented tool!

I noticed that a genome I happened to know was very high quality based on checkm was dropped by GToTree due to no SCGs being identified. I looked into further and it was because this error occurred during the hmmsearch stage:

Fatal exception (source file p7_pipeline.c, line 697):
Target sequence length > 100K, over comparison pipeline limit.
(Did you mean to use nhmmer/nhmmscan?)
/home/gdouglas/local/miniconda3/envs/gtotree/bin/gtt-fasta-parallel.sh: line 70: 3753983 Aborted                 (core dumped) hmmsearch --cut_ga --cpu $num_cpus --tblout $
{tmp_dir}/${assembly}_curr_hmm_hits.tmp $hmm_file ${tmp_dir}/${assembly}_genes.tmp > /dev/null

This was with this assembly: https://www.ncbi.nlm.nih.gov/assembly/GCF_001865985.1 (and two other test E. coli genomes with the default Bacteria HMMs, and all default parameters).

There may not be anything to do on the GToTree side of things, as this seems to be internal to hmmsearch, but I just thought I would mention it here in case others run into the same error. One thing to note is that it took me a while to figure out what the problem was, and I would have just assumed that genome was very incomplete if I hadn't known otherwise, which could mean that on rare occasions genomes are left out of the final phylogeny due to this problem.

All the best,

Gavin

gavinmdouglas commented 2 years ago

See this related issue in VirSorter: https://githubhot.com/repo/jiarong/VirSorter2/issues/81

AstrobioMike commented 2 years ago

Hey there, Gavin :)

Thanks so much for the kind words and for letting me know about this!

Sorry you had to track things down, I'm very grateful you did though!

I will for sure go through and add in catches for intermediate steps like this, that is terrible this can happen right now

For this particular issue, since it's a hard-coded HMMER limitation like you mentioned (and noted by them here: https://github.com/EddyRivasLab/hmmer/issues/244), I'm thinking of putting in a filtering step to remove any proteins this big prior to running the hmm searching step. That way that genome as a whole still goes through. Does that sound good to you?

Thanks again for figuring out what was going on and writing in about this!

-Mike

gavinmdouglas commented 2 years ago

Hi Mike,

That sounds like a very reasonable change to remove those proteins. I wasn't sure if GToTree was calling proteins or if that was something hmmsearch was doing, but definitely that sounds great to me.

Cheers,

Gavin

AstrobioMike commented 2 years ago

This is addressed as of version 1.6.32, which is updated in conda 👍

Thanks again, Gavin!

gavinmdouglas commented 2 years ago

That must be a record! Thanks for the super fast support