genometools / genometools

GenomeTools genome analysis system.
http://genometools.org

Aborted (core dumped) with LTR harvest #999

Open xiaoxiaonao opened 2 years ago

xiaoxiaonao commented 2 years ago

Problem description

While using LTRharvest this error pops up:

Assertion failed: (refrng.start <= boundaries->leftLTR_5), function gt_removeoverlapswithlowersimilarity, file src/ltr/ltrharvest_stream.c, line 1222.
This is a bug, please report it at
https://github.com/genometools/genometools/issues
Please make sure you are running the latest release which can be found at
http://genometools.org/pub/
You can check your version number with `gt -version`.
Aborted (core dumped)

Exact command line call triggering the problem

gt suffixerator -db Ps_genome.part-05.fasta  -indexname Ps_genome.part-05 -tis -suf -lcp -des -ssp -sds -dna

After creating the index, submit the following command:

gt ltrharvest -index Ps_genome.part-05 -minlenltr 100 -maxlenltr 3000 -similar 80 -gff3 Ps_genome.part-05_inner.fa > Ps_genome.part-05_harvest.scn

Example minimal input triggering the problem

What GenomeTools version are you reporting an issue for (as output by gt -version)?

gt (GenomeTools) 1.6.2
Copyright (c) 2003-2016 G. Gremme, S. Steinbiss, S. Kurtz, and CONTRIBUTORS
Copyright (c) 2003-2016 Center for Bioinformatics, University of Hamburg
See LICENSE file or http://genometools.org/license.html for license details.

Used compiler: cc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Compile flags: -g -Wall -Wunused-parameter -pipe -fPIC -Wpointer-arith -Wno-unknown-pragmas -O3 -Werror

Did you compile GenomeTools from source? If so, please state the make parameters used.

What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?

CentOS Linux 8 (Core)

satta commented 2 years ago

Hi, thanks for reporting this. To properly reproduce the error and determine the root cause, though, I need the input sequence you used (Ps_genome.part-05.fasta). It would be great if you could provide this file, or, alternatively, a snippet of this sequence that triggers this issue without having to reveal too much of your input.

I also noticed that your value for maxlenltr is very large (3000). Could you try to also adjust mindistltr to account for that and to prevent overlapping LTRs? In this case it should be at least 3000. If that helps, then maybe LTRharvest should check this condition at the start.
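The check suggested above can be sketched as a small pre-flight test in a wrapper script. This is only an illustration: `check_ltr_params` is a hypothetical helper, not part of GenomeTools, and the condition it tests (mindistltr at least as large as maxlenltr) is simply the rule of thumb stated in the comment above.

```shell
# Hypothetical pre-flight check (not part of GenomeTools): refuse to start
# when the minimum LTR distance is smaller than the maximum LTR length,
# because candidate LTRs could then overlap.
check_ltr_params() {
  maxlenltr=$1
  mindistltr=$2
  if [ "$mindistltr" -lt "$maxlenltr" ]; then
    echo "mindistltr ($mindistltr) < maxlenltr ($maxlenltr): LTRs may overlap" >&2
    return 1
  fi
  return 0
}

# Values from this report, with mindistltr raised to 3000 as suggested.
if check_ltr_params 3000 3000; then
  echo "parameters ok"
  # gt ltrharvest -index Ps_genome.part-05 -minlenltr 100 -maxlenltr 3000 \
  #   -mindistltr 3000 -similar 80 > Ps_genome.part-05_harvest.scn
fi
```

The command in the original report did not set `-mindistltr` at all, so a wrapper like this would have to be given the value that LTRharvest actually uses in that case.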

xiaoxiaonao commented 2 years ago

Dear Sascha, attached is the input file (Ps_genome.part-05.fasta.gz). The file is over 2 GB when unzipped.

Large attachment sent from Tencent Enterprise Mail

Ps_genome.part-05.fasta.gz (586.6M, expires 2022-02-03 19:25) Download page: http://mail.qq.com/cgi-bin/ftnExs_download?t=exs_ftn_download&k=353934376f586fe23f1621ab45660e0b4a0b555656550b5b561403545153140d550b061a5b52095c480a5654525408005d01565306662339354a6b50070856540017445610121409501752561112581702433409&code=e947bf99&fid=72/2aa432b3-7c35-4022-940e-3bc021988bdd

satta commented 2 years ago

Thanks, I downloaded the file and will try to reproduce the issue. LTRharvest has been running for quite a while now... Have you masked all short and tandem repeats before running LTRharvest? Otherwise the number of seed hits will explode, unnecessarily blowing up the run time.

xiaoxiaonao commented 2 years ago

I have not masked any short or tandem repeats before running LTRharvest. The error was reported after two weeks of run time. It is also difficult to annotate the tandem repeats because of their length.

satta commented 2 years ago

Ouch, I see. Two weeks -- LTRharvest definitely should never run that long! I would strongly advise at least using RepeatMasker to mask low-complexity repeats in the source. Running LTRharvest on the raw sequence is not recommended if it contains many long instances of such repeats. With the default seed size of 30, these lead to a huge number of candidate pairs that have to be evaluated, which excessively inflates the run time. You likely need to prepare the input sequence a bit.
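A rough sketch of the preparation step described above: mask first, then index and harvest the masked file. Several things here are assumptions rather than anything confirmed in this thread: that RepeatMasker is installed, that its `-noint` option (mask only low-complexity and simple repeats) is the right level of masking for this genome, that the hard-masked output is written next to the input as `<input>.masked`, and that the masked bases (N) no longer produce seed hits. The `run` and `mask_and_harvest` helpers are hypothetical. By default the script only prints the commands (`DRY_RUN=1`).

```shell
# Print the command (DRY_RUN=1, the default here) or execute it (DRY_RUN=0).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Hypothetical helper: mask -> index -> harvest for one FASTA file.
mask_and_harvest() {
  fasta=$1
  masked="${fasta}.masked"          # assumed RepeatMasker output name
  index="${fasta%.fasta}_masked"    # index name for the masked sequence
  run RepeatMasker -noint "$fasta"
  run gt suffixerator -db "$masked" -indexname "$index" \
      -tis -suf -lcp -des -ssp -sds -dna
  run gt ltrharvest -index "$index" -minlenltr 100 -maxlenltr 3000 \
      -mindistltr 3000 -similar 80
}

mask_and_harvest Ps_genome.part-05.fasta
```

When run for real, the LTRharvest output goes to standard output and would be redirected to a file as in the original command.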

My suggestion:

Regarding the original error: I am afraid I will not be able to run the software for two weeks each time I need to reproduce the error as I don't have a compute farm at my disposal any more. Is there any way you could come up with a smaller sequence stretch that triggers the issue?
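One way to look for a smaller stretch is to cut coordinate windows out of the input and rerun the same two commands on each window. The sketch below is a generic awk one-pass extractor, not a GenomeTools feature; `extract_region` is a hypothetical helper, it only looks at the first record of the file, and coordinates are 1-based and inclusive. It streams the file line by line, so it does not need to hold the whole sequence in memory.

```shell
# Print bases START..END (1-based, inclusive) of the first FASTA record.
# Usage: extract_region FILE START END
extract_region() {
  awk -v start="$2" -v end="$3" '
    /^>/ {
      if (++rec > 1) exit                 # stop at the second record
      print ">region_" start "_" end      # new header for the window
      next
    }
    {
      lo = pos + 1                        # first coordinate on this line
      hi = pos + length($0)               # last coordinate on this line
      if (hi >= start && lo <= end) {     # line overlaps the window
        s = (start > lo ? start : lo) - pos
        e = (end < hi ? end : hi) - pos
        printf "%s", substr($0, s, e - s + 1)
      }
      pos = hi
      if (pos >= end) exit                # nothing more to extract
    }
    END { print "" }                      # terminate the sequence line
  ' "$1"
}
```

For example, `extract_region Ps_genome.part-05.fasta 1 50000000 > window1.fasta` would write the first 50 Mbp to a new file that can be indexed and harvested on its own. Halving the window that still triggers the assertion narrows the search, although a window boundary may cut through the very element that causes the failure.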