JinfengChen / cdhit

Automatically exported from code.google.com/p/cdhit
GNU General Public License v2.0

CD-HIT is running forever #18

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run provean.sh, which calls CD-HIT with the options "-i 
/tmp/proveanEv9aLE/tmp.fasta -o /tmp/proveanEv9aLE/cdhit.cluster -c 0.75 -s 0.3 
-n 5 -l 158 -bak 1".
2. CD-HIT prints "comparing sequences from          0  to          0" and hangs forever.
3. The input fasta file is attached.

What is the expected output? What do you see instead?
I expect the normal clustering output, but with this input fasta file I never 
get a result.

What version of the product are you using? On what operating system?
====== CD-HIT version 4.6 (built on Feb  3 2013) ======
Redhat Enterprise Linux 5

Please provide any additional information below.

Original issue reported on code.google.com by kyosh...@nig.ac.jp on 29 Aug 2013 at 4:47

Attachments:

GoogleCodeExporter commented 9 years ago
I have the same problem; it occurs when I add the attached file to my fasta 
files for clustering. I am using cd-hit-est with default parameters except -c 0.95.

Original comment by Hoelzer....@gmail.com on 5 Nov 2013 at 2:43

Attachments:

GoogleCodeExporter commented 9 years ago
I've also experienced this problem with the attached file and the following 
command. The issue is present with versions 4.6 and 4.6.1.

cd-hit-est -i single_transcript_hits.fasta -o testout.fasta -d 0 -c 1.0 -p 1

Original comment by kweitem...@gmail.com on 15 Jul 2014 at 4:52

Attachments:

GoogleCodeExporter commented 9 years ago
The problem is that CD-HIT's new (as of 4.6) way of computing the maximum 
number of sequences in Options::ComputeTableLimits (in cdhit-common.c++) 
doesn't account for huge values of max_entries. The new calculation can 
produce staggeringly large max_entries values. The code then computes 
max_sequences as a ratio with max_entries in the denominator; given the 
colossal denominator, the result truncates to an effective value of 0, and the 
code never checks whether max_sequences ended up as 0.

This causes an infinite loop later in the program (SequenceDB::DoClustering, 
the loop at line 2937): max_seqs is 0, so the first while loop never executes, 
meaning m is never incremented.

The attached patch fixes it. I don't know whether it matches the author's 
original intention, but it works.

Also, a note to the author: you changed the code to pass min_len to that 
function instead of NAAN, but then you never use it! Perhaps that's the source 
of the problem?

Original comment by inve...@ebi.ac.uk on 22 Jul 2014 at 11:27

Attachments:

GoogleCodeExporter commented 9 years ago
Note that the author has published a minor patch (with small changes to 
improve handling of long sequences); it is available here:
https://code.google.com/p/cdhit/source/detail?r=784a6f1b5e1175722462afbb0906c77e85dfa556

I applied the correction above (post #3) to this new version of cdhit-common.c++. 
Remember to run "make" again in the cd-hit directory after applying these corrections.

Original comment by t.dugede...@gmail.com on 8 Aug 2014 at 3:13

GoogleCodeExporter commented 9 years ago
Hi. I am still seeing the infinite loop ("comparing sequences from 0 to 0"). 
Are there any other changes that could help? I looked at the diff files and 
recompiled with those changes.

Original comment by solar...@uci.edu on 13 Feb 2015 at 11:24