JinfengChen / cdhit

Automatically exported from code.google.com/p/cdhit
GNU General Public License v2.0

clstr_merge.pl only partial merge? #9

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. file1.clstr = 7.0 GB, about 143 million DNA sequence reads clustered into
41,040,442 clusters
2. file2.clstr = 3.6 GB, output from CD-HIT-EST-2D, adding 21 million "shared"
reads (the "new, unique" reads are in another file)
3. clstr_merge.pl file1.clstr file2.clstr > mergefile.clstr

What is the expected output? What do you see instead?
The expected output is a file with 164 million reads. Instead, I get a file
representing about 150 million reads, so about 14 million of the 21 million
reads from file2 have gone missing. The mergefile.clstr is 7.3 GB.
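
A quick way to double-check these totals is to count the cluster headers and member lines in each .clstr file. The sketch below is not part of CD-HIT. It assumes the usual .clstr layout (one header line starting with ">" per cluster, followed by one line per read), and the name count_clstr is made up for illustration.

```python
def count_clstr(lines):
    """Return (clusters, members) for an iterable of .clstr lines.

    Assumption: header lines start with ">" and every other
    non-empty line describes exactly one read.
    """
    clusters = 0
    members = 0
    for line in lines:
        if line.startswith(">"):
            clusters += 1
        elif line.strip():
            members += 1
    return clusters, members


# Tiny in-memory sample in the assumed layout.
SAMPLE = [
    ">Cluster 0\n",
    "0\t188nt, >readA... *\n",
    "1\t188nt, >readB... at +/100.00%\n",
    ">Cluster 1\n",
    "0\t188nt, >readC... *\n",
]

print(count_clstr(SAMPLE))  # (2, 3)
```

For the real files, pass an open file handle instead of the sample list, e.g. `with open("mergefile.clstr") as fh: print(count_clstr(fh))`. The file is read line by line, so memory use stays small even at 7 GB. If the merge were complete, the member count of mergefile.clstr should equal that of file1.clstr plus the number of new reads in file2.clstr.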

What version of the product are you using? On what operating system?
CD-HIT 4.5.4 on Linux Mint 12; the computer has two quad-core processors and
30 GB of available RAM

Please provide any additional information below.

I tried to trace the source of the error, but I am not sure I found it.

At some point there is a run of about 1,100 clusters in file2.clstr to which
no new sequence was added, i.e. a series of entries consisting of a
>Cluster24966989 line followed by a line with
0 188nt, >seqID_same_as_in_file1.clstr... * and no following line(s) with
1 188nt, >seqID_new_from_file2.clstr... +/100.00%. When clusters that do
include seqIDs from the new dataset finally appear again, those seqIDs are not
added to the merged file. At least from that point onwards, when I compare
file1, file2 and mergefile, I do not see any sequence from file2 that has
been added.
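
To find exactly which reads were dropped, and in which file2 cluster the loss starts, the comparison can be done in a script rather than by eye. This is a minimal sketch under assumptions: the member lines look like `0<TAB>188nt, >seqID... *`, the sequence ID is whatever sits between ">" and "...", and the 21 million IDs from file2 fit in RAM as a dictionary. The helper names iter_members and missing_from_merge are hypothetical.

```python
import re

# Assumed member-line shape: "<n>\t<len>nt, >seqID... <status>"
ID_RE = re.compile(r">(.+?)\.\.\.")


def iter_members(lines):
    """Yield (cluster_header, seq_id) for every member line."""
    header = None
    for line in lines:
        if line.startswith(">"):
            header = line.strip()
        else:
            match = ID_RE.search(line)
            if match:
                yield header, match.group(1)


def missing_from_merge(file2_lines, merged_lines):
    """Map each seq ID present in file2 but absent from the merge
    to the file2 cluster header it was found under."""
    wanted = {}
    for header, seq_id in iter_members(file2_lines):
        wanted[seq_id] = header
    for _, seq_id in iter_members(merged_lines):
        wanted.pop(seq_id, None)
    return wanted
```

Sorting the returned cluster headers by their number would show whether the losses begin only after the run of unchanged clusters or already earlier in the file.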

I suspect that some sequences have been omitted in earlier parts too, but the
files are too big to trace systematically whether and where additional
omissions occurred in the earlier part of the file. Since only about 6 million
sequences have been added to the first 24 million clusters, I am rather
sceptical that the missing 14 million reads all belong in the last 17 million
clusters.
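
One way to see where reads were actually added is to compare cluster sizes in file1.clstr and mergefile.clstr and sum the growth over blocks of clusters. The sketch below assumes that the merged file lists the same clusters in the same order as file1.clstr, which I have not verified for clstr_merge.pl. The names cluster_sizes and added_per_bin are made up for illustration.

```python
def cluster_sizes(lines):
    """Yield the number of member lines of each cluster, in file order."""
    size = None
    for line in lines:
        if line.startswith(">"):
            if size is not None:
                yield size
            size = 0
        elif line.strip() and size is not None:
            size += 1
    if size is not None:
        yield size


def added_per_bin(before_lines, after_lines, bin_size):
    """Sum (after - before) cluster sizes over consecutive bins
    of bin_size clusters. Assumes both files list the same
    clusters in the same order."""
    added = []
    pairs = zip(cluster_sizes(before_lines), cluster_sizes(after_lines))
    for index, (before, after) in enumerate(pairs):
        if index % bin_size == 0:
            added.append(0)
        added[-1] += after - before
    return added
```

With a bin size of, say, one million clusters, a sudden drop to zero in the later bins would point to the position where merging stops. Both files are streamed, so only the list of per-bin totals is held in memory.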

Some random checks of clusters, comparing file1, file2 and mergefile, did
not show any abnormalities in the earlier part of the result.

Is there some limitation on the number of lines that clstr_merge.pl can handle?

Original issue reported on code.google.com by hug...@ku.ac.th on 2 Oct 2012 at 10:53