What steps will reproduce the problem?
1. file1.clstr = 7.0 GB, clustering about 143 million DNA sequence reads into
41,040,442 clusters
2. file2.clstr = 3.6 GB, output from CD-HIT-EST-2D, adding 21 million "shared"
reads (the "new, unique" reads are in another file)
3. clstr_merge.pl file1.clstr file2.clstr > mergefile.clstr
What is the expected output? What do you see instead?
The expected output is a file with 164 million reads (143 million from file1
plus 21 million from file2). Instead, I get a file representing about 150
million reads, so about 14 million of the 21 million reads have gone missing.
The mergefile.clstr is 7.3 GB.
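The read totals above can be checked by counting member lines in each .clstr file. This is a minimal Python sketch, assuming the usual .clstr layout in which every cluster starts with a ">Cluster" header line and every other non-empty line is one read. The function name count_clstr is hypothetical and not part of CD-HIT.

```python
def count_clstr(lines):
    """Return (clusters, reads) for an iterable of .clstr lines.

    Assumption: header lines start with ">Cluster"; every other
    non-empty line describes exactly one read.
    """
    clusters = 0
    reads = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">Cluster"):
            clusters += 1
        else:
            reads += 1
    return clusters, reads
```

Because the function only iterates once and keeps two counters, it can be fed an open file handle directly, e.g. `with open("mergefile.clstr") as fh: print(count_clstr(fh))`, without loading a multi-gigabyte file into memory.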
What version of the product are you using? On what operating system?
CD-HIT 4.5.4 on Linux Mint 12. The computer has two quad-core processors and
30 GB of available RAM.
Please provide any additional information below.
I tried to trace the source of the error, but I am not sure I found it.
At some point in file2.clstr there is a run of about 1,100 clusters to which no
new sequence was added (i.e. a series of lines with >Cluster24966989 followed by
a line with 0 188nt, >seqID_same_as_in_file1.clstr... * and no following
line(s) with 1 188nt, >seqID_new_from_file2.clstr... +/100.00%). When clusters
that do include seqIDs from the new dataset finally appear again, those seqIDs
are not added to the merged file. At least from that point onwards, when I
compare file1, file2 and mergefile, I do not see any sequence from file2 that
has been added.
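The comparison described above could be automated roughly as follows. This is only a sketch under stated assumptions: member lines look like "1 188nt, >seqID... +/100.00%", the representative of each cluster is the line ending in "*", and the reads added from file2 are the non-representative members of file2.clstr. The names ID_RE, non_representative_ids and missing_after_merge are hypothetical, not part of CD-HIT.

```python
import re

# Assumption: the sequence ID sits between ">" and the "..." that follows it.
ID_RE = re.compile(r">(.+?)\.\.\.")


def non_representative_ids(lines):
    """Collect IDs of cluster members that are not the representative ("*")."""
    ids = set()
    for line in lines:
        if line.startswith(">Cluster"):
            continue  # cluster header, not a read
        if line.rstrip().endswith("*"):
            continue  # representative sequence, assumed to come from file1
        match = ID_RE.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def missing_after_merge(file2_lines, merged_lines):
    """Return the file2 member IDs that never appear in the merged file."""
    wanted = non_representative_ids(file2_lines)
    for line in merged_lines:
        if line.startswith(">Cluster"):
            continue
        match = ID_RE.search(line)
        if match:
            wanted.discard(match.group(1))
    return wanted
```

Only the file2 member IDs are held in memory (about 21 million strings here), while the merged file is streamed line by line. Whether that fits in the available RAM depends on the length of the IDs.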
I suspect that some sequences have been omitted in earlier parts too, but the
files are too big to systematically trace whether and where there have been
additional omissions in the earlier part of the file. Since only about 6 million
sequences have been added to the first 24 million clusters, I am rather
sceptical that the missing 14 million reads should all have gone into the last
17 million clusters.
Some random checking of clusters comparing the file1, file2 and mergefile did
not show any abnormalities in the earlier part of the result.
Is there some limitation on the number of lines that clstr_merge.pl can handle?
Original issue reported on code.google.com by hug...@ku.ac.th on 2 Oct 2012 at 10:53