chkaich / cdhit

Automatically exported from code.google.com/p/cdhit
GNU General Public License v2.0

-d parameter does not change length of description in .clstr file #4

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. choose a FASTA file with long headers as infile.fasta
2. run 'cd-hit -i infile.fasta -o outfile.fasta'
3. run 'cd-hit -i infile.fasta -o outfile2.fasta -d 100'
4. compare outfile.fasta.clstr with outfile2.fasta.clstr (e.g. on Linux: 'diff outfile{'',2}.fasta.clstr')

What is the expected output? What do you see instead?
Because of the -d 100 parameter, outfile2.fasta.clstr should contain a larger 
part of each header instead of just the first few characters.
Instead, both outfile.fasta.clstr and outfile2.fasta.clstr contain only a very 
short part of each header, e.g.:

>Cluster 0
0       69aa, >1cc1_L... *
>Cluster 1
0       61aa, >2wpn_B... *

for these infile.fasta entries:
>1cc1_L Hydrogenase (large subunit); NI-Fe-Se hydrogenase, oxidoreduct (111-179:497)
QSHILHFYHLAALDYVKGPDVSPFVPRYANADLLTDRIKDGAKADATNTYGLNQYLKALEIRRICHEMV
>2wpn_B Periplasmic [nifese] hydrogenase, large subunit, selenocystein (116-176:494)
QSHILHFYHLSAQDFVQGPDTAPFVPRFPKSDLRLSKELNKAGVDQYIEALEVRRICHEMV
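
A quick way to check how much of each header actually ends up in a .clstr file is to pull the stored descriptions out of the member lines and look at their lengths. The following is a minimal sketch that assumes member lines look like the two shown above (index, length with an `aa` or `nt` suffix, then `>description...`). The helper name `clstr_descriptions` is made up for this illustration and is not part of cd-hit.

```python
import re

# Matches one member line of a .clstr file, e.g. "0\t69aa, >1cc1_L... *".
# Assumption: the layout is the one shown in the report above.
MEMBER = re.compile(r"^\d+\s+\d+(?:aa|nt), (>.*?)\.\.\.")


def clstr_descriptions(text):
    """Return the description stored for each cluster member, in file order."""
    found = []
    for line in text.splitlines():
        match = MEMBER.match(line)
        if match:  # ">Cluster N" lines do not match and are skipped
            found.append(match.group(1))
    return found


sample = ">Cluster 0\n0\t69aa, >1cc1_L... *\n>Cluster 1\n0\t61aa, >2wpn_B... *\n"
descriptions = clstr_descriptions(sample)
print(descriptions)                        # ['>1cc1_L', '>2wpn_B']
print(max(len(d) for d in descriptions))   # 7
```

Running this on both outfile.fasta.clstr and outfile2.fasta.clstr should give identical results while the problem is present, since neither file contains more than the part of the header before the first space.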

What version of the product are you using? On what operating system?
CD-HIT version 4.5.4 (built on May 31 2011)
Kubuntu 11.10 64bit

Original issue reported on code.google.com by klaus.ko...@gmail.com on 1 Feb 2012 at 11:58

GoogleCodeExporter commented 9 years ago
The -d feature has to be the most frustrating parameter of cd-hit. I use cd-hit 
a lot in a number of applications. Not printing, by default, at least the 
sequence description up to the first base (instead of a default of 20, why??) 
makes no sense to me. Not allowing the researcher to print the whole line 
makes no sense to me either. And if there are spaces in the sequence name, the 
-d setting seems to be ignored: regardless of the value of -d, cd-hit only 
seems to print up to the first space. Please consider changing this behavior. 

Original comment by mattsett...@gmail.com on 17 Feb 2013 at 4:20

GoogleCodeExporter commented 9 years ago

Original comment by daoko...@gmail.com on 27 Mar 2013 at 11:52

GoogleCodeExporter commented 9 years ago
After some code diving, I've come up with a patch.
The FASTA entry description is stored incorrectly (only up to the first 
whitespace, as opposed to the first newline).
One line needs to change.

I've tested it and it now respects the -d switch.
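
To illustrate why -d has no visible effect when the description is stored only up to the first whitespace, here is a small Python model of the two behaviours. It is not cd-hit's code (cd-hit is written in C++) and it is not the attached patch. The function names, the `up_to_newline` flag, and the unconditional `...` suffix are assumptions made for this sketch.

```python
def stored_description(header_line, up_to_newline):
    """Model of what gets kept from a FASTA header line when it is read."""
    line = header_line.rstrip("\n")
    if up_to_newline:
        return line                   # patched behaviour: keep the whole line
    return line.split(None, 1)[0]     # reported behaviour: stop at first whitespace


def clstr_label(header_line, d, up_to_newline):
    """Cut the stored description to at most d characters, as -d is expected to do."""
    return stored_description(header_line, up_to_newline)[:d] + "..."


header = (">1cc1_L Hydrogenase (large subunit); NI-Fe-Se hydrogenase, "
          "oxidoreduct (111-179:497)\n")

# With the description stored only up to the first whitespace, -d cannot help:
print(clstr_label(header, 20, up_to_newline=False))    # >1cc1_L...
print(clstr_label(header, 100, up_to_newline=False))   # >1cc1_L...

# With the whole line stored, -d 100 is enough to show this entire header:
print(clstr_label(header, 100, up_to_newline=True))
```

In this model the truncation to d characters is applied after the header has already been cut at the first space, so any d larger than the sequence identifier gives the same label. That matches the `>1cc1_L...` output in the original report.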

Original comment by vladv...@gmail.com on 5 Oct 2013 at 2:55

Attachments: