Open blahah opened 10 years ago
BLAST uses k=11 for nucleotide (alphabet size 4) and k=3 for protein (alphabet size 22) by default. Bear in mind there are also ambiguous characters (e.g. N in nucleotides, X in amino acids).
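To put those two defaults side by side, here is a rough sketch of how many distinct k-mers each setting allows. It uses the alphabet sizes quoted above and ignores ambiguous characters; `kmer_space` is just an illustrative helper name, not something from BLAST.

```python
def kmer_space(alphabet_size: int, k: int) -> int:
    """Number of distinct words of length k over an alphabet of the given size."""
    return alphabet_size ** k


# Alphabet sizes and k values as quoted in this thread for BLAST.
nucleotide_words = kmer_space(4, 11)
protein_words = kmer_space(22, 3)

print(f"nucleotide, k=11: {nucleotide_words:,} possible words")
print(f"protein,    k=3:  {protein_words:,} possible words")
```

The nucleotide space comes out a few hundred times larger than the protein one, which is one way to see why the two defaults are not directly comparable.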
USEARCH uses k=8 for nucleotide, k=5 for protein (http://bioinformatics.oxfordjournals.org/content/suppl/2010/08/11/btq461.DC1/supp_mat_rev2.pdf)
BLAT uses k=11 for nucleotides, k=4 for protein (http://genome.ucsc.edu/FAQ/FAQblat.html#blat1)
CD-HIT uses k=5 for protein (cd-hit) and k=10 for DNA (cd-hit-est) (http://weizhongli-lab.org/cd-hit/wiki/doku.php?id=cd-hit_user_guide). A nice touch is that the docs explain which k values are usable at which %identity thresholds.
There is a new open-source alternative to UCLUST on GitHub but I can't remember the name or find it!
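The link between k and %identity can be illustrated with a simple pigeonhole argument. This is only a sketch and not the formula CD-HIT uses: it assumes an ungapped alignment with substitutions only, and `guaranteed_shared_k` is a hypothetical helper name. With m mismatches in an alignment of L columns, the L - m matching columns fall into at most m + 1 runs, so at least one run has length ceil((L - m) / (m + 1)).

```python
def guaranteed_shared_k(length: int, mismatches: int) -> int:
    """Largest k for which an ungapped alignment of `length` columns with
    `mismatches` substitutions is guaranteed to contain an exact k-mer match.

    Assumption: substitutions only, no indels.
    """
    if not 0 <= mismatches <= length:
        raise ValueError("mismatches must be between 0 and length")
    matches = length - mismatches
    runs = mismatches + 1
    # Integer ceiling division: the longest run is at least this long.
    return -(-matches // runs)


# 100 aligned columns at 95%, 90% and 80% identity.
for m in (5, 10, 20):
    print(m, guaranteed_shared_k(100, m))
```

Under these assumptions, a pair at 90% identity over 100 columns is only guaranteed to share a 9-mer, so a larger k can miss it if the mismatches happen to be evenly spread.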
MEGABLAST uses k=28 (DNA only).
It also has options for discontiguous seeds.
(http://blast.ncbi.nlm.nih.gov/blast/discontiguous.shtml)
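For anyone unfamiliar with discontiguous seeds, here is a minimal sketch of the idea. The mask below is a short made-up example that skips every third position (as a coding-style template would), not one of the actual templates MEGABLAST ships with, and `spaced_seeds` is a hypothetical helper.

```python
from typing import Iterator, Tuple


def spaced_seeds(seq: str, mask: str) -> Iterator[Tuple[int, str]]:
    """Yield (start, seed) for every window of len(mask) in seq,
    keeping only the characters where the mask has a '1'."""
    keep = [i for i, flag in enumerate(mask) if flag == "1"]
    for start in range(len(seq) - len(mask) + 1):
        window = seq[start:start + len(mask)]
        yield start, "".join(window[i] for i in keep)


# Illustrative mask only: ignore every third (wobble) position.
MASK = "110110110"

a = "ATGGCCAAA"
b = "ATCGCTAAG"  # differs from `a` at every third position

seeds_a = list(spaced_seeds(a, MASK))
seeds_b = list(spaced_seeds(b, MASK))
print(seeds_a, seeds_b)
```

The two sequences share no contiguous word longer than 2, yet they produce the same spaced seed, which is the point of allowing "don't care" positions.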
@tseemann thanks for the added info :D
> There is a new open-source alternative to UCLUST

Thinking of https://github.com/torognes/vsearch?
Yes, that's the one! Thanks - I just made a Homebrew package for it.
Collect choices of k for nucleotide and amino acid searches from the literature.
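As a starting point, the defaults reported so far in this thread could be collected in one place. This is just a sketch: the values are copied from the comments here and have not been re-checked against current releases of each tool, and `DEFAULT_K` and `k_range` are illustrative names.

```python
from typing import Dict, Optional, Tuple

# Default k per tool, as reported in this thread (None = not applicable).
DEFAULT_K: Dict[str, Dict[str, Optional[int]]] = {
    "BLAST": {"nucleotide": 11, "protein": 3},
    "USEARCH": {"nucleotide": 8, "protein": 5},
    "BLAT": {"nucleotide": 11, "protein": 4},
    "CD-HIT": {"nucleotide": 10, "protein": 5},
    "MEGABLAST": {"nucleotide": 28, "protein": None},  # DNA only
}


def k_range(kind: str) -> Tuple[int, int]:
    """Smallest and largest reported default k for 'nucleotide' or 'protein'."""
    values = [ks[kind] for ks in DEFAULT_K.values() if ks[kind] is not None]
    return min(values), max(values)


print("nucleotide:", k_range("nucleotide"))
print("protein:   ", k_range("protein"))
```

New tools can be added as extra entries once their defaults are confirmed from their own docs.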