blahah / better-blast

An evolving prototype experimenting with some ways to improve on BLAST. If it works, we'll engineer an industrial-strength suite.

Collect historical k-choices #2

Open blahah opened 10 years ago

blahah commented 10 years ago

Collect choices of k for nucleotide and amino acid searches from the literature.

blahah commented 10 years ago

BLAST uses k=11 for nucleotide (alphabet size 4) and k=3 for protein (alphabet size 22) by default. Bear in mind there are also ambiguous characters (e.g. N in nucleotides, X in amino acids).
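For a rough sense of scale, here is a quick sketch of how many distinct words each default implies. It just computes alphabet size to the power k, using the alphabet sizes quoted above and ignoring ambiguous characters; the function and dictionary names are made up for illustration.

```python
# Number of distinct k-letter words over an alphabet of a given size.
# Ambiguity codes (N, X, ...) are ignored here, which is a simplification.
def word_space(alphabet_size: int, k: int) -> int:
    return alphabet_size ** k


# (alphabet size, k) pairs as quoted in the comment above.
blast_defaults = {
    "nucleotide": (4, 11),
    "protein": (22, 3),
}

for name, (alphabet_size, k) in blast_defaults.items():
    size = word_space(alphabet_size, k)
    print(f"{name}: {alphabet_size}^{k} = {size} possible words")
```

So the nucleotide default indexes a far larger word space (about 4.2 million words) than the protein default (about 10 thousand), even though the protein alphabet is bigger.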

blahah commented 10 years ago

USEARCH uses k=8 for nucleotide, k=5 for protein (http://bioinformatics.oxfordjournals.org/content/suppl/2010/08/11/btq461.DC1/supp_mat_rev2.pdf)

blahah commented 10 years ago

BLAT uses k=11 for nucleotides, k=4 for protein (http://genome.ucsc.edu/FAQ/FAQblat.html#blat1)

tseemann commented 9 years ago

CD-HIT uses k=5 for protein (cd-hit) and k=10 for DNA (cd-hit-est) (http://weizhongli-lab.org/cd-hit/wiki/doku.php?id=cd-hit_user_guide). A nice touch is that the docs explain which k values are usable at which %identity thresholds.
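One way to see why k and %identity are linked is a simple counting bound. This is not CD-HIT's own formula, just a back-of-envelope sketch assuming two equal-length sequences that differ by substitutions only (no indels): each mismatch can break at most k overlapping k-mers, so some number of k-mers must survive intact. The function name and the example numbers are illustrative.

```python
import math


def min_shared_kmers(length: int, identity: float, k: int) -> int:
    """Lower bound on k-mers shared by two ungapped, equal-length sequences.

    Assumes substitutions only. Each mismatch overlaps at most k of the
    (length - k + 1) k-mer windows, so the rest must match exactly.
    """
    total_kmers = length - k + 1
    # Largest mismatch count still meeting the identity threshold.
    # round() guards against float noise such as 100 * (1 - 0.9) = 9.999...
    mismatches = math.floor(round(length * (1.0 - identity), 9))
    return max(0, total_kmers - k * mismatches)


for k in (5, 10):
    print(k, min_shared_kmers(100, 0.90, k))
```

When the bound hits zero, two sequences at that identity are no longer guaranteed to share any k-mer at all, which is the intuition for why lower identity thresholds need a smaller k.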

tseemann commented 9 years ago

There is a new open-source alternative to UCLUST on github but I can't remember the name or find it!

tseemann commented 9 years ago

MEGABLAST uses k=28 (DNA only).
It also has options for discontiguous seeds. (http://blast.ncbi.nlm.nih.gov/blast/discontiguous.shtml)

blahah commented 9 years ago

@tseemann thanks for the added info :D

> There is a new open-source alternative to UCLUST

thinking of https://github.com/torognes/vsearch?

tseemann commented 9 years ago

Yes that's the one! Thanks - I just made a homebrew package for it.