biocore / taxster

taxster: assigning taxonomy to organisms you've never even heard of
BSD 3-Clause "New" or "Revised" License

use HashingVectorizer #2

Closed audy closed 10 years ago

audy commented 10 years ago

You've implemented your own k-mer counting.

scikit-learn has a great k-mer counting algorithm called the hashing vectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html

See also: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
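For anyone wondering what that looks like in practice, here is a minimal sketch of k-mer counting with HashingVectorizer. It assumes character n-grams stand in for k-mers; the value of `K`, the `n_features` size, and the example sequences are made up for illustration and are not taken from taxster.

```python
from sklearn.feature_extraction.text import HashingVectorizer

K = 4  # hypothetical k-mer length, chosen only for this example

# analyzer="char" with ngram_range=(K, K) makes every token a k-mer.
# norm=None and alternate_sign=False keep raw, non-negative counts.
vectorizer = HashingVectorizer(
    analyzer="char",
    ngram_range=(K, K),
    n_features=2**12,   # number of hash buckets (assumed size)
    norm=None,
    alternate_sign=False,
    lowercase=False,
)

# Made-up sequences, not real data.
sequences = ["ACGTACGTAC", "TTTTTTTT"]

# HashingVectorizer is stateless, so no fit step is needed.
X = vectorizer.transform(sequences)

print(X.shape)     # one row per sequence, one column per hash bucket
print(X[0].sum())  # total number of k-mers counted in the first sequence
```

Because k-mers are hashed into a fixed number of buckets, two different k-mers can land in the same column. A larger `n_features` makes that less likely at the cost of a wider (still sparse) matrix.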

EDIT: I implemented a naive Bayes classifier using scikit-learn for a presentation not long ago, which you might find useful: https://github.com/audy/presentations/blob/master/02-26-2014-scikit_learn_for_biology/ipython-notebooks/16S%20rRNA%20Classifier%20(Text%20Based).ipynb
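As a rough idea of how the pieces could fit together (this is a sketch under assumptions, not the code from the notebook above or from taxster), hashed k-mer counts can be fed straight into a multinomial naive Bayes model. The sequences, the taxon labels, and the parameter values below are all invented for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: made-up sequences with made-up labels.
train_sequences = ["ACGTACGTACGTACGT", "TTGGCCAATTGGCCAA"]
train_labels = ["taxon_A", "taxon_B"]

# Hash 4-mers into non-negative counts, which MultinomialNB requires.
classifier = make_pipeline(
    HashingVectorizer(
        analyzer="char",
        ngram_range=(4, 4),   # assumed k-mer length
        n_features=2**18,     # assumed number of hash buckets
        norm=None,
        alternate_sign=False,
        lowercase=False,
    ),
    MultinomialNB(),
)

classifier.fit(train_sequences, train_labels)

# Classify a short query sequence (also made up).
predicted = classifier.predict(["ACGTACGT"])
print(predicted)
```

A real classifier would need many labeled reference sequences per taxon; two training examples are only enough to show the shape of the pipeline.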

wasade commented 10 years ago

That's awesome, thanks! Will look deeper into it shortly.

On Mon, Jun 9, 2014 at 11:39 AM, Austin Richardson <notifications@github.com> wrote:

You've implemented your own k-mer counting.

scikit-learn has a great k-mer counting algorithm called the hashing vectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html

See also: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html


audy commented 10 years ago

Closing b/c of #3