Closed GaelVaroquaux closed 10 years ago
Here's a sentiment polarity classifier that I have running on a couple of servers. It computes the probability that a movie review is positive.
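import os
import tarfile
from urllib.request import urlretrieve
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
data_dir = 'movie_reviews'  # assumed extraction directory; not defined in the original snippet
# Download the tarball to a temporary file and unpack it into data_dir.
temp_name, _ = urlretrieve('http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz')
with tarfile.open(temp_name) as tar:
    tar.extractall(path=data_dir)
data = load_files(os.path.join(data_dir, 'txt_sentoken'))
clf = make_pipeline(TfidfVectorizer(min_df=2, dtype=float,
                                    sublinear_tf=True, ngram_range=(1, 2),
                                    strip_accents='unicode'),
                    LogisticRegression(random_state=623, C=5000))
clf.fit(data.data, data.target)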
(The tarball unpacking is not necessary for the example.) I'll look up the CV accuracy for this thing.
In [13]: cross_val_score(clf, data.data, data.target, cv=5)
Out[13]: array([ 0.91 , 0.8825, 0.88 , 0.8775, 0.86 ])
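For a single headline number, the five fold scores above can be averaged. This is only a small sketch using the standard library; the variable names are illustrative, and the scores are copied from the output above rather than recomputed. The mean comes out at about 0.882.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
fold_scores = [0.91, 0.8825, 0.88, 0.8775, 0.86]

mean_accuracy = mean(fold_scores)   # average accuracy over the 5 folds
spread = stdev(fold_scores)         # sample standard deviation across folds

print(f"accuracy: {mean_accuracy:.3f} +/- {spread:.3f}")
```

With only five folds the spread is a rough indication of variability, not a confidence interval.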
Demo:
>>> clf.predict_proba(["This movie is the worst I ever saw."])[0, 0] # negative probability
0.99929299972040175
>>> clf.predict_proba(["Shawshank Redemption, eat your heart out!"])[0, 1] # positive probability
0.83124224591605844
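The column indices in the demo rely on the label order: as far as I know, load_files numbers classes by sorted folder name, so with neg/ and pos/ folders column 0 is negative and column 1 is positive. Here is a minimal sketch of the same kind of pipeline on a toy corpus that needs no download. The sentences, labels, and the name toy_clf are made up for illustration, and min_df=2 is left out because the corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the movie review corpus (made-up sentences).
texts = [
    "a wonderful, moving film",
    "great acting and a great script",
    "I loved every minute",
    "the worst movie I ever saw",
    "dull, boring and far too long",
    "a terrible waste of time",
]
# Assumed convention, mirroring neg/ and pos/ folders: 0 = negative, 1 = positive.
labels = [1, 1, 1, 0, 0, 0]

toy_clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
    LogisticRegression(C=5000, random_state=623),
)
toy_clf.fit(texts, labels)

# classes_ on the final estimator gives the column order of predict_proba.
classes = toy_clf.steps[-1][1].classes_
proba = toy_clf.predict_proba(["the worst film, boring and dull"])

print("column order:", list(classes))
print("P(negative):", proba[0, 0], "P(positive):", proba[0, 1])
```

Checking classes_ before indexing into predict_proba avoids reading the wrong column if the labels are ever encoded differently.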
That's a cool example. I like it a lot. Thanks a lot @larsmans!
Works as advertised on my box. Definitely awesome!
We need a small NLP example. Maybe something adapted and simplified from the 20 newsgroups examples in scikit-learn. It should be as simple as possible, really the bare minimum, as people will get lost very quickly.
Maybe @ogrisel and @larsmans can help here, as they are our NLP experts.