Parietal-INRIA / scikit-learn-magazine

Temporary private repo for collaborative work on scikit-learn SigMobile entry

A small NLP example #1

Closed GaelVaroquaux closed 10 years ago

GaelVaroquaux commented 10 years ago

We need a small NLP example, maybe something adapted and simplified from the 20 newsgroups examples in scikit-learn. It should be as simple as possible, really the bare minimum, as people will get lost very quickly.

Maybe @ogrisel and @larsmans can help here, as they are our NLP experts.

larsmans commented 10 years ago

Here's a sentiment polarity classifier that I have running on a couple of servers. It computes the probability that a movie review is positive.

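import os
import tarfile
from urllib.request import urlretrieve
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Download the review polarity tarball to a temporary file and unpack it into data_dir.
data_dir = 'movie_reviews'
archive, _ = urlretrieve('http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz')
with tarfile.open(archive) as tar:
    tar.extractall(path=data_dir)
data = load_files(os.path.join(data_dir, 'txt_sentoken'))

clf = make_pipeline(TfidfVectorizer(min_df=2, dtype=float,
                                    sublinear_tf=True, ngram_range=(1, 2),
                                    strip_accents='unicode'),
                    LogisticRegression(random_state=623, C=5000))
clf.fit(data.data, data.target)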

(The tarball unpacking is not necessary for the example.) I'll look up the CV accuracy for this thing.

larsmans commented 10 years ago

In [13]: cross_val_score(clf, data.data, data.target, cv=5)
Out[13]: array([ 0.91  ,  0.8825,  0.88  ,  0.8775,  0.86  ])
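To try the cross-validation step without downloading anything, here is a minimal sketch on a made-up toy corpus. The texts, labels, and default pipeline settings are assumptions for illustration only, not the movie review data or the tuned parameters above, so the scores it produces say nothing about the real model. It imports `cross_val_score` from `sklearn.model_selection`, which is where recent scikit-learn releases keep it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (NOT the movie review data): 1 = positive, 0 = negative.
positive = ["a wonderful film", "great acting and a great story",
            "I loved every minute", "brilliant and moving", "an excellent movie"]
negative = ["a terrible film", "bad acting and a dull story",
            "I hated every minute", "boring and pointless", "an awful movie"]
texts = (positive + negative) * 2          # 20 documents, 10 per class
labels = ([1] * 5 + [0] * 5) * 2

# Default settings, chosen only to keep the sketch short.
toy_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())

# For a classifier, an integer cv means stratified k-fold: one accuracy per fold.
scores = cross_val_score(toy_clf, texts, labels, cv=5)
print(scores, scores.mean())
```

Each entry in `scores` is the accuracy on one held-out fold, so the five numbers in `Out[13]` above are the per-fold accuracies on the real dataset.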

Demo:

>>> clf.predict_proba(["This movie is the worst I ever saw."])[0, 0]  # negative probability
0.99929299972040175
>>> clf.predict_proba(["Shawshank Redemption, eat your heart out!"])[0, 1]  # positive probability
0.83124224591605844
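The column indices in the demo work because `load_files` numbers the class folders in sorted order, so with folders named `neg` and `pos` column 0 is negative and column 1 is positive. A small sketch of looking the columns up by name instead of by position, again on hypothetical toy data (the `target_names` list here just mimics what `load_files` would return for this dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; target_names mimics load_files on folders "neg" and "pos".
target_names = ["neg", "pos"]
texts = ["a wonderful film", "an excellent movie", "I loved it",
         "a terrible film", "an awful movie", "I hated it"]
labels = [1, 1, 1, 0, 0, 0]

toy_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_clf.fit(texts, labels)

# predict_proba columns are ordered like toy_clf.classes_, so pair them up explicitly.
proba = toy_clf.predict_proba(["what a wonderful movie"])[0]
by_name = {target_names[c]: p for c, p in zip(toy_clf.classes_, proba)}
print(by_name)
```

Indexing through `classes_` keeps the lookup correct even if the label encoding changes.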

GaelVaroquaux commented 10 years ago

That's a cool example. I like it a lot. Thanks a lot, @larsmans!

GaelVaroquaux commented 10 years ago

Works as advertised on my box. Definitely awesome!