bbengfort / bbengfort.github.io

My Github Pages Repository
http://bbengfort.github.io/

Text Classification with NLTK and Scikit-Learn #4

Closed bbengfort closed 7 years ago

bbengfort commented 8 years ago

This is a simple text classification blog post - quick and easy!

AtomicSpider commented 7 years ago

Hi, First of all, thanks for the code. It's really helpful.

I'm getting an error while training on a custom dataset.

Error:

vectorizer = model.named_steps['vectorizer']
AttributeError: 'tuple' object has no attribute 'named_steps'

Dataset: X:

['What is the name of this asset?',
 'What is this asset called?',
 'Is the asset healthy?',
 "How is the asset's health?",
 'Show me asset info.',
 'Show me asset information.',
 'Tell me about the asset.',
 'Show me asset history.']

Y:

['asset_name',
 'asset_name',
 'asset_health',
 'asset_health',
 'asset_info',
 'asset_info',
 'asset_info',
 'asset_history']
bbengfort commented 7 years ago

Well, that error is telling you that your model is a tuple, not a Pipeline object - make sure you instantiate a pipeline, something to the effect of:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

model = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])
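With an actual Pipeline, the named_steps lookup from the traceback works as expected. A quick sketch continuing from the snippet above and the toy X/Y lists earlier in this thread:

model.fit(X, Y)                                   # fit on the question/intent lists above
vectorizer = model.named_steps['vectorizer']      # no AttributeError on a Pipeline
print(model.predict(['Show me the asset history.']))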
mjahanshahi commented 7 years ago

Hi Ben!

I'm having trouble unpickling a model that I tried with this classifier. I am a python newbie but I suspect I need to pickle the class as well? Can you suggest how I could do that?

bbengfort commented 7 years ago

@mjahanshahi just pickle the entire object:

import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# NLTKPreprocessor and identity are defined in the blog post
model = Pipeline([
    ('preprocessor', NLTKPreprocessor()),
    ('vectorizer', TfidfVectorizer(
        tokenizer=identity, preprocessor=None, lowercase=False
    )),
    ('classifier', MultinomialNB()),
])

model.fit(docs, labels)

with open('bayes_model.pkl', 'wb') as f:
    pickle.dump(model, f)
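Loading it back later would then look something like the following (a sketch: it assumes NLTKPreprocessor and identity are importable in the environment doing the loading):

with open('bayes_model.pkl', 'rb') as f:
    model = pickle.load(f)

print(model.predict(['a new document to classify']))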
mjahanshahi commented 7 years ago

Thanks @bbengfort. I keep getting the following error message when I unpickle using your directions:

AttributeError: Can't get attribute 'NLTKPreprocessor' on <module '__main__'>

Does the class need to be pickled somehow?

mjahanshahi commented 7 years ago

I just realized what I was doing wrong. I was trying to load the model in a different environment (different python script) so none of the components were being inherited? I was doing this because I wanted to run some side by side tests of different classifiers but I think I know better now.

Thanks for your reply though!

bbengfort commented 7 years ago

Ok, glad you figured it out!
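For anyone else hitting the Can't get attribute 'NLTKPreprocessor' error: pickle stores a reference to the class by its module path, so the class has to be importable under that same path when the model is loaded. The usual fix is to define NLTKPreprocessor (and identity) in a regular module rather than in the training script, and import them from that module both when pickling and when unpickling. A minimal sketch, assuming a hypothetical module named preprocess.py holds those definitions:

# predict.py - a different script from the one that trained the model
import pickle

# make the custom class and tokenizer importable before unpickling
from preprocess import NLTKPreprocessor, identity

with open('bayes_model.pkl', 'rb') as f:
    model = pickle.load(f)

print(model.predict(['Is the asset healthy?']))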


shadowcode92 commented 7 years ago

Hi Sir, thank you for this code. I got one error during prediction:

print(model.named_steps['classifier'].labels_.inverse_transform(yhat))
AttributeError: 'SGDClassifier' object has no attribute 'labels_'

Could you please help me find a solution to this?

bbengfort commented 7 years ago

When you look at the build_and_evaluate function, there is a line of code:

model, secs = build(classifier, X, y)
model.labels_ = labels

That happens after the model build is complete. This attribute is the LabelEncoder object, which allows for the inverse transform; it seems like this step is missing from your code if there is no attribute labels_.
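Put differently, the pattern in build_and_evaluate is roughly the following sketch (build, classifier, X and y are the names used in the blog post; X_test stands in for whatever documents you want to predict on):

from sklearn.preprocessing import LabelEncoder

# encode the string labels as integers for training
labels = LabelEncoder()
y = labels.fit_transform(y)

model, secs = build(classifier, X, y)
model.labels_ = labels                 # stash the encoder on the fitted model

yhat = model.predict(X_test)
print(model.labels_.inverse_transform(yhat))   # recover the original string labels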

drorata commented 6 years ago

@bbengfort Very nice post! One remark/question: what do you mean by an identity function? Would it be something like lambda x: x? Correct me if I'm wrong, but you need this because the documents in the corpus are already preprocessed with NLTKPreprocessor, right?

bbengfort commented 6 years ago

@drorata that's correct on all counts! lambda x: x is a good example of an identity function and the reason you need to specify this is because your text is already preprocessed, so the identity function is called by Scikit-Learn and passes the tokens right through to be vectorized.
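For reference, a minimal sketch of that named identity function and how it plugs into the vectorizer (a def rather than a lambda also pickles cleanly):

from sklearn.feature_extraction.text import TfidfVectorizer

def identity(words):
    # the documents are already tokenized by NLTKPreprocessor,
    # so just pass the tokens straight through
    return words

vectorizer = TfidfVectorizer(
    tokenizer=identity, preprocessor=None, lowercase=False
)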

njanmo commented 6 years ago

Hi, thanks for this awesome intro to NLTK. I was wondering if there is a simple way to adapt this to process tweets from the twitter_samples corpus / other corpora? (Apologies for the basic question, I am a Python newbie.)

bbengfort commented 6 years ago

Yes, in the step that says from nltk.corpus import movie_reviews as reviews - simply import the corpus you'd like to use; make sure it has a .raw() method -- this returns the string containing the text and you should be good to go.
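For tweets specifically, the twitter_samples corpus stores its data as JSON rather than plain text categories, so a rough sketch (not the post's exact code) of building the documents and labels might look like:

from nltk.corpus import twitter_samples   # requires nltk.download('twitter_samples')

X, y = [], []
for fileid, label in [('positive_tweets.json', 'pos'),
                      ('negative_tweets.json', 'neg')]:
    # strings() yields the text of each tweet in the file
    for tweet in twitter_samples.strings(fileid):
        X.append(tweet)
        y.append(label)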

roomm commented 6 years ago

Same error as @shadowcode92:

print(model.named_steps['classifier'].labels_.inverse_transform(yhat))
AttributeError: 'SGDClassifier' object has no attribute 'labels_'

and the line model.labels_ = labels is present.

bbengfort commented 6 years ago

@roomm please see my response to @shadowcode92 -- there is a line of code in the build_and_evaluate function that assigns the labels_ attribute to keep track of the LabelEncoder. This is just a handy shortcut; you can use labels.inverse_transform(yhat) directly if you'd prefer.