bmurauer / pipelinehelper

scikit-helper to hot-swap pipeline elements
GNU General Public License v3.0

multimodels on text classification #5

Closed ahossanmarc closed 4 years ago

ahossanmarc commented 4 years ago

Hi, I want to build a text classification model, but I am getting errors:

from pipelinehelper import PipelineHelper
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([
    ('count_or_tf', PipelineHelper([
        ('count', CountVectorizer(tokenizer=spacy_tokenizer)),
        ('tfidf', TfidfVectorizer(tokenizer=spacy_tokenizer)),
    ])),
    ('clf', PipelineHelper([
        ('rf', RandomForestClassifier()),
        ('mnb', MultinomialNB()),
        ('svm', SVC()),
    ])),
])

params = {
    'count_or_tf__selected_model': pipe.named_steps['count_or_tf'].generate({
        'count__ngram_range': [(1,1), (2,2), (3,3)],
        'count__max_df': [.01, .05, .1],
        'tf_idf__ngram_range': [.01, .05, .1],
        'tfidf__ngram_range': [(1,1), (2,2), (3,3)],
    }),
}
grid = GridSearchCV(train_new.text, train.label, n_jobs=2, cv=2, verbose=1)

but I got this: ValueError: Parameter values for parameter (0) need to be a sequence(but not a string) or np.ndarray.

Could you help me? I want to transform the text into a matrix before applying the models.

bmurauer commented 4 years ago

I have not tried it, but unfortunately the generate() method must be called for each PipelineHelper element, even if you don't explicitly set any values for it. So your params should look like this:

params = {
    'count_or_tf__selected_model': pipe.named_steps['count_or_tf'].generate({
        'count__ngram_range': [(1,1), (2,2), (3,3)],
        'count__max_df': [.01, .05, .1],
        'tf_idf__ngram_range': [.01, .05, .1],
        'tfidf__ngram_range': [(1,1), (2,2), (3,3)],
    }),
    'clf__selected_model': pipe.named_steps['clf'].generate(),
}
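
One more thing I noticed: in your snippet, GridSearchCV is given the training data directly, but its first two arguments should be the estimator and the parameter grid; the data goes into fit() instead. A minimal sketch of the intended wiring (assuming train_new.text holds the documents and train.label the targets, as in your example):

# pass the pipeline and the parameter grid to GridSearchCV,
# then fit on the training data
grid = GridSearchCV(pipe, params, n_jobs=2, cv=2, verbose=1)
grid.fit(train_new.text, train.label)
print(grid.best_params_)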
bmurauer commented 4 years ago

Oh also, the parameter ngram_range expects a tuple as value, so the third line in your parameter list does not make sense, I think (and its tf_idf prefix doesn't match the step name tfidf anyway). You have the correct values in the fourth line, so just delete the third one.
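
Putting both comments together, a cleaned-up parameter grid would look like this (a sketch, reusing the value ranges from your example):

params = {
    'count_or_tf__selected_model': pipe.named_steps['count_or_tf'].generate({
        'count__ngram_range': [(1,1), (2,2), (3,3)],
        'count__max_df': [.01, .05, .1],
        'tfidf__ngram_range': [(1,1), (2,2), (3,3)],
    }),
    # no parameters tuned for the classifiers, but generate() is still required
    'clf__selected_model': pipe.named_steps['clf'].generate(),
}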

bmurauer commented 4 years ago

I'm closing this issue as you seem to have figured it out :-)