bmurauer / pipelinehelper

scikit-helper to hot-swap pipeline elements
GNU General Public License v3.0

ValueError: selected model does not provide classes_ with dask_ml.model_selection #15

Closed: Smolky closed this issue 2 years ago

Smolky commented 2 years ago

Hello. A few months ago, I combined pipelinehelper with dask and RandomizedSearchCV.

However, after a problem with my computer that forced me to reinstall the virtual environment, every time I try to run it I get the following error: ValueError: selected model does not provide classes_

I am not sure whether this problem arises from a recent update of pipelinehelper or dask, or whether it is my mistake.

This is my pipeline:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import VarianceThreshold, SelectKBest
    from pipelinehelper import PipelineHelper

    pipe = Pipeline([
        ('features', TfidfVectorizer()),
        ('select', PipelineHelper([
            ('vt', VarianceThreshold()),
            ('skbest', SelectKBest()),
        ])),
        ('classifier', PipelineHelper([classifier]))
    ])

And these are the parameters I am trying for a Support Vector Machine model:

{
  'features__analyzer': ['word'],
  'features__ngram_range': [(1, 1)],
  'features__sublinear_tf': [True, False],
  'features__strip_accents': [None, 'unicode'],
  'features__use_idf': [True, False],
  'features__max_df': [0.01, 0.1, 1],
  'classifier__selected_model': [
    ('svm', {'C': 1, 'kernel': 'rbf'}),
    ('svm', {'C': 1, 'kernel': 'poly'}),
    ('svm', {'C': 1, 'kernel': 'linear'})
  ]
}

However, when I switch from dask to sklearn.model_selection.RandomizedSearchCV, everything seems to work fine (but then I lose distributed execution).

What does "`ValueError: selected model does not provide classes" means?. I am not sure about it by searching the source code

Kind regards!

bmurauer commented 2 years ago

Hi Smolky, thank you for reporting this issue. I'm afraid I don't know exactly how dask manages the parallelism in detail. In order to debug this, could you please add some more information?

The error itself merely states that some part of your script is trying to access the variable classes_ of your model (I suspect this happens at the evaluation step of a split? Could you check this?). Technically, your model in this pipeline is 'just' the PipelineHelper object, which tries to delegate this property call to the actual model. In your case, the actual model does not have a classes_ property. It could be that dask loses some information about the model when copying/forking/merging things; I am not sure about this.
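
A minimal sketch of the delegation described above (illustrative only; the actual pipelinehelper source may differ):

class SelectedModelProxy:
    """Illustrative stand-in for how PipelineHelper delegates attributes."""
    def __init__(self, selected_model):
        self.selected_model = selected_model

    @property
    def classes_(self):
        # Forward the lookup to the currently selected model; if that
        # model has no classes_ (a regressor, or a model whose fitted
        # state was lost while being copied), raise the error above.
        if not hasattr(self.selected_model, 'classes_'):
            raise ValueError('selected model does not provide classes_')
        return self.selected_model.classes_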

Smolky commented 2 years ago

Dear BMurauer, thanks for the quick response.

The classifier variable is one of these:

('mnb_classifier', MultinomialNB()),
('lr', LogisticRegression(max_iter=4000)),
('svm', SVC(probability=True)),
('k_classifier', KNeighborsClassifier(n_neighbors=2)),
('j48', DecisionTreeClassifier()),
('rf', RandomForestClassifier(bootstrap=False, max_features='auto', min_samples_leaf=2, min_samples_split=2))

Regarding the parameters, it is a bit more complicated. As I want the results for each method and n-gram size separately (this is for research purposes), I put all of this inside three nested loops (one over the models, another for selecting between character n-grams or word n-grams, and the last one over the n-gram size); a rough scaffold is sketched below.
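
A hypothetical scaffold of those loops (names are illustrative, not from the actual script; `classifiers` is the list of (name, estimator) tuples above):

for classifier in classifiers:
    for analyzer in ['char', 'word']:
        for n in (1, 2, 3):
            features__ngram_range = (n, n)
            # ... build the parameter grid and run the search,
            # as in the snippets below ...
            pass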

# @var classifier_hyperparameters Filter only those parameters related to the classifiers we use
classifier_hyperparameters = {
    key: hyperparameter
    for key, hyperparameter in hyperparameters.items()
    if key.startswith(tuple(classifier_key[0] + "__" for classifier_key in classifiers))
}

# @var parameters Dictionary
parameters = {
    'classifier__selected_model': pipe.named_steps['classifier'].generate(classifier_hyperparameters)
}

# Create the specific bag-of-words features, from unigrams to trigrams
features = {
    'features__analyzer': [analyzer],
    'features__ngram_range': [features__ngram_range]
}

# Mix the specific and generic parameters for the character n-grams and the word n-grams
features = {**features, **features_options}

# Mix the features with the classifier parameters
features['classifier__selected_model'] = pipe.named_steps['classifier'].generate(classifier_hyperparameters)

# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = [features]

Next...

# @var search RandomizedSearchCV
search = sklearn.model_selection.RandomizedSearchCV(
    pipe, param_grid,
    cv=split,
    n_iter=n_iter,
    scoring=scoring_metric,
    random_state=bootstrap.seed,
    refit=True
)

bmurauer commented 2 years ago

Unfortunately, I don't have the resources to dig into this issue at the moment. I am also still not quite sure how your complete setup works, or why you nest loops in order to test multiple setups; that should work out of the box using the parameter grid, I think. If you could add a minimal, self-contained example demonstrating the error, I might find an opportunity to look into it.

MichaelHopwood commented 2 years ago

Here's an example:

from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from pipelinehelper import PipelineHelper

X, y = make_regression()

pipe = Pipeline([
    ('scaler', PipelineHelper([
        ('std', StandardScaler()),
    ])),
    ('clf', PipelineHelper([
        ('lr', LinearRegression()),
        ('lasso', Lasso()),
        ('ridge', Ridge()),
    ])),
])

params = {
    'scaler__selected_model': pipe.named_steps['scaler'].generate({
        'std__with_mean': [True, False],
    }),
    'clf__selected_model': pipe.named_steps['clf'].generate({
        'lasso__alpha': [1e-2, 1e-1],
        'ridge__alpha': [1e-2, 1e-1],
    }),
}

from dask_ml.model_selection import GridSearchCV
grid = GridSearchCV(pipe, params, scoring='neg_mean_absolute_error')
grid.fit(X, y)

This would work if you change `from dask_ml.model_selection import GridSearchCV` to `from sklearn.model_selection import GridSearchCV`, but then you don't get the benefits of dask.
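
Concretely, the swap amounts to replacing the last three lines of the example (pipe and params stay the same):

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, params, scoring='neg_mean_absolute_error')
grid.fit(X, y)  # reported to run without the classes_ error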

The error is `ValueError: selected model does not provide classes_` regardless of whether it is a regression or classification problem.

bmurauer commented 2 years ago

Ah I see, thanks for the example.

It seems that the current implementation depends on how sklearn clones the models internally, and dask does it some other way.
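
For illustration, sklearn.base.clone rebuilds an estimator from its constructor parameters via get_params(), dropping any fitted state; how dask_ml copies estimators internally is an assumption here, not something verified:

from sklearn.base import clone
from sklearn.linear_model import Ridge

est = Ridge(alpha=0.1)
fresh = clone(est)  # new, unfitted Ridge with the same parameters
assert fresh.get_params()['alpha'] == 0.1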

Unfortunately, I don't have the resources to look further into this, but I'd welcome a pull request.