Closed Smolky closed 2 years ago
Hi Smolky, thank you for reporting this issue. I'm afraid I don't know exactly how dask manages the parallelism in detail. In order to debug this, could you please add this information: what is inside your classifier variable? The error itself merely states that some part of your script is trying to access the classes_ attribute of your model (I suspect this happens at the evaluation step of a split? Could you check this?). Technically, the model in this pipeline is 'just' the PipelineHelper object, which tries to delegate this property call to the actual model. In your case, the actual model does not have a classes_ attribute. It could be that dask is losing some information of the model when copying/forking/merging, but I am not sure about this.
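To make the delegation idea concrete, here is a stdlib-only sketch (the names and the real PipelineHelper code differ; this is only an illustration of the failure mode): a wrapper forwards the classes_ lookup to whatever inner model is currently selected, and raises when that model has no such attribute.

```python
class ModelWrapper:
    """Toy stand-in for a PipelineHelper-like wrapper (hypothetical names)."""

    def __init__(self, selected_model):
        self.selected_model = selected_model

    @property
    def classes_(self):
        # Delegation fails when the inner model (e.g. a regressor, or a
        # classifier whose fitted state was lost) has no classes_ attribute.
        if not hasattr(self.selected_model, 'classes_'):
            raise ValueError('selected model does not provide classes_')
        return self.selected_model.classes_


class FittedClassifier:
    classes_ = [0, 1]  # fitted classifiers expose their labels here


class Regressor:
    pass  # regressors never have classes_


print(ModelWrapper(FittedClassifier()).classes_)  # [0, 1]
# ModelWrapper(Regressor()).classes_ raises the ValueError seen in this issue
```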
Dear BMurauer, thanks for the quick response:
The classifier variable is one of these:
('mnb_classifier', MultinomialNB()),
('lr', LogisticRegression(max_iter=4000)),
('svm', SVC(probability=True)),
('k_classifier', KNeighborsClassifier(n_neighbors=2)),
('j48', DecisionTreeClassifier()),
('rf', RandomForestClassifier(bootstrap=False, max_features='auto', min_samples_leaf=2, min_samples_split=2))
Regarding the parameters, it is a bit more complicated. As I want the results for each method and n-gram size separately (this is for research purposes), I put all of this inside three nested loops (one for the model, another for selecting between character n-grams or word n-grams, and the last one for the size):
# @var classifier_hyperparameters Filter only those parameters related to the classifiers we use
classifier_hyperparameters = {
    key: hyperparameter
    for key, hyperparameter in hyperparameters.items()
    if key.startswith(tuple(classifier_key[0] + "__" for classifier_key in classifiers))
}

# @var parameters Dictionary
parameters = {
    'classifier__selected_model': pipe.named_steps['classifier'].generate(classifier_hyperparameters)
}
# Create the specific bag-of-words features from unigrams to trigrams
features = {
    'features__analyzer': [analyzer],
    'features__ngram_range': [features__ngram_range]
}

# Mix the specific and generic parameters for the character n-grams and the word n-grams
features = {**features, **features_options}

# Mix the features with the classifier parameters
features['classifier__selected_model'] = pipe.named_steps['classifier'].generate(classifier_hyperparameters)

# Parameters of pipelines can be set using '__'-separated parameter names
param_grid = [features]
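As a side note, the three nested loops described above could also be flattened into a single list-of-dicts param_grid, since sklearn searches each dict in the list separately. A stdlib sketch with placeholder values (the classifier names and grid keys here are invented for illustration):

```python
# Hypothetical stand-ins for the real loop variables
classifiers = ['mnb', 'svm']
analyzers = ['char', 'word']
ngram_ranges = [(1, 1), (1, 2), (1, 3)]

# One dict per (classifier, analyzer, ngram_range) combination; sklearn's
# search classes accept such a list and evaluate each grid independently.
param_grid = [
    {
        'features__analyzer': [analyzer],
        'features__ngram_range': [ngram_range],
        'classifier__selected_model': [clf],
    }
    for clf in classifiers
    for analyzer in analyzers
    for ngram_range in ngram_ranges
]
print(len(param_grid))  # 2 * 2 * 3 = 12 separate grids
```

The per-combination scores then all land in a single cv_results_, so the results can still be separated afterwards by parameter value.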
Next...
# @var search RandomizedSearchCV
search = sklearn.model_selection.RandomizedSearchCV(
    pipe, param_grid,
    cv=split,
    n_iter=n_iter,
    scoring=scoring_metric,
    random_state=bootstrap.seed,
    refit=True
)
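For context, the n_iter argument means the search samples that many parameter combinations from the grid instead of trying all of them. A simplified stdlib sketch of the idea (the real implementation uses sklearn's ParameterSampler; the grid values here are made up):

```python
import itertools
import random

# Hypothetical grid for illustration
grid = {
    'classifier__C': [0.1, 1, 10],
    'features__ngram_range': [(1, 1), (1, 2), (1, 3)],
}

# Enumerate the full cartesian product of the grid (what GridSearchCV tries)
all_combos = [dict(zip(grid, values))
              for values in itertools.product(*grid.values())]

rng = random.Random(42)  # plays the role of random_state
n_iter = 4
sampled = rng.sample(all_combos, n_iter)  # what RandomizedSearchCV tries
print(len(all_combos), len(sampled))  # 9 4
```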
Unfortunately, I don't have the resources to dig into this issue at the moment. I am also still not quite sure how your complete setup works, or why you nest loops to test multiple setups; that should work out of the box using the parameter grid, I think. If you could add a minimal self-contained example demonstrating the error, I might have a chance to look into it.
Here's an example:
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from pipelinehelper import PipelineHelper

X, y = make_regression()

pipe = Pipeline([
    ('scaler', PipelineHelper([
        ('std', StandardScaler()),
    ])),
    ('clf', PipelineHelper([
        ('lr', LinearRegression()),
        ('lasso', Lasso()),
        ('ridge', Ridge()),
    ])),
])
params = {
    'scaler__selected_model': pipe.named_steps['scaler'].generate({
        'std__with_mean': [True, False],
    }),
    'clf__selected_model': pipe.named_steps['clf'].generate({
        'lasso__alpha': [1e-2, 1e-1],
        'ridge__alpha': [1e-2, 1e-1],
    }),
}

from dask_ml.model_selection import GridSearchCV
grid = GridSearchCV(pipe, params, scoring='neg_mean_absolute_error')
grid.fit(X, y)
This would work if you change
from dask_ml.model_selection import GridSearchCV
to
from sklearn.model_selection import GridSearchCV
but then you don't get the benefits of dask.
The error is ValueError: selected model does not provide classes_, no matter whether it is a regression or a classification problem.
Ah I see, thanks for the example.
It seems that the current implementation depends on how sklearn clones the models internally, and dask does it some other way.
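To illustrate why cloning matters here, a toy sketch of sklearn-style cloning (this is not the actual dask code path, just the general principle): clone() rebuilds an estimator from its constructor parameters alone, so any fitted attribute such as classes_ is dropped on the copy.

```python
class Estimator:
    """Minimal estimator-like class (hypothetical, for illustration only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self):
        return {'alpha': self.alpha}

    def fit(self):
        self.classes_ = [0, 1]  # fitted state, marked by trailing underscore
        return self


def clone(estimator):
    # Rebuild from constructor parameters only, in the spirit of
    # sklearn.base.clone; fitted attributes are deliberately not copied.
    return type(estimator)(**estimator.get_params())


fitted = Estimator(alpha=0.5).fit()
copy = clone(fitted)
print(hasattr(fitted, 'classes_'), hasattr(copy, 'classes_'))  # True False
```

If a parallel backend clones or serializes the wrapper at a different point than sklearn does, the selected inner model may end up in this unfitted state when classes_ is accessed.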
Unfortunately, I don't have the resources to look further into this, but I'd welcome a pull request.
Hello. A few months ago, I combined pipelinehelper with dask and RandomizedSearchCV.
However, after a problem with my computer that forced me to reinstall the virtual environment, every time I try to run it I get the following error:
ValueError: selected model does not provide classes_
I am not sure whether this problem arises from a recent update of pipelinehelper or dask, or whether it is my mistake.
This is my pipeline:
And these are the parameters I am trying for a Support Vector Machine model.
However, when I switch from dask to sklearn.model_selection.RandomizedSearchCV, everything seems to work fine (but I lose distributed running).
What does "ValueError: selected model does not provide classes_" mean? I could not figure it out by searching the source code.
Kind regards!