sydneyhenrard opened this issue 5 years ago (status: Open)
This seems to indicate that the results of the jobs are empty or None. Are you sure your data set is not empty?
I tried with the parameter parallelize=False and I don't have the issue.
Furthermore, retrieving the best parameters does not seem to work; it appears to show the last tried parameters instead.
opt.get_best_params
<bound method GridSearch.get_best_params of GridSearch(cv_folds=3,
model=RandomForestRegressor(bootstrap=True, criterion='mse',
max_depth=None, max_features=0.33,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=100,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=40, n_jobs=None,
oob_score=False, random_state=0,
verbose=0, warm_start=False),
num_threads=8, parallelize=False,
param_grid={'bootstrap': [True],
'max_features': [0.5, 'sqrt', 'log2', 0.33],
'min_samples_leaf': [1, 3, 5, 10, 25, 100],
'n_estimators': [40]},
seed=0)>
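(Note: opt.get_best_params without parentheses only echoes the bound-method repr shown above, including the full param_grid, rather than the selected values. Assuming the method takes no arguments, the actual parameters come from calling it:)

opt.get_best_params()
# or, via the attribute used later in this thread:
opt.best_estimator_.get_params()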
To verify, I ran the model with the reported best params and with a different set of params:
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40,
                          min_samples_leaf=100,
                          max_features=0.33,
                          n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_score(m, X_train, y_train, X_valid, y_valid)  # poster's own metrics helper
m = RandomForestRegressor(n_estimators=40,
                          min_samples_leaf=3,
                          max_features=0.5,
                          n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_score(m, X_train, y_train, X_valid, y_valid)
As you can see, the second model is better (the OOB score is the last number printed), yet its parameters are not the ones reported as best.
Wall time: 6.61 s
[0.2593896702497389, 0.2768686574114764, 0.8593822384678886, 0.8631024854156082, 0.852173401851975]
Wall time: 12.3 s
[0.12585560380471036, 0.2279723348842033, 0.9668960405592746, 0.9071862610046989, 0.9084671021618435]
Maybe I am missing something about how to use the library.
Hm, that's interesting. If that's correct (and parallelize=False returns the last tried model, not the best one), then I'd say that's a bug. Pinging @cgnorthcutt
Hey folks, appreciate the insights here. I'm fully booked with the upcoming ML paper deadlines for the fall. If you can take a stab at a PR, I'll take a look; it might be some time before I can figure it out myself, just a heads up.
A couple of things to check:
same here
Hi, can you please post the complete code to reproduce this as simply as possible? @mouadriyad
I had the exact same problem the first time I used hypopt.
from sklearn import svm
from hypopt import GridSearch

svm_cls_rbf = svm.SVC(kernel='rbf')
param_grid = {'C': [10, 100, 200, 500, 750, 1000, 10000],
              'gamma': [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4]}
grid_search = GridSearch(model=svm_cls_rbf, param_grid=param_grid,
                         num_threads=8, parallelize=True)
grid_search.fit(feature_train, y_train, feature_validation, y_validation)
best_parameters = grid_search.best_estimator_.get_params()
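For completeness, assuming fit() finishes, the selected model can then be used directly through the same attribute (a sketch, not verified against the buggy versions discussed here):

predictions = grid_search.best_estimator_.predict(feature_validation)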
The problem is that every entry in the list of fitted models references the same underlying estimator object (they share a reference, not independent copies), so each new fit overwrites all the previously computed ones. Therefore the scores remain correct, but the stored models all end up identical to the last one fitted.
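Here is a minimal standalone sketch of that aliasing pitfall, using a plain scikit-learn estimator as a hypothetical stand-in for hypopt's internal loop:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression()
fitted = []
for c in [0.1, 1.0, 10.0]:
    model.set_params(C=c)
    model.fit(X, y)
    fitted.append(model)  # every entry references the same object
print(all(m is fitted[-1] for m in fitted))  # True: the "three models" are one object
print([m.C for m in fitted])  # [10.0, 10.0, 10.0], only the last fit survives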
What worked for me is:
def _run_thread_job(model_params):
    ...
    return model, score  # return the fitted model together with its score

and then, in fit(), exchange:

def fit(...):
    ...
    if self.parallelize:
        results = _parallel_param_opt(params, self.num_threads)
        models, scores = list(zip(*results))  # still needed for the parallel branch
    else:
        # results = [_run_thread_job(job) for job in params]  ## old
        models = []
        scores = []
        for i in range(len(params)):
            if i % 50 == 0:
                print(f'Nr of Model: {i}')
            model, score = _run_thread_job(params[i])
            # clone() stores an independent copy of the hyper-parameters;
            # appending the estimator itself would leave every list entry
            # pointing at the same, repeatedly refitted object
            models.append(sklearn.base.clone(model, safe=True))
            scores.append(score)
        # print(results)  ## old
        # models, scores = list(zip(*results))  ## old
    self.model = models[np.argmax(scores)]
    self.model.fit(X_train, y_train)  # the clone is unfitted, so the best model has to be fitted again
There is probably a better way to do this, but it worked for me :) https://scikit-learn.org/stable/modules/generated/sklearn.base.clone.html
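For reference, the property of sklearn.base.clone that matters here (per the linked docs) is that it copies the hyper-parameters but not the fitted state, which is exactly why the final refit is required:

from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
m = LogisticRegression(C=10.0).fit(X, y)
m2 = clone(m, safe=True)
print(m2.get_params()['C'])  # 10.0, hyper-parameters are copied
print(hasattr(m2, 'coef_'))  # False, the clone is unfitted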
@phibil Great! Could you please submit a pull request?
Hi, I just noticed the same bug; any news on this? Moreover, if I disable parallelization, the method returns an array of None.
I am still getting the same problem. Is there a fix for it?
Which OS are you working on? Plus, are you using the default parallelize=True within the call to GridSearch()?
As far as I could understand some months ago, the parallelization issues on Windows systems still hold. The practical solution might be using parallelize=False.
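For example, reusing the constructor call from earlier in this thread, the workaround would look like this (a sketch, assuming the same signature):

grid_search = GridSearch(model=svm_cls_rbf, param_grid=param_grid, parallelize=False)
grid_search.fit(feature_train, y_train, feature_validation, y_validation)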
I am using a Linux OS. I tried with both parallelize=True and parallelize=False.
After setting parallelize=False, I got another error, "zip argument #1 must support iteration", which is also one of the issues reported in your GitHub repository.
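For context, that is the error zip(*results) raises when the results list contains None entries (i.e. jobs that returned nothing), which matches the very first reply in this thread:

results = [None, None]  # jobs that produced no (model, score) pair
models, scores = list(zip(*results))
# TypeError: zip argument #1 must support iteration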
Same problem as Rajjat
Has this been resolved?
I tried to use your package with RandomForestRegressor, but I get an error.
The output: