dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Cannot reproduce optimization results for XGBClassifier #7057

Open RonaldGalea opened 3 years ago

RonaldGalea commented 3 years ago

Hello, I'm having an issue getting reproducible results when optimizing an XGBClassifier. I'm using BayesSearchCV to optimize some hyperparameters defined in a search grid. When running locally, everything works as expected and I get the same results each time. However, running on a (local) dask cluster gives different results on every run; please see the code snippet below.

Code to reproduce issue:

from copy import deepcopy

import numpy as np
import pandas as pd
import skopt.space as skspace
import xgboost
from dask.distributed import Client
from distributed import LocalCluster
from IPython.display import display
from joblib import parallel_backend
from skopt import BayesSearchCV

# generate some data
np.random.seed(0)
train_data = pd.DataFrame(np.random.rand(1000, 10))
labels = pd.Series(np.random.randint(2, size=1000))

# define xgb model, bayesian search grid and optimizer
model = xgboost.XGBClassifier(random_state=0)

xgboost_search_grid = {
    'n_estimators': skspace.Integer(45, 100),
    'max_depth': skspace.Integer(5, 15),
    'colsample_bytree': skspace.Real(0.08, 0.3),
    'subsample': skspace.Real(0.2, 0.8),
    'learning_rate': skspace.Real(0.01, 0.15)
}

opt = BayesSearchCV(model, xgboost_search_grid, random_state=0, n_jobs=-1, n_iter=4, n_points=2, cv=10, refit=False)

# run optimization locally twice, we can see results are the same
local_res1 = deepcopy(opt.fit(train_data, labels))
print("Local run 1:", local_res1.cv_results_['mean_test_score'], "\n")

local_res2 = deepcopy(opt.fit(train_data, labels))
print("Local run 2:", local_res2.cv_results_['mean_test_score'], "\n")

# set up local dask cluster
cluster = LocalCluster()
client = Client(cluster)

# run optimization on a local dask cluster -> The results are different each time
with parallel_backend('dask'):
    dask_res1 = deepcopy(opt.fit(train_data, labels))
print("Dask run 1:", dask_res1.cv_results_['mean_test_score'], "\n")

with parallel_backend('dask'):
    dask_res2 = deepcopy(opt.fit(train_data, labels))
print("Dask run 2:", dask_res2.cv_results_['mean_test_score'], "\n")

# inspecting the full results, we can see that the sets of hyperparameters that are evaluated are identical for both runs
# this should mean that the difference comes from training the xgboost models
display("Evaluated hyperparameters dask run 1:", dask_res1.cv_results_["params"])
print("\n\n----------------------\n\n")
display("Evaluated hyperparameters dask run 2:", dask_res2.cv_results_["params"])

# just to confirm
assert dask_res1.cv_results_["params"] == dask_res2.cv_results_["params"]
assert not np.array_equal(dask_res1.cv_results_["mean_test_score"],
                          dask_res2.cv_results_["mean_test_score"])

Output:

Local run 1: [0.529 0.491 0.511 0.512] 

Local run 2: [0.529 0.491 0.511 0.512] 

Dask run 1: [0.506 0.522 0.516 0.516] 

Dask run 2: [0.527 0.518 0.518 0.514] 

'Evaluated hyperparameters dask run 1:'
[OrderedDict([('colsample_bytree', 0.19681211628947243),
              ('learning_rate', 0.10465113124276788),
              ('max_depth', 11),
              ('n_estimators', 81),
              ('subsample', 0.7152462989930835)]),
 OrderedDict([('colsample_bytree', 0.29572237778914645),
              ('learning_rate', 0.02792961188398764),
              ('max_depth', 5),
              ('n_estimators', 77),
              ('subsample', 0.3453743955677361)]),
 OrderedDict([('colsample_bytree', 0.13684176563400458),
              ('learning_rate', 0.1470465121796906),
              ('max_depth', 14),
              ('n_estimators', 78),
              ('subsample', 0.35751455362325446)]),
 OrderedDict([('colsample_bytree', 0.11581214899157613),
              ('learning_rate', 0.11818146284860102),
              ('max_depth', 11),
              ('n_estimators', 51),
              ('subsample', 0.31304057362794935)])]

----------------------

'Evaluated hyperparameters dask run 2:'
[OrderedDict([('colsample_bytree', 0.19681211628947243),
              ('learning_rate', 0.10465113124276788),
              ('max_depth', 11),
              ('n_estimators', 81),
              ('subsample', 0.7152462989930835)]),
 OrderedDict([('colsample_bytree', 0.29572237778914645),
              ('learning_rate', 0.02792961188398764),
              ('max_depth', 5),
              ('n_estimators', 77),
              ('subsample', 0.3453743955677361)]),
 OrderedDict([('colsample_bytree', 0.13684176563400458),
              ('learning_rate', 0.1470465121796906),
              ('max_depth', 14),
              ('n_estimators', 78),
              ('subsample', 0.35751455362325446)]),
 OrderedDict([('colsample_bytree', 0.11581214899157613),
              ('learning_rate', 0.11818146284860102),
              ('max_depth', 11),
              ('n_estimators', 51),
              ('subsample', 0.31304057362794935)])]

It appears the differences come from the training of the XGBClassifiers, because the sets of evaluated hyperparameters are the same across runs. I also noticed there might be something amiss with this particular search grid, because if any of the entries are removed, the results are reproducible again.
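
One way to double-check that hypothesis is to fit the same classifier twice in a row in a single process and compare the dumped trees. A minimal sketch (not part of the run above; the hyperparameter values are only illustrative, chosen to resemble one of the evaluated sets):

# Isolation check: two sequential fits with identical settings should produce
# identical boosters if training itself is deterministic in a single process.
import numpy as np
import xgboost

np.random.seed(0)
X = np.random.rand(1000, 10)
y = np.random.randint(2, size=1000)

params = dict(n_estimators=81, max_depth=11, colsample_bytree=0.2,
              subsample=0.7, learning_rate=0.1, random_state=0)

m1 = xgboost.XGBClassifier(**params).fit(X, y)
m2 = xgboost.XGBClassifier(**params).fit(X, y)

# If the dumped trees match here, the differences seen on dask are tied to how
# the models are trained on the workers, not to the data or the search itself.
print(m1.get_booster().get_dump() == m2.get_booster().get_dump())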

Library versions:

XGBoost version: 1.3.3
Scikit-optimize version: 0.9.dev0
Joblib version: 1.0.1
Dask version: 2021.04.0

hcho3 commented 3 years ago

Have you tried removing all sampling hyperparameters from the search grid?
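
For reference, a reduced grid along those lines, reusing the bounds from the reproduction snippet but dropping the two sampling hyperparameters (subsample and colsample_bytree), might look like this (a sketch only):

import skopt.space as skspace
from skopt import BayesSearchCV

# Same bounds as the original grid, minus the sampling hyperparameters, to
# check whether the dask results become reproducible without them.
# Reuses `model`, `train_data` and `labels` from the snippet above.
reduced_search_grid = {
    'n_estimators': skspace.Integer(45, 100),
    'max_depth': skspace.Integer(5, 15),
    'learning_rate': skspace.Real(0.01, 0.15),
}

opt_reduced = BayesSearchCV(model, reduced_search_grid, random_state=0,
                            n_jobs=-1, n_iter=4, n_points=2, cv=10, refit=False)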

trivialfis commented 3 years ago

This is likely the same issue caused by the global RNG. I have a WIP branch removing it and will prioritize it.
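
Until that lands, one possible workaround to try (assuming the nondeterminism comes from concurrent fits sharing a process-global RNG) is to give each dask worker a single thread, so that only one model trains per worker process at a time. A sketch, reusing `opt`, `train_data` and `labels` from the reproduction snippet:

from copy import deepcopy

from dask.distributed import Client, LocalCluster
from joblib import parallel_backend

# Single-threaded workers: at most one fit runs per worker process at a time,
# so concurrent fits cannot interleave draws from a shared global RNG.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

with parallel_backend('dask'):
    dask_res = deepcopy(opt.fit(train_data, labels))
print(dask_res.cv_results_['mean_test_score'])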

RonaldGalea commented 3 years ago

@hcho3 Removing any one of the 5 hyperparameters from the grid makes it work correctly; it is somehow this exact combination of 5 that causes the issue.
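
A sketch of how such a leave-one-out check can be run (not the exact code used; it reuses `model`, `xgboost_search_grid`, `train_data`, `labels` and the dask client from the reproduction snippet above):

from copy import deepcopy

import numpy as np
from joblib import parallel_backend
from skopt import BayesSearchCV

# Drop one hyperparameter at a time and run the search on dask twice; the
# observation above is that every 4-of-5 subset gives identical results.
for dropped in xgboost_search_grid:
    grid = {k: v for k, v in xgboost_search_grid.items() if k != dropped}
    opt_sub = BayesSearchCV(model, grid, random_state=0, n_jobs=-1,
                            n_iter=4, n_points=2, cv=10, refit=False)
    with parallel_backend('dask'):
        res_a = deepcopy(opt_sub.fit(train_data, labels))
        res_b = deepcopy(opt_sub.fit(train_data, labels))
    print(dropped, np.array_equal(res_a.cv_results_['mean_test_score'],
                                  res_b.cv_results_['mean_test_score']))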

apatange-source commented 2 years ago

Is this issue closed or is it still WIP?