dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Cannot use random state passed by IterativeImputer #5030

Closed david-cortes closed 5 years ago

david-cortes commented 5 years ago

If I try to use XGBRegressor as the estimator in scikit-learn's IterativeImputer, it fails because of the random state that the imputer passes to it. Example:

estimator = make_pipeline(
    IterativeImputer(XGBRegressor()),
    Ridge()
)

XGBoostError: Invalid Parameter format for seed expect int but value='<mtrand.RandomState object at 0x7f865f1d30f0>'

It would be nice if xgboost's scikit-learn API classes could accept this type of random state, as scikit-learn's own classifiers and regressors do.

Full example:

import numpy as np
from xgboost import XGBRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

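# Synthetic data: 100 rows, 5 features, random target
np.random.seed(1)
X = np.random.normal(size=(100, 5))
# Blank out a few randomly chosen entries so the imputer has something to fill
X[np.random.randint(100, size=10), np.random.randint(5, size=10)] = np.nan
y = np.random.normal(size=100)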
estimator = make_pipeline(
    IterativeImputer(XGBRegressor(n_jobs=16)),
    XGBRegressor()
)
cross_val_score(
    estimator, X, y, scoring='neg_mean_squared_error',
    cv=5
)

trivialfis commented 5 years ago

@david-cortes Would you like to open a PR, as you are already familiar with it?

david-cortes commented 5 years ago

I’m not really familiar with how it works, but have been checking it in a bit more detail.

I’ve now realized that the RandomState class comes from NumPy. As of v1.17, it is a container for the state of a Mersenne Twister pseudo-random number generator. The classifier/regressor is supposed to use it for random number generation; its state changes as numbers are generated, and the updated state is then passed on to the next classifier/regressor.
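To make that description concrete, here is a small sketch (NumPy only, nothing from xgboost) that inspects a RandomState object. The variable names are just for illustration. It relies on RandomState.get_state(), which returns a tuple whose first three entries are the generator name, the key array and the current position in that array.

```python
import numpy as np

rng = np.random.RandomState(0)

# get_state() returns (name, key array, position, has_gauss, cached_gaussian)
name, keys_before, pos_before = rng.get_state()[:3]
keys_before = keys_before.copy()
print(name)               # 'MT19937'
print(keys_before.shape)  # the Mersenne Twister key array

# Drawing a number advances the generator, so the stored state changes
rng.randint(10)
_, keys_after, pos_after = rng.get_state()[:3]

state_changed = (pos_before != pos_after) or not np.array_equal(keys_before, keys_after)
print(state_changed)
```

Because the object carries its own mutable state, whoever draws from it changes what the next consumer will see.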

Since xgboost doesn’t use this Python class for random number generation, and it can use something other than the default C++ MT19937, I think it might not be feasible to simply convert states back and forth between Python and C++. Even then, NumPy might change its default RNG from MT19937 to something else in the future.

I guess a potential solution would be to use the Python RandomState object to generate a random integer, which would then be set as the seed for xgboost’s C++ RNG. That way it achieves the purpose of setting a reproducible seed, and the RandomState object's state is advanced, even though the quality of such short seeds might not be very good.
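A minimal sketch of that idea, using only NumPy. The helper name seed_from_random_state is hypothetical and is not part of xgboost; this is not necessarily what any eventual PR does, only one way the conversion could look.

```python
import numpy as np

def seed_from_random_state(random_state):
    """Hypothetical helper: turn a scikit-learn style random_state into an int seed."""
    if random_state is None:
        return None
    if isinstance(random_state, (int, np.integer)):
        return int(random_state)
    if isinstance(random_state, np.random.RandomState):
        # Drawing the seed also advances the RandomState, so the caller's
        # state is modified just as it would be by a native sklearn estimator.
        return int(random_state.randint(np.iinfo(np.int32).max))
    raise TypeError(f"Unsupported random_state type: {type(random_state)!r}")

# Two identically seeded RandomState objects yield the same integer seed
seed_a = seed_from_random_state(np.random.RandomState(42))
seed_b = seed_from_random_state(np.random.RandomState(42))
print(seed_a == seed_b)  # True
```

The resulting integer could then be passed wherever xgboost expects an integer seed. The upper bound of the draw (the int32 maximum) is an assumption chosen to stay within a signed 32-bit range.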

Would that be an acceptable solution?

trivialfis commented 5 years ago

I’m not really familiar with how it works, but have been checking it in a bit more detail.

@david-cortes It seems you have a lot more insight than me. ;-)

I guess a potential solution would be to use the python RandomState object to generate a random integer, which would then be set as seed for xgboost’s C++ RNG – that way, it achieves the purpose of setting a reproducible seed, and its random state is modified, even though the quality of these short seedings might not be very good.

Glancing through sklearn's documentation, is it a fair guess that the random state is changed per iteration? We can do that for XGBoost too; this way I believe the effect is the same. I haven't looked into their code yet, but sklearn has to do a similar thing to actually use RandomState, right?

david-cortes commented 5 years ago

I took a look at scikit-learn's IterativeImputer code. It seems the random state is modified at each iteration if the regressor/classifier generates any random numbers: the imputer simply sets the estimator's random_state attribute to its own RandomState object (which, AFAIK and from some quick testing, does not perform a deep copy), so once the estimator generates a random number, the shared state changes.
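The sharing behaviour can be reproduced without scikit-learn. The sketch below uses a made-up ToyEstimator class (not a real sklearn or xgboost class) to show that assigning the same RandomState object to two estimators shares one stream of random numbers between them.

```python
import numpy as np

class ToyEstimator:
    """Stand-in for an estimator that draws random numbers while fitting."""
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self):
        # Any draw mutates the RandomState object stored on the estimator
        return self.random_state.randint(1000)

shared = np.random.RandomState(0)

# Plain attribute assignment stores a reference, not a copy
first, second = ToyEstimator(), ToyEstimator()
first.random_state = shared
second.random_state = shared

pos_before = shared.get_state()[2]
draw_1 = first.fit()
pos_after = shared.get_state()[2]
draw_2 = second.fit()

# An independent generator with the same seed, drawn from twice in a row,
# produces the same two numbers: the estimators consumed a single stream.
reference = np.random.RandomState(0)
ref_1, ref_2 = reference.randint(1000), reference.randint(1000)
print((draw_1, draw_2) == (ref_1, ref_2))  # True
```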

I made a quick PR that uses the RandomState object to draw an integer seed, but I guess there are other possible approaches too.