Y-oHr-N opened 4 years ago
Hi @flamby,
As you said, random seed averaging is possible by default. The first way is to pass OGBMModel to RandomSeedAveragingModel. This takes a very long time because tuning is performed once per seed. The second way is to train OGBMModel and then pass LGBMModel with the best hyperparameters to RandomSeedAveragingModel. This requires only one tuning run, but does not guarantee that the best model is trained for each seed.
This is a simple example using OptGBM 0.5.0.
import lightgbm as lgb
from mllib.ensemble import RandomSeedAveragingClassifier
from optgbm.sklearn import OGBMClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# 1. LightGBM
model = lgb.LGBMClassifier(random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test) # acc = 0.960...
# 2. LightGBM + random seed averaging
model = lgb.LGBMClassifier()
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test) # acc = 0.960...
# 3. OptGBM (fold averaging)
model = OGBMClassifier(n_trials=20, random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test) # acc = 0.977...
# 4. OptGBM (single model)
model = OGBMClassifier(n_trials=20, random_state=0)
model.fit(X_train, y_train)
model.refit(X_train, y_train)
score = model.score(X_test, y_test) # acc = 0.980...
# 5. OptGBM (fold averaging) + random seed averaging (tune `n_estimators` times)
model = OGBMClassifier(n_trials=20)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test) # acc = 0.984...
# 6. OptGBM (fold averaging) + random seed averaging (tune only once)
model = OGBMClassifier(n_trials=20, random_state=0)
model.fit(X_train, y_train)
model = lgb.LGBMClassifier(**model.study_.best_params)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test) # acc = 0.968...
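For intuition, the core of random seed averaging can be sketched with plain scikit-learn. This is a hypothetical minimal re-implementation, not the mllib code; a DecisionTreeClassifier stands in for LGBMClassifier so the snippet is self-contained, and the helper names are made up for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def seed_average_fit(base_estimator, X, y, n_estimators=10, random_state=0):
    """Fit clones of the base estimator that differ only in their random seed."""
    rng = np.random.RandomState(random_state)
    estimators = []
    for _ in range(n_estimators):
        est = clone(base_estimator)
        est.set_params(random_state=rng.randint(np.iinfo(np.int32).max))
        estimators.append(est.fit(X, y))
    return estimators


def seed_average_predict(estimators, X):
    """Average the per-seed class probabilities and take the argmax."""
    probas = np.mean([e.predict_proba(X) for e in estimators], axis=0)
    return estimators[0].classes_[np.argmax(probas, axis=1)]


X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimators = seed_average_fit(DecisionTreeClassifier(), X_train, y_train)
y_pred = seed_average_predict(estimators, X_test)
acc = (y_pred == y_test).mean()
```

The single `random_state` only seeds the generator that draws the per-member seeds, so runs are reproducible while the members still differ.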
By the way, mllib is no longer maintained and most of its code has been ported to pretools. I am planning to implement random seed averaging in pretools or in OGBMModel.refit.
Hi @Y-oHr-N,
Thanks for the clarification. I ran all your examples on my dataset (OptGBM 0.5.0 and mllib from the current git master branch) and indeed saw small improvements. I must test it more extensively.
Except for example 5, which gave me the error below:
lib/python3.7/site-packages/lightgbm/sklearn.py in set_params(self, **params)
366 setattr(self, key, value)
367 if hasattr(self, '_' + key):
--> 368 setattr(self, '_' + key, value)
369 self._other_params[key] = value
370 return self
AttributeError: can't set attribute
I rely heavily on the decision threshold to improve my classification precision/recall, thanks to the predict_proba method. If it makes sense and is doable, do you plan to implement this method in RandomSeedAveraging{Classifier,Regressor}?
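For context, the kind of thresholding described above works with any estimator exposing predict_proba. A plain scikit-learn sketch (LogisticRegression stands in for the averaged model, and the 0.3 threshold is an arbitrary illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Probability of the positive class for each test sample.
proba = model.predict_proba(X_test)[:, 1]

# The default predict() corresponds to a 0.5 threshold.
y_default = (proba >= 0.5).astype(int)

# Lowering the threshold trades precision for recall.
y_lowered = (proba >= 0.3).astype(int)

precision_default = precision_score(y_test, y_default)
recall_default = recall_score(y_test, y_default)
recall_lowered = recall_score(y_test, y_lowered)
```

Once RandomSeedAveragingClassifier exposes predict_proba, the same thresholding applies to the averaged probabilities.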
Thanks and keep the good work!
Except for example 5, which gave me the error below:
I noticed the bug yesterday, immediately fixed it and released 0.5.0. If you are really using 0.5.0, please tell me your environment in detail. Example 5 works fine in my environment.
If it makes sense and is doable, do you plan to implement this method in RandomSeedAveraging{Classifier,Regressor}?
I will seriously consider implementing it, but I cannot guarantee that it will land soon. I would be glad if you could wait patiently or send a PR.
Thank you for your feedback.
I noticed the bug yesterday, immediately fixed it and released 0.5.0. If you are really using 0.5.0, please tell me your environment in detail. Example 5 works fine in my environment.
You're right. It appears it was a Jupyter cache issue. Silly me. Restarting the kernel fixed it.
I will seriously consider implementing it, but I cannot guarantee that it will land soon. I would be glad if you could wait patiently or send a PR.
I've already monkey patched it, taking inspiration from how sklearn's VotingClassifier does it.
Here it is. I hope I did not make mistakes.
import warnings
import numpy as np

model = lgb.LGBMClassifier()

def predict_proba(self, X):
    self._check_is_fitted()
    probas = np.asarray([e.predict_proba(X) for e in self.estimators_])
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=RuntimeWarning)
        avg = np.average(probas, axis=0)
    return avg

# monkey patching
RandomSeedAveragingClassifier.predict_proba = predict_proba

model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)
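As an aside, the same averaging of predict_proba can be obtained directly with sklearn's VotingClassifier in 'soft' mode over differently seeded clones. This is a sketch of the equivalent pattern, not the mllib implementation, with a DecisionTreeClassifier standing in for LGBMClassifier:

```python
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ten clones of the same estimator, differing only in their seed.
base = DecisionTreeClassifier()
seeded = [('seed_%d' % i, clone(base).set_params(random_state=i))
          for i in range(10)]

# 'soft' voting averages predict_proba over the member estimators.
model = VotingClassifier(seeded, voting='soft')
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)
```

The averaged rows still sum to 1, so the usual threshold-based decisions apply unchanged.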
Thank you for sharing your code. I implemented it in pretools and released the package to PyPI. Please try it.
Thank you very much @Y-oHr-N I'll test it in the coming days.
Hi @flamby,
I noticed that example 6 had a mistake. The modified code is as follows.
# 6. OptGBM (fold averaging) + random seed averaging (tune only once)
model = OGBMClassifier(n_trials=20, random_state=0)
model.fit(X_train, y_train)
model = lgb.LGBMClassifier(n_estimators=model.best_iteration_, **model.best_params_)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test) # acc = 0.982...
Hi @Y-oHr-N
Thanks. I finally had time to test it, and it works like a charm.
Hi @Y-oHr-N
Is it that you want to make your mllib.ensemble.RandomSeedAveragingRegressor and mllib.ensemble.RandomSeedAveragingClassifier compatible somehow with OptGBM? I'd thought that since OptGBM follows the sklearn API, it would be compatible by default. Or am I missing something?
Anyway, random seed averaging is something I'll need to test, as the stdev of differently seeded models is sometimes very high, and averaging could yield a more robust model.
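The seed-to-seed variance mentioned above is easy to measure. A sketch comparing the spread of single-seed scores with the probability-averaged model, using a scikit-learn DecisionTreeClassifier as a self-contained stand-in for LightGBM:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Accuracy and probabilities of each individually seeded model.
scores, probas = [], []
for seed in range(10):
    model = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
    probas.append(model.predict_proba(X_test))

std_single = np.std(scores)  # seed-to-seed spread of the single models

# Averaged model: argmax of the mean per-seed probabilities.
classes = np.unique(y_train)
y_avg = classes[np.argmax(np.mean(probas, axis=0), axis=1)]
acc_avg = (y_avg == y_test).mean()
```

Comparing `std_single` against the averaged model's accuracy on a held-out set is a quick way to check whether averaging actually stabilizes a given dataset.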
Do you have datasets in mind on which it's improving?
Thanks