Y-oHr-N / OptGBM

Optuna + LightGBM = OptGBM

Enable random seed averaging #40

Open · Y-oHr-N opened this issue 4 years ago

flamby commented 4 years ago

Hi @Y-oHr-N,

Is it that you want to make your mllib.ensemble.RandomSeedAveragingRegressor and mllib.ensemble.RandomSeedAveragingClassifier somehow compatible with OptGBM?

I'd have thought that since OptGBM follows the sklearn API, it would be compatible by default.

Or am I missing something?

Anyway, random seed averaging is something I'll need to test, as the standard deviation across differently seeded models is sometimes very high, and averaging could produce a more robust model.
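
A quick way to quantify that spread is to train the same model under several seeds and compare the test scores. A minimal sketch (load_digits is used purely for illustration, matching the examples later in the thread):

import numpy as np
import lightgbm as lgb

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train one model per seed and inspect the spread of test accuracies.
scores = [
    lgb.LGBMClassifier(random_state=seed).fit(X_train, y_train).score(X_test, y_test)
    for seed in range(10)
]

print(np.mean(scores), np.std(scores))  # a large stdev suggests averaging may help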

Do you have any datasets in mind on which it improves results?

Thanks

Y-oHr-N commented 4 years ago

Hi @flamby,

As you said, random seed averaging is possible by default.

The first way is to pass an OGBMModel to RandomSeedAveragingModel. This takes a very long time because tuning is performed once per seed.

The second way is to train an OGBMModel and then pass an LGBMModel with the best hyperparameters to RandomSeedAveragingModel. This requires only one tuning run, but does not guarantee that the best model is trained for each seed.

This is a simple example using OptGBM 0.5.0.

import lightgbm as lgb

from mllib.ensemble import RandomSeedAveragingClassifier
from optgbm.sklearn import OGBMClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. LightGBM
model = lgb.LGBMClassifier(random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.960...

# 2. LightGBM + random seed averaging
model = lgb.LGBMClassifier()
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.960...

# 3. OptGBM (fold averaging)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.977...

# 4. OptGBM (single model)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)
model.refit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.980...

# 5. OptGBM (fold averaging) + random seed averaging (tune `n_estimators` times)
model = OGBMClassifier(n_trials=20)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.984...

# 6. OptGBM (fold averaging) + random seed averaging (tune only once)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)

model = lgb.LGBMClassifier(**model.study_.best_params)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.968...
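
For reference, the core of random seed averaging is small enough to write by hand. A minimal sketch, reusing X_train/X_test from above; this mirrors the spirit of RandomSeedAveragingClassifier, not its actual implementation:

import numpy as np

# Fit one LGBMClassifier per seed and average their predicted probabilities.
models = [lgb.LGBMClassifier(random_state=seed).fit(X_train, y_train) for seed in range(10)]
proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
y_pred = models[0].classes_[proba.argmax(axis=1)]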

By the way, mllib is no longer maintained, and most of its code has been ported to pretools. I am planning to implement random seed averaging in pretools or in OGBMModel.refit.

flamby commented 4 years ago

Hi @Y-oHr-N,

Thanks for the clarification. I ran all your examples on my dataset (OptGBM 0.5.0 and mllib from the current git master branch) and indeed saw small improvements. I need to test it more extensively.

The exception was example 5, for which I got the error below:

lib/python3.7/site-packages/lightgbm/sklearn.py in set_params(self, **params)
    366             setattr(self, key, value)
    367             if hasattr(self, '_' + key):
--> 368                 setattr(self, '_' + key, value)
    369             self._other_params[key] = value
    370         return self

AttributeError: can't set attribute

I tend to rely heavily on the decision threshold to improve my classification precision/recall, thanks to the predict_proba method. If it makes sense and is doable, do you plan to implement this method in RandomSeedAveraging{Classifier,Regressor}?
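
For context, decision-threshold tuning on predict_proba in a binary setting might look like the sketch below (load_breast_cancer is used purely for illustration):

import lightgbm as lgb

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMClassifier(random_state=0).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Lowering the threshold below the default 0.5 trades precision for recall.
y_pred = (proba >= 0.3).astype(int)

print(precision_score(y_test, y_pred), recall_score(y_test, y_pred))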

Thanks, and keep up the good work!

Y-oHr-N commented 4 years ago

The exception was example 5, for which I got the error below:

I noticed the bug yesterday, fixed it immediately, and released 0.5.0. If you are really using 0.5.0, please describe your environment in detail. Example 5 works fine in my environment.

If it makes sense and is doable, do you plan to implement this method in RandomSeedAveraging{Classifier,Regressor}?

I am open to implementing it, but I cannot guarantee that it will happen soon. I would appreciate it if you could wait patiently, or feel free to send a PR.

Thank you for your feedback.

flamby commented 4 years ago

I noticed the bug yesterday, fixed it immediately, and released 0.5.0. If you are really using 0.5.0, please describe your environment in detail. Example 5 works fine in my environment.

You're right. It turned out to be a Jupyter cache issue. Silly me. Restarting the kernel fixed it.

I am open to implementing it, but I cannot guarantee that it will happen soon. I would appreciate it if you could wait patiently, or feel free to send a PR.

I've already monkey-patched it, taking inspiration from the way sklearn's VotingClassifier does it.

Here it is. I hope I did not make any mistakes.

import warnings

import numpy as np

model = lgb.LGBMClassifier()

def predict_proba(self, X):
    self._check_is_fitted()
    probas = np.asarray([e.predict_proba(X) for e in self.estimators_])
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', category=RuntimeWarning)
        # Average class probabilities across the differently seeded estimators.
        avg = np.average(probas, axis=0)
    return avg

# monkey patching
RandomSeedAveragingClassifier.predict_proba = predict_proba

model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)
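
Averaging the per-estimator class probabilities this way is the same soft-voting scheme that sklearn's VotingClassifier uses with voting='soft'.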

Y-oHr-N commented 4 years ago

Thank you for sharing your code. I implemented it in pretools and released the package to PyPI. Please try it.

flamby commented 4 years ago

Thank you very much @Y-oHr-N, I'll test it in the coming days.

Y-oHr-N commented 4 years ago

Hi @flamby,

I noticed that example 6 had a mistake. The corrected code is as follows.

# 6. OptGBM (fold averaging) + random seed averaging (tune only once)
model = OGBMClassifier(n_trials=20, random_state=0)

model.fit(X_train, y_train)

model = lgb.LGBMClassifier(n_estimators=model.best_iteration_, **model.best_params_)
model = RandomSeedAveragingClassifier(model, n_estimators=10, random_state=0)

model.fit(X_train, y_train)

score = model.score(X_test, y_test)  # acc = 0.982...
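
The fix relative to the original example 6 is that it uses the estimator's own best_params_ and passes the number of boosting rounds found during tuning (best_iteration_) explicitly, so each reseeded LGBMClassifier trains for the tuned number of iterations instead of the LightGBM default.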

flamby commented 4 years ago

Hi @Y-oHr-N,

Thanks. I finally had time to test it, and it works like a charm.