NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0

[BUG] Getting error from XGB model when loading the model back and passing the booster arg to the constructor #651

Closed. rnyak closed this issue 2 years ago

rnyak commented 2 years ago

Bug description

I get the following error when I load the saved XGB model back and then pass the booster argument to the constructor via XGBoost(schema, booster=reloaded_model):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [10], in <cell line: 7>()
      4 model.booster.save_model('mymodel.xgb')
      5 bst.load_model('mymodel.xgb')  # load model
----> 7 XGBoost(schema, booster=bst)

File /usr/local/lib/python3.8/dist-packages/merlin/models/xgb/__init__.py:66, in XGBoost.__init__(self, schema, target_columns, qid_column, objective, booster, **params)
     64 if isinstance(target_columns, str):
     65     target_columns = [target_columns]
---> 66 self.target_columns = target_columns or get_targets(schema, target_tag)
     67 self.feature_columns = get_features(schema, self.target_columns)
     69 if objective.startswith("rank") and qid_column is None:

File /usr/local/lib/python3.8/dist-packages/merlin/models/xgb/__init__.py:248, in get_targets(schema, target_tag)
    246 if len(targets) >= 1:
    247     return targets.column_names
--> 248 raise ValueError(
    249     f"No target columns in the dataset schema with tags TARGET and {target_tag.name}"
    250 )

ValueError: No target columns in the dataset schema with tags TARGET and REGRESSION

Steps/Code to reproduce bug

You can repro the error by running the code below:


from merlin.datasets.entertainment import get_movielens
from merlin.core.utils import Distributed
from merlin.models.xgb import XGBoost
import xgboost as xgb

train, valid = get_movielens(variant='ml-100k')
# remove cols from schema
schema = train.schema.without(['title', 'rating'])
xgb_booster_params = {
    'objective':'binary:logistic',
    'tree_method':'gpu_hist',
}

xgb_train_params = {
    'num_boost_round': 100,
    'verbose_eval': 20,
    'early_stopping_rounds': 10,
}

with Distributed():
    model = XGBoost(schema, **xgb_booster_params)
    model.fit(
        train,
        evals=[(valid, 'validation_set'),],
        **xgb_train_params
    )
    metrics = model.evaluate(valid)

bst = xgb.Booster()  # initialize an empty booster
model.booster.save_model('mymodel.xgb')  # save the trained booster to disk
bst.load_model('mymodel.xgb')  # load the booster back from disk

XGBoost(schema, booster=bst)

Expected behavior

I should be able to load back the saved XGB model and do offline inference with model.predict()
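
In other words, the flow that should succeed is roughly the following (a minimal sketch reusing the objects from the repro above; the exact predict usage is illustrative):

# Expected reload-and-predict flow (currently raises the ValueError shown above).
bst = xgb.Booster()
bst.load_model('mymodel.xgb')            # booster saved earlier via model.booster.save_model

reloaded = XGBoost(schema, booster=bst)  # the call that currently fails

with Distributed():
    predictions = reloaded.predict(valid)  # offline inference on the validation Dataset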

Environment details

Additional context

rnyak commented 2 years ago

@radekosmulski fyi.

radekosmulski commented 2 years ago

It would be great to have this in 22.08, but I'm not sure that is still feasible 🙂

oliverholworthy commented 2 years ago

For this reloading to work, the booster params need to be passed through on the last line as well. Otherwise the constructor falls back to its default regression objective, so get_targets looks for a REGRESSION-tagged target column that doesn't exist in this schema, which is the ValueError above.

XGBoost(schema, booster=bst, **xgb_booster_params)
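
Putting that together with the repro above, the tail of the script becomes something like this (a sketch reusing the earlier variables; the evaluate call is just a sanity check against the original model's metrics):

bst = xgb.Booster()
model.booster.save_model('mymodel.xgb')
bst.load_model('mymodel.xgb')

# Pass the same booster params so the objective (and hence the expected target tag) matches training.
reloaded = XGBoost(schema, booster=bst, **xgb_booster_params)

with Distributed():
    reloaded_metrics = reloaded.evaluate(valid)  # expected to match `metrics` from the original model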
oliverholworthy commented 2 years ago

Closing this as resolved by the above comment, or alternatively by using the new save/load methods added in #656.
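
For reference, the #656 route would look roughly like this (a sketch assuming the new API is model.save(path) and XGBoost.load(path); check the PR for the exact signatures):

# Assumed save/load API from #656 (hypothetical signatures, verify against the PR).
model.save('xgb_model_dir')               # persist the trained model to a directory
reloaded = XGBoost.load('xgb_model_dir')  # reconstruct the model without re-passing schema/params

with Distributed():
    predictions = reloaded.predict(valid)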