dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

process_type: update / updater: refresh deletes all trees from existing model #10390

Open lukezli opened 5 months ago

lukezli commented 5 months ago
[screenshot of a Python script: train a regression model, then refresh it and print the model size before and after]

I expect the above code to output a model with 100 trees in both cases. However, running this python script instead gives me:

model size before refresh 100
model size after refresh 10

Which is unexpected. This is on xgboost 2.0.3.

Am I doing something wrong / misunderstanding how the refresh parameter should work?

trivialfis commented 5 months ago

Hi, for refresh the number of boosting rounds needs to be set to match the number of boosted rounds in the existing model.

lukezli commented 5 months ago

Thanks! Can you explain the effect of eta in the context of refresh?

My base model is trained on 1 year of data. I want to add 1 additional week of data (for so-called incremental learning), but I noticed that, for my regression task:

using updater: refresh with num_boost_round matching makes my predictions go very close to zero as eta -> 0, whereas I was under the impression that as eta -> 0 my refreshed model should more closely match the un-refreshed model (instead it seems to behave as if the only data that matters is the new set of data).

Is there any way to perform incremental learning such that the bulk of my original data is still kept in the model and I only make small updates to the leaf weights based on the new data?