dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Specify list of iteration/tree indices to use in prediction #8699

Closed: LudvicLaberge closed this issue 1 year ago

LudvicLaberge commented 1 year ago

Looking at the predict documentation, I saw the iteration_range functionality. I was wondering if it is possible to pass a list of indices to use in prediction instead of a range. Say, for example, I don't want trees [3, 5, 15, …] to affect the prediction, but I do want [1, 2, 4, …] in it…

More details on my use case: there are features I want in training but not at inference. So I've set interaction constraints so that these features don't interact with the ones I do want at inference. Using trees_to_dataframe, I'm able to extract the list of indices of the trees that use the features I don't want affecting predictions at inference. Now I'd like a way to predict without those trees…
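A minimal sketch of that setup (the feature split, parameter values, and dtrain are hypothetical; interaction_constraints, trees_to_dataframe, and num_boosted_rounds are real XGBoost APIs):

```python
import xgboost as xgb

# Hypothetical split: f0-f3 are wanted at inference, f4-f5 are
# training-only. Disjoint constraint groups keep the two sets from
# ever appearing in the same tree path.
params = {
    "objective": "reg:squarederror",
    "interaction_constraints": "[[0, 1, 2, 3], [4, 5]]",
}
booster = xgb.train(params, dtrain, num_boost_round=199)  # dtrain: assumed DMatrix

# Trees that split on a training-only feature should not affect inference.
df = booster.trees_to_dataframe()
train_only = {"f4", "f5"}  # default feature names; adjust to your data
trees_to_remove = sorted(df.loc[df["Feature"].isin(train_only), "Tree"].unique())
trees_to_keep = [
    t for t in range(booster.num_boosted_rounds()) if t not in trees_to_remove
]
```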

I tried editing the JSON of the booster to remove them manually, but I can't seem to make it work. Is there any other way?

https://discuss.xgboost.ai/t/specify-list-of-iteration-tree-indices-to-use-in-prediction/3033

trivialfis commented 1 year ago

Hi, if you are using the Python Booster object, you can use the slice operator (https://xgboost.readthedocs.io/en/stable/python/model.html) along with base_margin in DMatrix to achieve what you want. I can work on an example later if necessary.
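For reference, a minimal sketch of the slice syntax from the linked doc (booster is assumed to be an already-trained Booster and X the inference data):

```python
import xgboost as xgb

# Slicing a Booster by iteration returns a new Booster holding only
# those trees; standard start:stop:step semantics apply.
first_fifty = booster[:50]
every_other = booster[::2]

preds = first_fifty.predict(xgb.DMatrix(X))
```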

LudvicLaberge commented 1 year ago

Thanks for the quick reply! I'll give it a shot, thanks!

trivialfis commented 1 year ago

These tests might be helpful: https://github.com/dmlc/xgboost/blob/e49e0998c0fd9f144f113d3eba2c43b7b951335a/tests/python/test_basic_models.py#L519

LudvicLaberge commented 1 year ago

So I gave it a shot, @trivialfis:

```python
>>> print(str(trees_to_remove))
[15, 21, 25, 27, 31, 34, 37, 39, 45, 48, 50, 54, 59, 61, 69, 73, 77, 83, 100, 109, 112, 114]

>>> print(str(trees_to_keep))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 22, 23, 24, 26, 28, 29, 30, 32, 33, 35, 36, 38, 40, 41, 42, 43, 44, 46, 47, 49, 51, 52, 53, 55, 56, 57, 58, 60, 62, 63, 64, 65, 66, 67, 68, 70, 71, 72, 74, 75, 76, 78, 79, 80, 81, 82, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 102, 103, 104, 105, 106, 107, 108, 110, 111, 113, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198]

>>> booster[trees_to_keep]
TypeError: Expecting <class 'int'> or <class 'slice'>. Got <class 'list'>
```

Turns out I can't really single out trees with slicing, since the indices I want to keep don't form a contiguous range or a regular step…

So I read the doc provided and saw that I can slice out the individual trees:

```python
>>> trees = [booster[e] for e in trees_to_keep]
>>> trees
[<xgboost.core.Booster at 0x7efd8ec349d0>,
 <xgboost.core.Booster at 0x7efd8ec34070>,
 <xgboost.core.Booster at 0x7efd8ec34640>,
 <xgboost.core.Booster at 0x7efd8ec34790>,
 ...]
```

But I was wondering: is there a way to recombine these trees into a single booster object?

trivialfis commented 1 year ago

Combining models is not supported. After slicing the booster, one has to sum the predictions manually with the help of base_margin. For regression objectives like squarederror, where the leaf value is the output prediction, the tests referenced in https://github.com/dmlc/xgboost/issues/8699#issuecomment-1397497819 provide an example of how to merge the predictions.
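A hedged sketch of that merging step for reg:squarederror (variable names carry over from the comments above; the config path used to read base_score is an assumption based on the save_config layout, not quoted from the linked tests):

```python
import json

import numpy as np
import xgboost as xgb

dmat = xgb.DMatrix(X)  # X: assumed inference data

# Each single-tree slice predicts base_score + that tree's leaf values,
# so the repeated base_score must be removed from the summed margins.
config = json.loads(booster.save_config())
base_score = float(config["learner"]["learner_model_param"]["base_score"])

per_tree = [booster[i].predict(dmat, output_margin=True) for i in trees_to_keep]
merged = np.sum(per_tree, axis=0) - (len(per_tree) - 1) * base_score
```

For squarederror the margin is the final prediction, so merged is the prediction of the kept trees alone.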

LudvicLaberge commented 1 year ago

@trivialfis is there any way this could become a feature request, i.e. also being able to index the booster with a list? I am worried that predicting one tree at a time and passing the result as base_margin to the next would slow down inference in production.

trivialfis commented 1 year ago

Feel free to open a new issue. We thought about having a concat method but dropped the idea, as most of the use cases involve users feeding predictions from other models (like a linear model or an NN) into XGBoost rather than joining multiple boosters.

I'm not sure about the inference performance; you can try inplace_predict and see if it's efficient enough. Since you are deploying the model for inference, there's no need to cache the prediction.
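A one-line sketch of that suggestion (booster and X are assumed from the earlier snippets):

```python
# inplace_predict skips DMatrix construction, which can help latency
# in a serving path; X may be a numpy array or pandas DataFrame.
preds = booster.inplace_predict(X)
```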

LudvicLaberge commented 1 year ago

@trivialfis we can close this one, as I've opened the feature request you suggested: https://github.com/dmlc/xgboost/issues/8709