dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

XGBoostClassifier vectorized output #2038

Closed Meemaw closed 6 years ago

Meemaw commented 7 years ago

I was wondering whether it is possible to get vectors from the XGBoostClassifier estimators. For example, if we have trained a classifier with 3 estimators and each has 4 leaf nodes, then each estimator would produce a vector of size 4 for a given record:

[0 0 1 0] + [0 1 0 0] + [1 0 0 1]

where 1 marks the leaf node that the training record falls into.

Gunthard commented 7 years ago

Hi, you can get the index of the leaf node that each tree used for a prediction assigns a record to. Just call the predict method of the booster object (http://xgboost.readthedocs.io/en/latest/python/python_api.html, class xgboost.Booster) with pred_leaf=True. The result has one row per record and one column per tree.

from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier, DMatrix

# Load the features and labels in a single call
X, y = load_breast_cancer(return_X_y=True)

clf = XGBClassifier(n_estimators=3, max_depth=2)
clf.fit(X, y)

# get_booster() replaces the older clf.booster() accessor
booster = clf.get_booster()
print(booster.predict(DMatrix(X[:5]), pred_leaf=True))
>> array([[5, 6, 6],
>>           [6, 6, 6],
>>           [6, 6, 6],
>>           [4, 4, 6],
>>           [5, 6, 6]], dtype=int32)
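The leaf indices above are node ids, not yet the 0/1 vectors asked about in the question. A minimal sketch of the conversion follows, in plain Python so it runs without extra dependencies. The helper name `one_hot_leaves` and the `leaf_ids_per_tree` argument are hypothetical, and the leaf ids 3 to 6 per tree are taken from the depth-2 trees in this example rather than from any general rule.

```python
def one_hot_leaves(leaf_indices, leaf_ids_per_tree=None):
    """Turn a rows-by-trees table of leaf node ids into concatenated one-hot rows."""
    n_trees = len(leaf_indices[0])
    if leaf_ids_per_tree is None:
        # Assumption: if no leaf ids are given, only leaves seen in the input get a column.
        leaf_ids_per_tree = [
            sorted({row[t] for row in leaf_indices}) for t in range(n_trees)
        ]
    # For each tree, map a leaf node id to its offset inside that tree's block.
    position = [
        {leaf: i for i, leaf in enumerate(ids)} for ids in leaf_ids_per_tree
    ]
    encoded = []
    for row in leaf_indices:
        vector = []
        for t, leaf in enumerate(row):
            block = [0] * len(leaf_ids_per_tree[t])
            block[position[t][leaf]] = 1
            vector.extend(block)
        encoded.append(vector)
    return encoded


# Leaf indices copied from the pred_leaf=True output above.
leaf_indices = [[5, 6, 6], [6, 6, 6], [6, 6, 6], [4, 4, 6], [5, 6, 6]]
# Each of the three depth-2 trees in this example has leaf nodes 3, 4, 5 and 6.
all_leaves = [[3, 4, 5, 6]] * 3

vectors = one_hot_leaves(leaf_indices, all_leaves)
print(vectors[0])  # [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]
```

With a real booster, the array returned by predict can be passed in after calling `.tolist()` on it. Each row then contains exactly one 1 per tree, which is the vector layout described in the question.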

If you are interested in the leaf node value as well you can use get_dump() and parse the output:

import re
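# Match "<node id>:leaf=<value>"; the value may be signed, integral or in exponent notation
reg_ex = r"(\d+):leaf=([-+]?(?:\d*\.\d+|\d+)(?:[eE][-+]?\d+)?)"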
print(['Tree {}: {}'.format(i, re.findall(reg_ex, tree_str)) for i, tree_str in enumerate(booster.get_dump())])
>> ["Tree 0: [('3', '0.191691'), ('4', '-0.04'), ('5', '0.00952381'), ('6', '-0.19096')]",
>> "Tree 1: [('3', '0.165413'), ('4', '-0.130044'), ('5', '-0.00545294'), ('6', '-0.176956')]",
>> "Tree 2: [('3', '0.154973'), ('4', '-0.106707'), ('5', '0.0589427'), ('6', '-0.158889')]"]
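To combine the two outputs, the parsed (node id, value) pairs can be turned into one lookup table per tree and indexed with the leaf ids from pred_leaf=True. This is only a sketch: the variable names are hypothetical and the numbers are copied from the outputs above. The per-row sum is the trees' contribution to the raw margin and does not include any base score offset.

```python
# (node id, leaf value) pairs copied from the get_dump() parsing output above.
parsed = [
    [('3', '0.191691'), ('4', '-0.04'), ('5', '0.00952381'), ('6', '-0.19096')],
    [('3', '0.165413'), ('4', '-0.130044'), ('5', '-0.00545294'), ('6', '-0.176956')],
    [('3', '0.154973'), ('4', '-0.106707'), ('5', '0.0589427'), ('6', '-0.158889')],
]
# Leaf indices copied from the pred_leaf=True output above.
leaf_indices = [[5, 6, 6], [6, 6, 6], [6, 6, 6], [4, 4, 6], [5, 6, 6]]

# One dict per tree: leaf node id -> leaf value.
leaf_values = [{int(node): float(value) for node, value in tree} for tree in parsed]

# Replace every leaf id with the value stored in that leaf of that tree.
per_tree = [
    [leaf_values[t][leaf] for t, leaf in enumerate(row)] for row in leaf_indices
]

# Sum over trees: the ensemble's raw score for each record, without the base score.
tree_sums = [sum(row) for row in per_tree]

for row, total in zip(per_tree, tree_sums):
    print(row, round(total, 6))
```

The first record, for example, lands in leaf 5 of tree 0 and leaf 6 of trees 1 and 2, so its row is built from 0.00952381, -0.176956 and -0.158889.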