dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Inconsistent predict for reg:linear objective? #1274

Closed: driftwoods closed this issue 6 years ago

driftwoods commented 8 years ago

I use the demo example xgboost/demo/guide-python/basic_walkthrough.py to show the issue.

First, let's train a simple regression model.

import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'reg:linear' }

watchlist  = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)

Then we dump the model to a text file.

bst.dump_model('/tmp/dump.raw.txt')

This is the content of the model dump text file.

booster[0]:
0:[f29<-1.00136e-05] yes=1,no=2,missing=1
    1:[f56<-1.00136e-05] yes=3,no=4,missing=3
        3:leaf=0.42844
        4:leaf=-0.427938
    2:[f109<-1.00136e-05] yes=5,no=6,missing=5
        5:leaf=-0.485704
        6:leaf=0.490741
booster[1]:
0:[f60<-1.00136e-05] yes=1,no=2,missing=1
    1:[f67<-1.00136e-05] yes=3,no=4,missing=3
        3:leaf=0.0137908
        4:leaf=0.790517
    2:leaf=-0.9226

Let's examine the first ten predictions.

preds = bst.predict(dtest, ntree_limit=1)
leafs = bst.predict(dtest, ntree_limit=1, pred_leaf=True)
print(preds[0:10])
print(leafs[0:10])

Below is the output: preds gives the predicted values and leafs gives the index of the leaf node each sample falls into.

[ 0.07206208  0.9284395   0.07206208  0.07206208  0.01429605  0.9284395
  0.9284395   0.07206208  0.9284395   0.9284395 ]
[4 3 4 4 5 3 3 4 3 3]

Here is the issue: preds[0] is 0.07206208, while leafs[0] points to node 4 of booster[0], whose leaf value is -0.427938. I would expect the straight prediction and the value looked up by leaf node index to agree. Setting ntree_limit=2 or ntree_limit=0 to use all trees still gives inconsistent predictions.

I am a first-time user of xgboost, so something could be wrong with my understanding. But is it possible that something is wrong with the dump_model member function?

evgenity commented 8 years ago

+1, I have the same concerns. Rules with 1.00136e-05 seem highly suspicious to me.

[UPDATE] check http://stats.stackexchange.com/questions/193617/machine-learning-on-dummy-variables

driftwoods commented 8 years ago

Here is the cause I found: by default xgboost uses base_score=0.5, so you need to subtract base_score from the output of the predict call to get the plain prediction from the trees.

In the example, preds[0] - base_score = 0.07206208 - 0.5 = -0.4279379, which is exactly the value of node 4 of booster[0].

I tried setting base_score=0 when training, but it doesn't output the plain prediction value; I still need to subtract 0.5 from the prediction.
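
To verify this numerically, here is a minimal check continuing the walkthrough above; the 0.5 is my assumption of the default base_score:

# continuing from bst and dtest defined above
preds = bst.predict(dtest, ntree_limit=1)
leafs = bst.predict(dtest, ntree_limit=1, pred_leaf=True)

base_score = 0.5  # xgboost's default global bias
raw = preds - base_score
print(raw[0])  # -0.4279379..., matching node 4 of booster[0]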

evgenity commented 8 years ago

Oh, thanks a lot for this explanation! You've just saved my night :) This non-obvious effect probably deserves a separate issue...

evgenity commented 8 years ago

But still, how do you interpret the comparisons f29<-1.00136e-05 and f109<-1.00136e-05 for a binary feature?

[UPDATED]

You may wonder how to interpret a rule like < 1.00001 on the first line. Basically, in a sparse matrix there is no stored 0, so selecting one-hot-encoded categorical observations that satisfy the rule < 1.00001 amounts to just looking for a 1 in that feature.

khotilov commented 8 years ago

xgboost treats sparse values as "missing", so they go into the missing branch regardless of the split value.

Perhaps that should be in the FAQ...
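
To make the distinction concrete, here is a sketch of the two cases using the model from the walkthrough above. It assumes a build where explicit zeros in dense input are not treated as missing; the exact predictions depend on the trained trees, so the comments describe the branching, not guaranteed outputs.

import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# one row over the same feature space as the agaricus data
n_features = dtrain.num_col()

# sparse row: f29 is simply not stored, so it is "missing" and takes the
# missing branch of the 0:[f29<-1.00136e-05] split
sparse_row = sp.csr_matrix((1, n_features))

# dense row: f29 is an explicit 0.0 and is compared against the split value
# like any other number (0.0 < -1.00136e-05 is false, so the "no" branch)
dense_row = np.zeros((1, n_features), dtype=np.float32)

print(bst.predict(xgb.DMatrix(sparse_row), ntree_limit=1))
print(bst.predict(xgb.DMatrix(dense_row), ntree_limit=1))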

khotilov commented 8 years ago

@driftwoods :

I tried setting base_score=0 when training, but it doesn't output the plain prediction value; I still need to subtract 0.5 from the prediction.

Could you please provide an example of what you mean here? I've just tried 'base_score':0, and the predictions from the 1st tree were exactly the leaf values, as they were supposed to be.
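
For reference, a minimal sketch of that check, reusing dtrain, num_round, and watchlist from the walkthrough above. Note that the trees themselves will differ from the earlier dump, because the gradients now start from a different bias:

param = {'max_depth': 2, 'eta': 1, 'silent': 1,
         'objective': 'reg:linear', 'base_score': 0}
bst0 = xgb.train(param, dtrain, num_round, watchlist)
bst0.dump_model('/tmp/dump0.raw.txt')

# on a current build, single-tree predictions equal the leaf values in
# dump0.raw.txt directly, with no 0.5 shift
preds0 = bst0.predict(dtest, ntree_limit=1)
print(preds0[0:10])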

driftwoods commented 8 years ago

@khotilov I was using an old version of xgboost (the one installed from pip, compiled in Dec 2015). There, setting 'base_score':0 has the same effect as the default 'base_score':0.5: you still have to subtract 0.5 from the predictions to match the leaf values. This is a bug that has been fixed in the latest version, as your result shows.

Another bug I found in the pip version of xgboost is that it treats 0 as missing even in a non-sparse matrix. This is also fixed in the latest version. I strongly recommend that the maintainer of the xgboost pip package update it to the latest release.
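
Until the package is updated, one possible workaround (my assumption, not a fix confirmed against that build) is to declare the missing-value marker explicitly when constructing the DMatrix, so explicit zeros are kept as real values:

import numpy as np
import xgboost as xgb

# hypothetical dense matrix where 0.0 is a meaningful value, not "missing"
X = np.array([[0.0, 1.0],
              [1.0, 0.0]], dtype=np.float32)

# declare NaN, not 0, as the missing-value marker
dmat = xgb.DMatrix(X, missing=np.nan)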