+1, I have the same concerns. Rules with `1.00136e-05` seem highly suspicious to me.
[UPDATE] Check http://stats.stackexchange.com/questions/193617/machine-learning-on-dummy-variables
Here is the cause I found. By default xgboost uses `base_score=0.5`, so the output of a `predict` call needs `base_score` subtracted from it to get the plain prediction from xgboost. In the example I have `preds[0] - base_score = 0.07206208 - 0.5 = -0.4279379`, which is exactly the 4th node value of `booster[0]`.

I tried setting `base_score=0` when training, but it doesn't output the plain prediction value either; I still need to subtract 0.5 from the prediction.
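A sketch of that arithmetic, using the `bst` and `dtest` names from the example below (`pred_leaf=True` is the option that returns leaf indices):

```python
base_score = 0.5  # xgboost's default

preds = bst.predict(dtest)                  # values shifted by base_score
leafs = bst.predict(dtest, pred_leaf=True)  # leaf index per tree, per row

# For the first test row in the run discussed here:
# preds[0] - base_score = 0.07206208 - 0.5 = -0.4279379,
# i.e. the value of the booster[0] leaf that leafs[0] points to.
print(preds[0] - base_score)
print(leafs[0])
```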
Oh, great, thanks for this explanation! You've just saved my night) This non-obvious effect probably deserves a separate issue...
But still, how do you interpret the comparisons `f29<-1.00136e-05` and `f109<-1.00136e-05` for a binary feature?
[UPDATED]
You may wonder how to interpret `< 1.00001` on the first line. Basically, in a sparse Matrix there is no 0, so looking for one-hot-encoded categorical observations that validate the rule `< 1.00001` is just looking for 1 for this feature.
xgboost treats sparse values as "missing", so they go into the missing branch regardless of the split value.
Perhaps that should be in the FAQ...
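A toy sketch of the effect (the data and parameter values here are made up for illustration):

```python
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# Made-up data: the zeros are explicit values in the dense copy but
# become absent ("missing") entries in the sparse copy.
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [0.0, 0.0]], dtype=np.float32)
y = np.array([0.0, 1.0, 1.0, 0.0])

bst = xgb.train({'max_depth': 2, 'eta': 1, 'objective': 'reg:linear'},
                xgb.DMatrix(X, label=y), num_boost_round=2)

# Same rows, two representations. Wherever a split's missing branch
# differs from the branch an actual 0.0 would take, these can diverge.
print(bst.predict(xgb.DMatrix(X)))
print(bst.predict(xgb.DMatrix(sp.csr_matrix(X))))
```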
@driftwoods:

> I tried setting `base_score=0` when training, but it doesn't output the plain prediction value either; I still need to subtract 0.5 from the prediction.

Could you please provide an example of what you mean here? I've just tried `'base_score': 0`, and the predictions from the 1st tree were exactly the leaf values, as they were supposed to be.
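Roughly what I ran (a sketch; `dtrain` and `dtest` are the walkthrough's data, and the other parameter values are placeholders):

```python
param = {'max_depth': 2, 'eta': 1, 'objective': 'reg:linear', 'base_score': 0}
bst = xgb.train(param, dtrain, num_boost_round=2)

# With base_score=0, the prediction from the 1st tree alone should be
# exactly the leaf value that pred_leaf points to.
print(bst.predict(dtest, ntree_limit=1)[0])
print(bst.predict(dtest, ntree_limit=1, pred_leaf=True)[0])
```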
@khotilov
I was using an old version of xgboost (the one installed from pip, compiled in Dec 2015). There, setting `'base_score': 0` has the same effect as the default `'base_score': 0.5`: you still have to subtract 0.5 from the prediction to get the leaf value. This is a bug that has been fixed in the latest version, as your result shows.

Another bug I found in the pip version of xgboost is that it treats 0 as missing even in a non-sparse matrix. This is also fixed in the latest version. I strongly recommend that the maintainer of the xgboost pip package update it to the latest version.
I use the demo example `xgboost/demo/guide-python/basic_walkthrough.py` to show the issue. First, let's train a simple regression model.
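Along these lines (a sketch based on the walkthrough's data files, with the objective switched to regression; the exact parameter values are assumptions):

```python
import xgboost as xgb

# Data files shipped with the demo; paths are relative to the demo folder.
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'reg:linear'}
bst = xgb.train(param, dtrain, num_boost_round=2)
```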
Then we dump the model to a text file.
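Via something like (the file name is a placeholder):

```python
bst.dump_model('dump.raw.txt')  # plain-text dump, one booster[i] block per tree
```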
This is the content of the model dump text file.
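Only a sketch of its shape here: the `f29`/`f109` splits and the -0.427938 leaf are the values quoted in this thread, and everything else is a placeholder.

```
booster[0]:
0:[f29<-1.00136e-05] yes=1,no=2,missing=1
	1:[f109<-1.00136e-05] yes=3,no=4,missing=3
		3:leaf=...
		4:leaf=-0.427938
	2:leaf=...
booster[1]:
...
```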
Let's examine the first ten predictions.
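The calls are presumably along these lines (a sketch; `pred_leaf=True` returns leaf indices instead of values):

```python
preds = bst.predict(dtest)                  # plain predicted values
leafs = bst.predict(dtest, pred_leaf=True)  # leaf index per tree, per row
print(preds[:10])
print(leafs[:10])
```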
In the output, `preds` gives the plain regression values and `leafs` gives the leaf node numbers. Here is the issue: for `preds[0]` the predicted value is 0.07206208, while `leafs[0]` points to the 4th node of `booster[0]`, whose value is -0.427938. I would expect to get the same value from the straight prediction as from the leaf node it maps to. Setting `ntree_limit = 2` or `ntree_limit = 0` to use all trees still gives inconsistent predictions.

I am a first-time user of xgboost, so something could be wrong with my understanding. But is it possible that something is wrong with the `dump_model` member function?