dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Prevent xgboost from splitting only on missing values #6878

Open gravesee opened 3 years ago

gravesee commented 3 years ago

The xgboost documentation and the papers that describe how it works strongly hint that missing values are sent down the left or right branch together with non-missing records. However, the split histograms make it clear that in many cases xgboost sends all of the missing data down one branch and all of the non-missing data down the other.

In our testing, this has caused problems for monotonicity from an approximate feature contribution perspective. It is possible that a feature value with the maximum possible value is violating monotonicity because it is being grouped with all of the non-missing data.

Is there any way to ensure that xgboost never sends only missing data down a branch?

trivialfis commented 3 years ago

Right now I don't think it's possible. Could you please share the script you are running and some pseudo data that can demonstrate the issue?

gravesee commented 3 years ago

In my small example below, every row where y is 1 has a missing value in the x array. This seems to have the effect of splitting on missing versus not missing. I think the desired behavior would be to use only the non-missing values to determine the split point (subject to other constraints) and then send the missing data down the left or right branch based on gain.

import xgboost
import numpy as np

x = np.array([np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1, 2, 3]).reshape(-1, 1)
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

d = xgboost.DMatrix(x, label=y)

params = {
    "tree_method": "exact",
    "random_state": 42
}

mod = xgboost.train(params, dtrain=d)

mod.get_split_value_histogram('f0')
#>     SplitValue  Count
#> 0    6.500001   10.0
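
To make the desired behavior concrete, here is a minimal pure-Python sketch of a single split search with a learned default direction for missing values. It is an illustration under stated assumptions, not xgboost's actual implementation: the helper names `score` and `best_split` and the `allow_missing_only` flag are hypothetical (xgboost has no such option as far as this thread establishes), the gain is the usual second-order gain with L2 regularization `lam`, and the gradients assume logistic loss with an initial prediction of 0.5.

```python
import math


def score(g, h, lam=1.0):
    # Structure score of one leaf: G^2 / (H + lambda).
    return g * g / (h + lam)


def best_split(x, grad, hess, lam=1.0, allow_missing_only=True):
    """Return (gain, threshold, missing_side) for the best split on one feature.

    threshold is None for the "missing vs. non-missing" split.
    allow_missing_only is a hypothetical flag, not an xgboost parameter.
    """
    rows = list(zip(x, grad, hess))
    present = sorted(r for r in rows if not math.isnan(r[0]))
    missing = [r for r in rows if math.isnan(r[0])]

    g_miss = sum(g for _, g, _ in missing)
    h_miss = sum(h for _, _, h in missing)
    g_pres = sum(g for _, g, _ in present)
    h_pres = sum(h for _, _, h in present)
    parent = score(g_pres + g_miss, h_pres + h_miss, lam)

    candidates = []
    g_left = h_left = 0.0
    for i in range(len(present) - 1):
        value, g, h = present[i]
        g_left += g
        h_left += h
        next_value = present[i + 1][0]
        if next_value == value:
            continue  # no threshold fits between equal values
        threshold = (value + next_value) / 2
        g_right, h_right = g_pres - g_left, h_pres - h_left
        # Try both default directions for the missing rows.
        gain_right = (score(g_left, h_left, lam)
                      + score(g_right + g_miss, h_right + h_miss, lam) - parent)
        gain_left = (score(g_left + g_miss, h_left + h_miss, lam)
                     + score(g_right, h_right, lam) - parent)
        candidates.append((gain_right, threshold, "right"))
        candidates.append((gain_left, threshold, "left"))

    if allow_missing_only and present and missing:
        # All non-missing rows on one side, all missing rows on the other.
        gain = score(g_pres, h_pres, lam) + score(g_miss, h_miss, lam) - parent
        candidates.append((gain, None, "right"))

    return max(candidates, key=lambda c: c[0], default=None)


# Same data as the example above.
x = [float("nan")] * 7 + [1.0, 2.0, 3.0]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
grad = [0.5 - yi for yi in y]   # logistic loss, initial prediction 0.5
hess = [0.25] * len(y)

unrestricted = best_split(x, grad, hess)
restricted = best_split(x, grad, hess, allow_missing_only=False)
print(unrestricted)
print(restricted)
```

On this data the unrestricted search picks the missing-versus-non-missing candidate, while the restricted search has to place the threshold between two observed values and only then choose a side for the missing rows, at a lower gain.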

trivialfis commented 3 years ago

Note to myself:

rma47 commented 3 years ago

Hi, I've just come across the same behavior, but in my data the split on missing appears to be valid (example below). The missing-data split was unexpected but reasonable, because the missing data is unlike the data that is present. I would like to draw this to your attention, as preventing xgboost from splitting on missing values would have meant that I could not find this important feature of my data. Maybe it's possible to flag splits on missing values with a warning message, plus an option to disable them?

value   odds ratio
0       1.02
1       1.04
2       1.01
NA      0.45

Regards, Rachel