Open gravesee opened 3 years ago
Right now I don't think it's possible. Could you please share the script you are running and some pseudo data that can demonstrate the issue?
In my small example below, all of the 1s in the y array are missing in the x array. This seems to have the effect of splitting on missing versus not missing. I think the desired output would be to use the non-missing values only for determining the split (subject to other constraints) and then sending the missing data down the left or right branch based on gain.
import xgboost
import numpy as np
x = np.array([np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1, 2, 3]).reshape(-1, 1)
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
d = xgboost.DMatrix(x, label=y)
params = {
    "tree_method": "exact",
    "random_state": 42,
}
# Pass params so that tree_method and random_state actually take effect
mod = xgboost.train(params, dtrain=d)
mod.get_split_value_histogram('f0')
#> SplitValue Count
#> 0 6.500001 10.0
Hi, I've just come across the same problem, but the split on missing appears to be valid in my data (example below). The missing-data split was unexpected but reasonable, as the missing data is unlike the data that is present. I would like to draw this to your attention, because preventing xgboost from splitting on missing values would have meant that I could not find this important feature of my data. Maybe it's possible to handle splits on missing values with a warning message plus an option?
value   odds ratio
0       1.02
1       1.04
2       1.01
NA      0.45
Regards, Rachel
The xgboost documentation and the papers that discuss how it works hint strongly that missing values are sent down the left or right branch along with non-missing records. However, it is clear from the split value histograms that there are many cases where xgboost sends all of the missing data down one branch and all of the non-missing data down the other.
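To make the difference concrete, here is a minimal pure-Python sketch of a single-feature split search in which missing rows are tried on both sides of every candidate threshold. It is only an illustration of the idea, not xgboost's actual implementation: the `best_split` and `leaf_score` functions, the `allow_missing_only` switch, the squared-error gradients with a starting prediction of 0.5, and `lam=1.0` are all assumptions made for this example. The gain is written without any constant factor or complexity penalty.

```python
import math


def leaf_score(g, h, lam):
    """Structure score G^2 / (H + lambda) for a leaf with gradient sum g, hessian sum h."""
    return g * g / (h + lam)


def best_split(x, grad, hess, lam=1.0, allow_missing_only=True):
    """Return (gain, threshold, missing_goes_left) for one feature, or None.

    Rows with x < threshold go left. threshold == math.inf encodes the split
    "all non-missing rows left, all missing rows right". allow_missing_only is
    a hypothetical switch for this sketch, not an xgboost parameter.
    """
    rows = list(zip(x, grad, hess))
    values = sorted({v for v in x if not math.isnan(v)})
    has_missing = any(math.isnan(v) for v in x)

    # Candidate thresholds: midpoints between distinct non-missing values.
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if allow_missing_only and has_missing and values:
        thresholds.append(math.inf)

    g_all, h_all = sum(grad), sum(hess)
    parent = leaf_score(g_all, h_all, lam)
    best = None
    for t in thresholds:
        for missing_left in (True, False):
            if math.isinf(t) and missing_left:
                continue  # every row would go left, so this is not a split
            g_left = h_left = 0.0
            for v, g, h in rows:
                goes_left = missing_left if math.isnan(v) else v < t
                if goes_left:
                    g_left += g
                    h_left += h
            gain = (leaf_score(g_left, h_left, lam)
                    + leaf_score(g_all - g_left, h_all - h_left, lam)
                    - parent)
            if best is None or gain > best[0]:
                best = (gain, t, missing_left)
    return best


# Same data as the example earlier in this thread.
nan = math.nan
x = [nan, nan, nan, nan, nan, nan, nan, 1.0, 2.0, 3.0]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
grad = [0.5 - yi for yi in y]  # squared error, assumed starting prediction 0.5
hess = [1.0] * len(y)

print(best_split(x, grad, hess))
print(best_split(x, grad, hess, allow_missing_only=False))
```

On this data the unrestricted search picks the missing-versus-non-missing candidate, because it has the highest gain. With the hypothetical switch turned off, the search falls back to a threshold between observed values and routes the missing rows by gain, which is the behaviour requested at the top of the thread.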
In our testing this has caused problems for monotonicity from an approximate feature contribution perspective. It is possible that a feature value with the maximum possible value is violating monotonicity because it is being grouped with all of the non-missing data.
Is there any way to ensure that xgboost never sends only missing data down a branch?