dmlc / XGBoost.jl

XGBoost Julia Package

nfold_cv() omitted from current documentation #139

Closed · bobaronoff closed 1 year ago

bobaronoff commented 1 year ago

I apologize if I am missing this, but I do not see anything in the current documentation regarding the nfold_cv() function (I cannot seem to find it in the src files either). Has it been removed from the package? I can't imagine building a gradient boosting model without cross validation.

Thank you.

ExpandingMan commented 1 year ago

Yes, this was removed because there are several generic implementations of cross validation, and it does not seem appropriate to maintain a separate implementation specifically for this wrapper. See, for example, MLJ.jl, or MLUtils.jl for a more minimal approach.
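To give a rough idea, a k-fold loop over this wrapper built on MLUtils.kfolds might look something like the sketch below; the data, hyperparameters, and RMSE metric are placeholders, not a recommendation:

using XGBoost, MLUtils, Statistics

X = rand(500, 10)   # rows are observations
y = rand(500)

# kfolds(n, k) returns two vectors holding the train/validation
# index sets for each of the k folds
train_idxs, val_idxs = kfolds(size(X, 1), 5)

scores = map(zip(train_idxs, val_idxs)) do (itrain, ival)
    bst = xgboost((X[itrain, :], y[itrain]); num_round=5, max_depth=6,
                  eta=0.1, objective="reg:squarederror")
    yhat = predict(bst, X[ival, :])
    sqrt(mean((yhat .- y[ival]) .^ 2))   # RMSE on the held-out fold
end

mean(scores)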

bobaronoff commented 1 year ago

Thank you so much for your prompt reply. I will definitely look into MLJ.jl and MLUtils.jl for their approaches. I am not averse to writing my own CV function.

The issue that comes up is that, invariably, some feature column has only two or three missing values, all of which land in the 'testing' fold with none in the 'training' fold. My understanding (and past experience) has been that XGBoost can only score a missing feature if that feature had missing values during training, so this would cause XGBoost to throw an exception; when there are many features, it can be quite a task to sort this out. I have never been sure of the proper way to handle this as a general rule. It does not feel correct to remove test fold rows just because a single feature has missings not present in the train folds.

I have thought of 'peppering' the offending training fold columns with a few missing values prior to training, to avoid throwing an error when predicting the test fold, roughly along the lines of the sketch below. This would require a column-wise iteration for each fold run; doable, but with an increase in execution time. I have never heard of anyone doing this, but again, tree boosting is a bit unique in allowing missing values.
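A minimal sketch of that idea, with a hypothetical pepper_missings! helper (shown only to illustrate; as it turns out below, this is unnecessary):

# overwrite n random entries in every column with `missing`, so the
# booster sees missings in each feature during training
function pepper_missings!(X::AbstractMatrix{Union{Missing,Float64}}; n::Int=2)
    for j in axes(X, 2)
        for i in rand(axes(X, 1), n)
            X[i, j] = missing
        end
    end
    return X
end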

Thank you again.

ExpandingMan commented 1 year ago

It's a little surprising to me that it can only predict on missing values if they were present in training. I assume you are using missing for missing data? The wrapper should now support this. Admittedly, I'm not completely sure what it does in those cases, but I think it just filters them out (that is, just the feature, not the entire data point).

Could you show the error that occurs? It's possible this is a bug, though, again, my understanding of how libxgboost internals handle missing values is poor.

bobaronoff commented 1 year ago

I stand corrected. I wrote a test case, and indeed you are right and I am not. It appears XGBoost assigns a default direction at each tree split, which is how it handles missing values. Wouldn't be the first time I overthought a problem - lol.

For what it's worth, here is the test code. Scoring proceeds just fine with missing values, even though there are no missing values in the training set (see p2 in the code below).

using XGBoost

# train on data with no missing values
Xtrain = rand(200, 10)
y = rand(200)

# test sets: Xtest2 gets one missing value per row in rows 1:10
Xtest = rand(20, 10)
Xtest2 = convert(Matrix{Union{Missing, Float64}}, copy(Xtest))
for i in 1:10
    Xtest2[i, i] = missing
end

b = xgboost((Xtrain, y), num_round=5, max_depth=6, eta=0.1,
            objective="reg:squarederror")

p1 = predict(b, Xtest)
p2 = predict(b, Xtest2)   # scores fine despite the missing values
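As an extra check (same setup as above), the rows left untouched should score the same either way:

p1[11:20] ≈ p2[11:20]   # rows 11:20 contain no missings, so these agree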

ExpandingMan commented 1 year ago

Glad it worked out. I'm going to close this as there doesn't seem to be any action to take here.