dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Group-aware CV Fold Creation for Rank Task #270

Closed tqchen closed 8 years ago

tqchen commented 9 years ago

This is important for rank-type tasks. For rank-type tasks, there is group information that groups instances into a list. In cross validation, folds should be created such that each group goes entirely into either the train set or the test set.

Currently this is not handled correctly, which means cv does not work properly for rank-type tasks. There are a few things that need to be done.

Since few users have used cv for rank tasks so far, this is not an urgent issue, but I am opening it now to see if anyone wants to take it over. This will likely involve the C wrapper part as well as the language bindings.
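A minimal sketch of the group-aware fold creation this would need, assuming per-instance query ids are available as a numpy array (the names `query_ids` and `group_kfold_indices` are hypothetical, not part of the codebase): assign whole groups to folds so that no group straddles train and test.

```python
import numpy as np

def group_kfold_indices(query_ids, nfold=5, seed=0):
    """Assign whole query groups to folds so no group straddles train/test."""
    rng = np.random.RandomState(seed)
    groups = np.unique(query_ids)
    rng.shuffle(groups)
    # Round-robin over the shuffled groups: group id -> fold id.
    fold_of_group = {g: i % nfold for i, g in enumerate(groups)}
    fold_of_row = np.array([fold_of_group[q] for q in query_ids])
    # One array of row indices per fold; each group lands entirely in one fold.
    return [np.where(fold_of_row == k)[0] for k in range(nfold)]
```

Wiring such indices into xgb.cv would still require the DMatrix slicing code to rebuild the group boundaries of each fold, which is the C-wrapper and bindings work mentioned above.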

pommedeterresautee commented 9 years ago

Is the ranking algorithm LambdaMART? In the source code it is written as LambdaRank. I would be interested in reading papers about the one implemented.

tqchen commented 9 years ago

Yes, it is LambdaMART.

tqchen commented 9 years ago

But this issue has nothing to do with the actual algorithm. It is about data partitioning for rank tasks.

ajkl commented 9 years ago

@tqchen Why is it called LambdaRank and not LambdaMART? I was a bit confused too. (I know this is not related to issue #270 :) )

tqchen commented 9 years ago

In machine learning problems, there are two things: the model and the objective function. The model determines how you make predictions, while the objective determines how the loss is measured and optimized. See also my slides on gbtree.

LambdaMART = LambdaRank (objective) applied to MART (model). MART is another name for a tree ensemble, i.e., an ensemble of regression trees.
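As a hedged illustration in XGBoost's own parameter terms (not code from the thread), the two choices are independent knobs:

```python
import xgboost as xgb  # only to show where the parameters would be used

# 'booster' picks the model (gbtree = the MART part: boosted regression trees);
# 'objective' picks the loss being optimized (here a pairwise ranking objective).
params = {
    'booster': 'gbtree',
    'objective': 'rank:pairwise',
}
# The same booster could instead optimize e.g. 'reg:linear' or 'binary:logistic'.
```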

ajkl commented 9 years ago

@tqchen yes, I have read the LambdaMART paper. Thanks for the explanation though :) My confusion came from the implementation in the R gbm package, which just calls it a pairwise objective. Your slides are pretty good btw!

tqchen commented 9 years ago

A pairwise ranking objective usually refers to the case where all positive-negative pairs have equal weight, while LambdaRank weights each pair by the absolute change in the ranking metric (|Δmetric|) caused by swapping the pair. If you read the LambdaRank paper you will find the difference.
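A toy sketch of that weighting, assuming NDCG as the metric and graded relevance labels (illustrative only, not XGBoost's actual implementation):

```python
import numpy as np

def dcg(relevance):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    ranks = np.arange(1, len(relevance) + 1)
    return np.sum((2.0 ** relevance - 1.0) / np.log2(ranks + 1))

def lambda_pair_weight(relevance, i, j):
    """|delta NDCG| from swapping positions i and j in the current ranking.

    A plain pairwise objective would give every pair weight 1;
    LambdaRank scales each pair by this quantity instead.
    """
    ideal = dcg(np.sort(relevance)[::-1])  # DCG of the ideal ordering
    swapped = relevance.copy()
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(relevance)) / ideal

# One query with graded relevance labels in its current ranked order.
rel = np.array([3.0, 0.0, 1.0, 2.0])
print(lambda_pair_weight(rel, 0, 1))  # swapping the top two documents
```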

TheEdoardo93 commented 6 years ago

@tqchen I've got a problem performing cross validation in the 'rank:pairwise' setting.

After setting up the DMatrix and calling the set_group() method (I passed a numpy.array to this method), I encountered a problem during cross validation.

Here is my Python source code:

xgdmat = xgb.DMatrix(X_training, y_training)  # Create our DMatrix to make XGBoost more efficient
xgdmat.set_group(group=groups_query_id)  # Set the query_id values on the DMatrix data structure

model_parameters = {'objective': 'rank:pairwise',
                    'seed': 0,
                    'booster': ['gbtree', 'gblinear', 'dart'],
                    'eta': [0.1, 0.2, 0.3, 0.4, 0.5],
                    'gamma': [0, 1],
                    'subsample': [0.5, 0.75, 0.9],
                    'max_depth': [3, 5],
                    'min_child_weight': 1,
                    'max_delta_step': 0,
                    'colsample_bytree': [0.5, 0.75, 0.9],
                    'colsample_bylevel': [0.5, 0.75, 0.9],
                    'lambda': 1,
                    'alpha': 0,
                    'tree_method': ['auto', 'exact', 'approx', 'hist']}

The problem occurs in the following lines:

cv_xgb = xgb.cv(params=model_parameters, dtrain=xgdmat, num_boost_round=1000, nfold=10, metrics=['auc', 'ndcg', 'map'], early_stopping_rounds=100)

print cv_xgb.tail(5)

final_gb = xgb.train(model_parameters, xgdmat, num_boost_round=500)

When I launch this program, I get the following error:

[15:43:58] dmlc-core/include/dmlc/logging.h:235: [15:43:58] src/c_api/c_api.cc:342: Check failed: (src.info.group_ptr.size()) == (0) slice does not support group structure
Traceback (most recent call last):
  File "/Users/edoardo/PycharmProjects/MasterThesisProject/extra/Prova.py", line 225, in <module>
    metodo3()
  File "/Users/edoardo/PycharmProjects/MasterThesisProject/extra/Prova.py", line 164, in metodo3
    metrics=['auc', 'ndcg', 'map'], early_stopping_rounds=100)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/training.py", line 371, in cv
    cvfolds = mknfold(dtrain, nfold, params, seed, metrics, fpreproc, stratified, folds)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/training.py", line 248, in mknfold
    dtrain = dall.slice(np.concatenate([idset[i] for i in range(nfold) if k != i]))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/core.py", line 531, in slice
    ctypes.byref(res.handle)))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/core.py", line 127, in _check_call
    raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: [15:43:58] src/c_api/c_api.cc:342: Check failed: (src.info.group_ptr.size()) == (0) slice does not support group structure

How can I solve this problem?

Thanks for your attention.

solenbanson commented 6 years ago

The same thing happened to me. Judging from my error message, it may have something to do with xgb.cv's nfold logic.

Basically, with group information a stratified nfold should take place, but how should that stratified nfold be done: with labels or with group info? Which one makes more sense? Maybe it's not clear.

Try directly using sklearn's Stratified K-Folds instead. Or just use different groups: some groups for train, some groups for test.
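A sketch of that group-respecting workaround for the slice error above, with synthetic data standing in for a real learning-to-rank dataset (the names `X`, `y`, `query_ids`, the helper `group_sizes`, and the parameter values are placeholders, not taken from the reports): split by query id before building any DMatrix, so group-aware slicing is never needed.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data: 500 rows, 10 features, 40 queries, graded labels.
rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = rng.randint(0, 4, size=500)
query_ids = np.sort(rng.randint(0, 40, size=500))  # rows grouped by query

params = {'objective': 'rank:pairwise', 'eta': 0.1, 'max_depth': 5}

def group_sizes(query_ids):
    """Block sizes for DMatrix.set_group; assumes rows are sorted by query id."""
    _, sizes = np.unique(query_ids, return_counts=True)
    return sizes

# GroupKFold keeps every query entirely inside one fold.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=query_ids):
    # Sort each fold by query id so each group stays contiguous.
    train_idx = train_idx[np.argsort(query_ids[train_idx], kind='stable')]
    test_idx = test_idx[np.argsort(query_ids[test_idx], kind='stable')]

    dtrain = xgb.DMatrix(X[train_idx], y[train_idx])
    dtrain.set_group(group_sizes(query_ids[train_idx]))
    dtest = xgb.DMatrix(X[test_idx], y[test_idx])
    dtest.set_group(group_sizes(query_ids[test_idx]))

    bst = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dtest, 'test')], verbose_eval=False)
```

Evaluating a ranking metric on each fold's dtest then stands in for what xgb.cv would have reported.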