Closed tqchen closed 8 years ago
Is the ranking algorithm LambdaMART? In the source code it is written as LambdaRank. I would be interested in reading papers about the one implemented.
Yes, it is LambdaMART.
But this issue has nothing to do with the actual algorithm. It is about data partitioning for ranking tasks.
@tqchen Why is it called LambdaRank and not LambdaMART ? I was a bit confused too. (I know this is not related to issue #270 :) )
In machine learning problems there are two parts: the model and the objective function. The model determines how you make predictions, while the objective determines how you measure the loss and optimize it. See also my slides on gbtree.
LambdaMART = LambdaRank (objective) applied to MART (model). MART is another name for a tree ensemble, i.e. a regression tree ensemble.
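To make that split concrete, here is a minimal sketch: in xgboost the model and the objective are chosen by separate parameters, so the same tree model can be paired with different objectives (the parameter names below are xgboost's standard `booster` and `objective` keys; the dicts themselves are just an illustration).

```python
# Same tree model (MART / gbtree) under two different objectives:
# 'booster' picks the model, 'objective' picks the loss being optimized.
reg_params = {'booster': 'gbtree', 'objective': 'reg:linear'}      # regression
rank_params = {'booster': 'gbtree', 'objective': 'rank:pairwise'}  # ranking

# Only the objective differs; the underlying model family is identical.
print(reg_params['booster'] == rank_params['booster'])  # True
```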
@tqchen yes, I have read the LambdaMART paper. Thanks for the explanation though :) My confusion came from the implementation in the R gbm package, which just calls it a pairwise objective. Your slides are pretty good btw!
A pairwise ranking objective usually refers to the case where all positive-negative pairs have equal weight, while LambdaRank weights each pair by abs(delta metric), the absolute change in the target metric from swapping the pair. If you read the LambdaRank paper you will find the difference.
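As a rough illustration of that difference, here is my own sketch of the |ΔNDCG| pair weighting (not xgboost's actual implementation): a plain pairwise objective would give every pair weight 1, while the LambdaRank-style weight below depends on how much NDCG changes when the two documents swap positions.

```python
import math

def dcg(gains):
    # Discounted cumulative gain for relevance gains listed in ranked order.
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def pair_weight_lambdarank(gains, i, j):
    # LambdaRank-style weight for the (i, j) pair: |delta NDCG| from swapping
    # positions i and j. A plain pairwise objective would use 1 here instead.
    ideal = dcg(sorted(gains, reverse=True))
    swapped = list(gains)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(gains) - dcg(swapped)) / ideal

gains = [3, 2, 0, 0, 1]
# Swapping a highly relevant doc at the top with an irrelevant one lower down
# changes NDCG a lot, so that pair gets a much larger weight than a swap of
# two near-equal docs at the tail of the list.
print(pair_weight_lambdarank(gains, 0, 2))  # large
print(pair_weight_lambdarank(gains, 3, 4))  # small
```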
@tqchen I've got a problem performing cross validation with the 'rank:pairwise' objective.
After setting up the DMatrix and calling its set_group() method (I passed a numpy.array to it), I ran into a problem during cross validation.
Here is my Python source code:
```python
xgdmat = xgb.DMatrix(X_training, y_training)  # Create our DMatrix to make XGBoost more efficient
xgdmat.set_group(group=groups_query_id)       # Set the query_id values on the DMatrix data structure

model_parameters = {'objective': 'rank:pairwise', 'seed': 0,
                    'booster': ['gbtree', 'gblinear', 'dart'],
                    'eta': [0.1, 0.2, 0.3, 0.4, 0.5],
                    'gamma': [0, 1],
                    'subsample': [0.5, 0.75, 0.9],
                    'max_depth': [3, 5],
                    'min_child_weight': 1,
                    'max_delta_step': 0,
                    'colsample_bytree': [0.5, 0.75, 0.9],
                    'colsample_bylevel': [0.5, 0.75, 0.9],
                    'lambda': 1,
                    'alpha': 0,
                    'tree_method': ['auto', 'exact', 'approx', 'hist']}

cv_xgb = xgb.cv(params=model_parameters, dtrain=xgdmat, num_boost_round=1000,
                nfold=10, metrics=['auc', 'ndcg', 'map'], early_stopping_rounds=100)
print cv_xgb.tail(5)

final_gb = xgb.train(model_parameters, xgdmat, num_boost_round=500)
```

When I launch this program, I get this error:
```
[15:43:58] dmlc-core/include/dmlc/logging.h:235: [15:43:58] src/c_api/c_api.cc:342: Check failed: (src.info.group_ptr.size()) == (0) slice does not support group structure
Traceback (most recent call last):
  File "/Users/edoardo/PycharmProjects/MasterThesisProject/extra/Prova.py", line 225, in
    metodo3()
  File "/Users/edoardo/PycharmProjects/MasterThesisProject/extra/Prova.py", line 164, in metodo3
    metrics=['auc', 'ndcg', 'map'], early_stopping_rounds=100)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/training.py", line 371, in cv
    cvfolds = mknfold(dtrain, nfold, params, seed, metrics, fpreproc, stratified, folds)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/training.py", line 248, in mknfold
    dtrain = dall.slice(np.concatenate([idset[i] for i in range(nfold) if k != i]))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/core.py", line 531, in slice
    ctypes.byref(res.handle)))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/core.py", line 127, in _check_call
    raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: [15:43:58] src/c_api/c_api.cc:342: Check failed: (src.info.group_ptr.size()) == (0) slice does not support group structure
```
How can I solve this problem?
Thanks for your attention.
The same thing happened to me. According to my error message, it may have something to do with xgb.cv's nfold handling.
Basically, with group information a stratified nfold split should take place, but how should it be stratified: by labels or by group info? Which one makes more sense? Maybe it's not clear.
Try using sklearn's Stratified K-Folds directly instead. Or just use different groups: some groups for train, some groups for test.
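A minimal sketch of that group-per-fold idea (the helpers below are hypothetical, not part of xgboost's API): since `xgb.cv` cannot slice a DMatrix that carries group information, you can assign whole query groups to folds yourself and then build one DMatrix per fold, passing `group_sizes(...)` to `set_group` for each one.

```python
import random

def group_folds(qid, nfold, seed=0):
    """Assign each query group (not each row) to a fold, so that no
    group is ever split between train and test."""
    groups = sorted(set(qid))
    order = list(range(len(groups)))
    random.Random(seed).shuffle(order)
    fold_of = {g: order[i] % nfold for i, g in enumerate(groups)}
    return [fold_of[q] for q in qid]

def group_sizes(qid):
    """Consecutive group sizes in row order, the format set_group expects.
    Assumes rows are sorted so each query's rows are contiguous."""
    sizes = []
    prev = object()  # sentinel that matches no query id
    for q in qid:
        if q == prev:
            sizes[-1] += 1
        else:
            sizes.append(1)
            prev = q
    return sizes

qid = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]  # toy per-row query ids
folds = group_folds(qid, nfold=2)
# Every row of a given query lands in the same fold:
for q in set(qid):
    assert len({f for f, g in zip(folds, qid) if g == q}) == 1
print(group_sizes(qid))  # [3, 2, 4, 1]
```

With real data you would select rows where `folds[i] != k` as the training part of fold k, build `xgb.DMatrix` from those rows, and call `set_group(group_sizes(...))` on the fold's own query ids before training.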
This is important for ranking tasks. For ranking tasks there is group information that groups instances into a list. In cross validation, folds should be created such that each group goes entirely into either train or test.
Currently this is not handled correctly, which means cv does not work properly for ranking tasks.
Since few users use cv for ranking tasks so far, this is not an urgent issue, but I am opening it for now to see if anyone wants to take it over. It will likely involve changes to the C wrapper part as well as the language bindings.