Closed by FlorisHoogenboom 6 years ago
You are right. It doesn't support cross validation in sklearn, and in fact making it work is not easy. Please have a try!
In addition, the XGBRanker takes all of X as a single group when group=None:
```python
def fit(self, X, y, group=None, eval_metric=None, sample_weight=None, early_stopping_rounds=None, verbose=True):
```
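To make the `group` semantics concrete, here is a plain-numpy sketch (not XGBoost's own code) of how per-query group sizes partition a stacked dataset:

```python
import numpy as np

# group=[3, 2] says rows 0-2 form one query and rows 3-4 another;
# group=None would instead treat all five rows as a single query.
X = np.arange(10).reshape(5, 2)
group = [3, 2]

boundaries = np.cumsum(group)[:-1]        # split points between queries -> [3]
per_query = np.split(X, boundaries)       # one feature block per query
print([block.shape[0] for block in per_query])  # [3, 2]
```

This is why the group sizes must match the row order of X: they are consumed as consecutive counts, not as per-row labels.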
There are two possible solution paths I guess:
In addition (but that's maybe something of later concern), I think it would be nice to have a wrapper for other sklearn estimators so they can be used following the same paradigm. This way it also becomes possible to compare the XGBoostRanker with estimators that are not group-aware.
I've got a proposal for the second path almost ready. I'll submit a PR when I've finished that and then we can have a look whether we want to go with that solution, or take the other path. The changes between both paths are small I believe.
Wonderful! Your idea is very promising!
But I'm still worried about whether it can work correctly. If we ask users to add group labels to X or y, there are some problems:

1. Users have to reshape the data, combining the original data with the group labels, before fit/predict is called.
2. Inside XGBRanker, the data has to be split back into X, y, and group labels before the real run.
3. Group labels refer to the element counts of the groups, so they can only be consumed sequentially: samples from the same group must be contiguous, and out-of-order data is not supported.
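For concreteness, a minimal numpy sketch of points [1] and [2] (variable names are illustrative, not an actual API): the user prepends a group-id column before calling fit, and the estimator splits it back off internally.

```python
import numpy as np

# Point [1] (user side, hypothetical): prepend group ids so that a single
# array survives sklearn-style index splitting.
X = np.random.rand(5, 2)
groups = np.array([0, 0, 1, 1, 1])
X_with_groups = np.hstack([groups.reshape(-1, 1), X])

# Point [2] (estimator side, hypothetical): recover labels and features.
recovered_groups = X_with_groups[:, 0].astype(int)
features = X_with_groups[:, 1:]
print(recovered_groups)  # [0 0 1 1 1]
```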
In fact, I think this wrapper would be very useful too. When I started this project, I planned to create a wrapper around XGBRanker to support cross validation, but in the end I abandoned it for the reasons above.
I've opened a PR with some ideas. Basically, group labels are passed as the first column of the feature matrix X. The function util._preprare_data_in_groups then sorts the samples so that those in the same group are positioned together, and calculates the group sizes. Of course, this sorting may be a bit expensive; however, in the original set-up the user had to sort the samples per group themselves before feeding them into the algorithm.
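As a sketch of that preprocessing step (my own reconstruction with illustrative names, not the PR's actual code), a stable sort on the group column makes equal labels contiguous, after which the group sizes fall out of a label count:

```python
import numpy as np

def prepare_data_in_groups(X):
    """Sort rows so equal group labels (column 0) are contiguous;
    return the group sizes and the remaining feature columns."""
    labels = X[:, 0]
    order = np.argsort(labels, kind="stable")   # stable keeps in-group order
    X_sorted = X[order]
    _, sizes = np.unique(X_sorted[:, 0], return_counts=True)
    return sizes, X_sorted[:, 1:]

X = np.array([[1, 10.], [0, 20.], [1, 30.], [0, 40.]])
sizes, features = prepare_data_in_groups(X)
print(sizes)  # [2 2]
```

This is how the fix removes the requirement that users pre-sort their data: any row order is accepted, at the cost of one O(n log n) sort per call.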
I think this solution fixes your points [2] and [3]. Let me know what you think; I would understand if this is not a solution you like! Always happy to incorporate any changes.
Currently the API of the XGBRanker is not in line with the scikit-learn API: the predict method needs the groups keyword as an input, something which does not adhere to the sklearn standards.
I would like to propose making the groups indicator part of the dataset fed into the fit and predict methods, e.g. as its first column. As a result, we would be able to use sklearn to do cross validation, for example (something which is currently not supported by the XGBoost module itself: https://github.com/dmlc/xgboost/issues/270). If you like this idea as well, I would be happy to open a PR for it.
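To show why the first-column convention composes with cross validation, here is a numpy-only sketch (fold splitting emulated with np.array_split; sklearn's KFold slices indices the same way): the group ids travel with X through the split, so predict needs no extra keyword.

```python
import numpy as np

# Rows 0-2 are group 0, rows 3-5 group 1, rows 6-8 group 2;
# column 0 of X carries the group id alongside the features.
rng = np.random.default_rng(0)
X = np.hstack([np.repeat([0, 1, 2], 3).reshape(-1, 1), rng.random((9, 2))])

train_groups = []
for test_idx in np.array_split(np.arange(len(X)), 3):     # 3 "folds"
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    # inside fit, the estimator would recover the group ids like this:
    train_groups.append(sorted(set(X[train_idx, 0].astype(int).tolist())))
print(train_groups)  # [[1, 2], [0, 2], [0, 1]]
```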