bigdong89 / xgboostExtension

xgboost Extension for Easy Ranking & TreeFeature
Apache License 2.0

sklearn API compatibility #6

Closed FlorisHoogenboom closed 6 years ago

FlorisHoogenboom commented 6 years ago

Currently the API of the XGBRanker is not in line with the scikit-learn API: the predict method needs the groups keyword as an input, something which does not adhere to the sklearn standards.

I would like to propose making the group indicator part of the dataset fed into the fit and predict methods, e.g. as the first column. As a result, we would be able to use sklearn to do cross validation, for example (something which is currently not supported by the XGBoost module itself: https://github.com/dmlc/xgboost/issues/270). If you like this idea as well, I would be happy to open a PR for this.
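To illustrate the proposal, here is a minimal sketch of the suggested convention: the first column of X carries a group id, so sklearn utilities that slice X and y together (KFold, cross_val_score, ...) keep each sample's group label attached to its features. The column layout here is an assumption for illustration, not the project's actual API.

```python
import numpy as np

# Hypothetical convention: column 0 of X is the group id, columns 1+ are features.
X_with_groups = np.array([
    [0, 1.0, 2.0],   # group 0
    [0, 1.5, 2.5],   # group 0
    [1, 3.0, 4.0],   # group 1
    [1, 3.5, 4.5],   # group 1
])

groups = X_with_groups[:, 0].astype(int)   # recovered inside fit()
features = X_with_groups[:, 1:]            # the actual feature matrix

# Group sizes in the contiguous format xgboost's ranking objective expects
_, sizes = np.unique(groups, return_counts=True)
print(sizes.tolist())  # [2, 2]
```

Because the group id travels with the row, any row-wise train/test split produced by sklearn remains internally consistent.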

bigdong89 commented 6 years ago

You are right: it doesn't support cross validation in sklearn. In fact, making it work is not easy. Please have a try!

In addition, the XGBRanker takes all of X as a single group when group=None: `def fit(self, X, y, group=None, eval_metric=None, sample_weight=None, early_stopping_rounds=None, verbose=True)`
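The default described above can be sketched as follows; `resolve_group` is a made-up helper name used only to show the semantics, not code from the project.

```python
import numpy as np

def resolve_group(X, group=None):
    """Sketch: when group is None, treat all of X as one query group."""
    if group is None:
        return [len(X)]
    return list(group)

X = np.zeros((5, 3))
print(resolve_group(X))          # [5]      -- one group covering every row
print(resolve_group(X, [2, 3]))  # [2, 3]   -- explicit group sizes pass through
```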

FlorisHoogenboom commented 6 years ago

There are two possible solution paths I guess:

  1. We could add the group labels to the target vector y as a second dimension. In that case one could extract the groups from these labels when fit is called. For the predict method we do not actually need the groups, I believe (at least if I understand the LambdaMART model correctly).
  2. A second option would be to add the group labels to the feature matrix X, e.g. as the first column. I do think this is semantically more correct, but it might be less clear to the user that this estimator always treats the first column as the group labels.
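The two paths above can be sketched side by side. Both helper names are made up here for illustration; neither is part of the package.

```python
import numpy as np

def split_path1(y2d):
    """Path 1: y carries (relevance, group_id) as two columns."""
    return y2d[:, 0], y2d[:, 1].astype(int)

def split_path2(Xg):
    """Path 2: X carries the group id as its first column."""
    return Xg[:, 1:], Xg[:, 0].astype(int)

y2d = np.array([[1.0, 0.0], [0.0, 0.0], [2.0, 1.0]])
relevance, groups1 = split_path1(y2d)
print(groups1.tolist())  # [0, 0, 1]

Xg = np.array([[0.0, 1.5], [0.0, 2.5], [1.0, 3.5]])
features, groups2 = split_path2(Xg)
print(groups2.tolist())  # [0, 0, 1]
```

Path 1 keeps X a pure feature matrix but gives y an unconventional shape; path 2 keeps y a plain 1-d target but reserves a column of X.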

In addition (though that's maybe a concern for later), I think it would be nice to have a wrapper for other sklearn estimators so they can be used following the same paradigm. That way it also becomes possible to compare the XGBRanker with estimators that are not group-aware.
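Such a wrapper might look roughly like the sketch below: a meta-estimator (the class name is invented here) that strips a first-column group id so that any plain, non-group-aware sklearn estimator accepts the same X layout as the group-aware ranker.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LinearRegression

class GroupColumnStripper(BaseEstimator, RegressorMixin):
    """Hypothetical wrapper: drops the group-id column before delegating."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X[:, 1:], y)  # ignore column 0 (the group id)
        return self

    def predict(self, X):
        return self.estimator.predict(X[:, 1:])

# Column 0 is the group id, column 1 is the single real feature.
X = np.array([[0, 1.0], [0, 2.0], [1, 3.0]])
y = np.array([1.0, 2.0, 3.0])
model = GroupColumnStripper(LinearRegression()).fit(X, y)
print(np.round(model.predict(X), 6).tolist())  # [1.0, 2.0, 3.0]
```

With both estimators consuming the same augmented X, they can be dropped into the same sklearn cross-validation loop and compared directly.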

I've got a proposal for the second path almost ready. I'll submit a PR when I've finished it, and then we can look at whether we want to go with that solution or take the other path. The differences between the two paths are small, I believe.

bigdong89 commented 6 years ago

Wonderful idea! Your ideas are very promising!

But I'm still worried about whether it can work correctly. If we ask users to add group labels to X or y:

  [1]. Users have to reshape the original data together with the group labels before fit/predict is called.
  [2]. Inside XGBRanker, the data then has to be split back into X, y, and group labels before actually running.
  [3]. The group labels refer to the number of elements in each group, so the algorithm can only handle groups one after another and doesn't support out-of-order samples.

So there are some problems:

  1. When the dataset is large enough, steps [1] and [2] cost some time and memory.
  2. Users still have to know about and process the group labels, which is the same as with the original method. I'm still open to more discussion.

In fact, I think this wrapper is very useful too. When I started this project, I planned to create a wrapper around XGBRanker to support cross validation. In the end, I dropped the idea for the reasons above.

FlorisHoogenboom commented 6 years ago

I've opened a PR with some ideas. Basically, what happens is that the group labels are passed as the first column of the feature matrix X. Next, the function `util._preprare_data_in_groups` sorts the samples such that samples in the same group are positioned together, and calculates the group sizes. Of course, this sorting may be a bit expensive. However, in the original set-up the user had to sort the samples per group themselves before feeding them into the algorithm.
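The sorting step described above can be sketched as follows. This is an illustrative reimplementation based on the description in this thread, not the actual helper from the PR; a stable sort is used so the original within-group order is preserved.

```python
import numpy as np

def prepare_in_groups(X):
    """Sort rows so samples of a group are contiguous; return features,
    group sizes, and the permutation applied (sketch only)."""
    group_ids = X[:, 0].astype(int)
    order = np.argsort(group_ids, kind="stable")  # stable: keeps within-group order
    X_sorted = X[order]
    _, sizes = np.unique(group_ids[order], return_counts=True)
    return X_sorted[:, 1:], sizes, order

# Out-of-order input: groups 1 and 0 interleaved.
X = np.array([[1, 10.0], [0, 20.0], [1, 30.0], [0, 40.0]])
feats, sizes, order = prepare_in_groups(X)
print(sizes.tolist())  # [2, 2]
print(order.tolist())  # [1, 3, 0, 2]
```

This addresses the out-of-order concern: users may interleave groups freely, at the cost of one O(n log n) sort per call.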

I think this solution fixes your points [2] and [3]. Let me know what you think; I would understand if this is not a solution you like! Always happy to incorporate any changes.