catboost / catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
https://catboost.ai
Apache License 2.0

Ranking Mode - mixing absolute ground truth and pair labels #841

Closed bduclaux closed 5 years ago

bduclaux commented 5 years ago

Hello,

I'm using catboost for a learning to rank model inside a specialized search engine. We have defined a ground truth training set, with absolute rankings from 1 (bad) to 5 (highly relevant). Catboost is able to produce a very good ranking model by auto-generating pairs based on our training set.

Now, we would like to incorporate user signals to enhance our rankings based on user feedback. Such feedback is pair-based. For instance, a user preferred the result at position 4 over the results at positions 1 to 3, so the training data actually consists of three pairs: (4,1), (4,2), (4,3).

I would like to know if it is possible to mix, in CatBoost, absolute datapoints (such as 1:bad up to 5:highly_relevant for a specific document) and pair-based datapoints (such as the pairs (4,1), (4,2), etc.).

If not, what is the right approach to integrate user signals based on pairwise ranking into CatBoost when an absolute ground truth is already available?

I was thinking about writing a pair generator that converts the absolute rankings into pairs and feeds them to CatBoost in addition to the user pair-based datapoints, but there might be a better approach.
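The pair generator mentioned above could look something like this: for every query group, emit a (winner, loser) pair for each couple of documents with different absolute labels. This is an illustrative sketch, not CatBoost API; the function name and data layout are assumptions.

```python
# Hypothetical sketch: convert absolute relevance labels (1..5) into
# preference pairs within one query group, so they can be fed to CatBoost
# alongside the user-feedback pairs.
from itertools import combinations

def labels_to_pairs(doc_ids, labels):
    """Return (winner, loser) pairs for every couple of docs with different labels."""
    pairs = []
    for (id_a, lab_a), (id_b, lab_b) in combinations(zip(doc_ids, labels), 2):
        if lab_a > lab_b:
            pairs.append((id_a, id_b))
        elif lab_b > lab_a:
            pairs.append((id_b, id_a))
    return pairs

# One query group: doc 0 is highly relevant (5), doc 1 is mediocre (3), doc 2 is bad (1)
print(labels_to_pairs([0, 1, 2], [5, 3, 1]))
# [(0, 1), (0, 2), (1, 2)]
```

Note that the number of generated pairs grows quadratically with the group size, which matters for the dataset sizes discussed later in this thread.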

Thanks for your help, and congrats for making such an amazing ML engine !

annaveronika commented 5 years ago

There is currently no mode that allows simultaneous optimization of a pairwise logloss and a ranking loss. What you can do is train one pairwise model and one ranking model, normalize their predictions, and take a weighted sum of the two models. Another good option is the one that you've mentioned: generate pairs automatically based on your labels.
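The two-model blend described above could be sketched as follows. This is a minimal illustration under assumptions: min-max normalization and 0.7/0.3 coefficients are placeholders to be tuned on a validation set, and the score arrays stand in for the raw predictions of the two trained models.

```python
# Hedged sketch of blending a ranking model and a pairwise model:
# normalize each model's raw scores on the same documents to [0, 1],
# then combine them with chosen coefficients.
import numpy as np

def normalize(scores):
    """Min-max scale scores to [0, 1]; constant scores map to zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span else np.zeros_like(scores)

ranking_scores = np.array([2.1, -0.4, 1.3])   # e.g. raw scores from a ranking model
pairwise_scores = np.array([0.9, 0.1, 0.5])   # e.g. raw scores from a pairwise model

# Coefficients 0.7 / 0.3 are illustrative; tune them on held-out data.
blended = 0.7 * normalize(ranking_scores) + 0.3 * normalize(pairwise_scores)
print(blended.argsort()[::-1])  # final ranking: best document first
```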

Note that if you are doing ranking and you have categorical features, then it is better to provide the label values when you have them. So even if you train a pairwise model, it is still a good idea to provide the label values.

annaveronika commented 5 years ago

As for your initial request, mixing the losses: we might do this at some point, thank you for your suggestion!

bduclaux commented 5 years ago

Thanks a lot for your answer, very clear.

So if I choose the approach of pre-generating all pairs, is it possible to assign a lower weight to the pairs coming from users than to the ones coming from our ground-truth labels (like 0.1 for user pairs and 0.9 for ground-truth pairs)?

In case we end up with hundreds of millions of pairs, would you recommend a sampling approach to accelerate computation (i.e. feed only, let's say, 5% of the pairs)? We use GPUs, but I'm afraid to feed 500M pair records to CatBoost :-)

annaveronika commented 5 years ago

It is very important that you use the GroupId column, because parallelisation in CatBoost relies on it. So put all the objects that form pairs with each other in a single group. We will automate this at some point, but it's not done yet.

Yes, it is possible to use pair weights. If you are using the cmdline version, you can set a third column for weights (https://catboost.ai/docs/concepts/input-data_pairs-description.html).
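Based on the pairs-description format linked above, a weighted pairs file for the cmdline version might look like the fragment below: tab-separated winner index, loser index, and weight, where the indices refer to rows of the learn dataset. The 0.9/0.1 split mirrors the ground-truth vs. user-pair weighting suggested earlier; the exact row indices here are made up for illustration.

```
0	3	0.9
1	3	0.9
4	2	0.1
```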

I suggest you first try without much sampling; you can also use several GPUs in one machine if one is not enough. Also, if you run out of memory, you can switch from the more powerful PairLogitPairwise to PairLogit. If it still does not fit into memory, you will have to sample, but I don't have good advice for the sampling strategy.
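Absent a better strategy, the simplest baseline for the sampling discussed above is uniform random sampling of pairs within each group. A minimal, hypothetical sketch (the 5% rate mirrors the figure mentioned earlier in the thread; the data layout is an assumption):

```python
# Uniform per-group pair sampling: keep each pair independently with
# probability `rate`, so every group is thinned by roughly the same factor.
import random

def sample_pairs(pairs_by_group, rate=0.05, seed=42):
    """pairs_by_group: {group_id: [(winner, loser), ...]} -> same shape, sampled."""
    rng = random.Random(seed)
    return {
        gid: [p for p in pairs if rng.random() < rate]
        for gid, pairs in pairs_by_group.items()
    }

# Toy data: 3 groups of 100 docs each, all ordered pairs of distinct docs
groups = {g: [(i, j) for i in range(100) for j in range(100) if i != j]
          for g in range(3)}
sampled = sample_pairs(groups)
print(sum(len(v) for v in sampled.values()))  # roughly 5% of the 29,700 pairs
```

A fixed seed keeps the sampled dataset reproducible across training runs.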

bduclaux commented 5 years ago

Thanks a lot, all clear! I will keep you posted if we run into issues. We use the command line and dedicated processing pipelines; Python is too slow for our applications :-)