NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0
751 stars 113 forks source link

[RMP] Statistical Candidate Generation based on CoVisitation for two-stage recsys #878

Open gabrielspmoreira opened 1 year ago

gabrielspmoreira commented 1 year ago

Problem:

When the item catalog is very large (e.g. tens or hundreds of millions), companies typically implement a two-stage recsys pipeline, where (1) candidate generation models provide a number of candidate items for a given user or session and (2) a ranking model with more powerful features and architecture performs the final ranking. The candidate generation models can be based on heuristics (e.g. most recent items), statistics (e.g. most popular, most co-visited items per session or per user) or machine learning (e.g. word2vec, MF, Two-Tower, YouTubeDNN), among others. This RMP addresses the CoVisitation candidate generation algorithm, which has been very successful in the last competitions (Dressipi RecSys Challenge, Otto Kaggle competition). One example is the 3rd place solution from KGMON + Merlin team for Otto comp, where they used 20 different strategies to generate candidates from CoVisitation.

Goal:

This RMP proposes the implementation of CoVisitation candidate generation into Merlin.

New Functionality

NVTabular

1- CoVisitation Matrix Generation Implement a CoVisitMatrix op in NVT backed by dataframe ops to compute a co-visit matrix (item_a, item_b, weight), allowing to configure Lambda ops for custom co-visit filter (i.e. what should be considered a co-visit, e.g. clicks, only purchases) and weighting strategy (i.e. how to weight multiple co-occurrence of items for the same user/session sequences and for multiple sequences).

Systems

2 - Candidate Generation from a CoVisitation Matrix

3 - Candidate Generation ensemble

Starting Point:

zippeurfou commented 1 year ago

This feature would be very helpful for my team! Having the ability to do 3 can be quite tricky at runtime as well and this is not something I have seen implementations yet. Having a solution there even if naive can help getting a PoC out very quickly! As a side note, I like the grouping between heuristic, statistic and models. I would add to this that this is a great first statistic approach but as described in the original comment there are others and such methods and I can see this being a great first step to introduce the logic and I am hoping this is the first of a series :). I am mentioning this because CG also help to implement certain business logic (eg. if you are cnn.com you might want to have one statistic towards recent news for editorial purpose, another example is to have random being part of the CS for exploration purpose) and can be used to artificially increase side metric (eg. diversity for example MIND does this, coverage...). Today with Merlin it is not possible out of the box to cover theses use cases which is something needed for my team.

karlhigley commented 1 year ago

Since the scores from different candidate generators aren't guaranteed to be comparable, my favorite way to combine them is the "log-rank" trick: take the candidates from each source, compute the log of the items' rank within each candidate set, and then use that as a new score to rank the items when combining the lists. Doing it that way produces scores that are on the same scale and have the same distribution for each candidate source, which makes merging the candidate sets more straightforward.

(Technically, you compute log(max_rank) - log(candidate_item_rank) so that the items at the top of the list have the highest scores and so that summing the scores for an item that appears in multiple candidate sets works the way you'd expect, bumping that item higher up the list.)

zippeurfou commented 1 year ago

Any updates on this one?

karlhigley commented 1 year ago

@gabrielspmoreira This is a pretty big lift, and I think I probably wouldn't implement it exactly as proposed by the issue description. For example, I probably wouldn't try to keep a whole co-visits matrix in memory, I'd likely break it out into a list of most frequently co-visited items for each item and then store that in a key-value store, so that each item's frequent co-visits can be looked up at high-speed/low cost without spending memory on the whole catalog all at once.

I also think "combining candidate sources" could be a whole RMP in and of itself, so we might want to come back to this as a backlog grooming item to refine into smaller chunks of work that are more readily actionable.

zippeurfou commented 1 year ago

@karlhigley I would also mention reciprocal rank fusion or quantile normalization rather than log rank trick. There might need some research there as well to see if there are other strategies.