[RMP] Statistical Candidate Generation based on CoVisitation for two-stage recsys

gabrielspmoreira commented 1 year ago

Problem:

When the item catalog is very large (e.g. tens or hundreds of millions), companies typically implement a two-stage recsys pipeline, where (1) candidate generation models provide a number of candidate items for a given user or session and (2) a ranking model with more powerful features and architecture performs the final ranking. The candidate generation models can be based on heuristics (e.g. most recent items), statistics (e.g. most popular, most co-visited items per session or per user) or machine learning (e.g. word2vec, MF, Two-Tower, YouTubeDNN), among others. This RMP addresses the CoVisitation candidate generation algorithm, which has been very successful in the last competitions (Dressipi RecSys Challenge, Otto Kaggle competition). One example is the 3rd place solution from KGMON + Merlin team for Otto comp, where they used 20 different strategies to generate candidates from CoVisitation.

Goal:

This RMP proposes the implementation of CoVisitation candidate generation into Merlin.

New Functionality

NVTabular

1- CoVisitation Matrix Generation Implement a CoVisitMatrix op in NVT backed by dataframe ops to compute a co-visit matrix (item_a, item_b, weight), allowing to configure Lambda ops for custom co-visit filter (i.e. what should be considered a co-visit, e.g. clicks, only purchases) and weighting strategy (i.e. how to weight multiple co-occurrence of items for the same user/session sequences and for multiple sequences).

Systems

2 - Candidate Generation from a CoVisitation Matrix

Implement a CandidatesFromCoVisit op for NVTabular & Systems that takes one Co-visit matrix dataframe and a column with the sequence of session/user items, and optional Lambda op for custom weight aggregation. It outputs two list features (candidate item ids and candidate item scores)for each session/users based, sorted by the candidate scores

3 - Candidate Generation ensemble

Create an op for Systems that allows combining candidate items from many candidate generation models (whether CoVisitation or models like MF,TwoTower), providing some option to choose how to score the candidates (e.g. frequency, sum of candidate scores, etc)

Starting Point:

Description of the 3rd place solution from KGMON + Merlin team for Otto comp, by @cdeotte
- Fast code to produce co-visit is with cuDF and GPU is here (anything named "GPU-XXX): link
- Fast code to apply co-visit to user history is here (code named "model-709"): link
Slack threads with more info about this proposal: here and here

zippeurfou commented 1 year ago

This feature would be very helpful for my team! Having the ability to do 3 can be quite tricky at runtime as well and this is not something I have seen implementations yet. Having a solution there even if naive can help getting a PoC out very quickly! As a side note, I like the grouping between heuristic, statistic and models. I would add to this that this is a great first statistic approach but as described in the original comment there are others and such methods and I can see this being a great first step to introduce the logic and I am hoping this is the first of a series :). I am mentioning this because CG also help to implement certain business logic (eg. if you are cnn.com you might want to have one statistic towards recent news for editorial purpose, another example is to have random being part of the CS for exploration purpose) and can be used to artificially increase side metric (eg. diversity for example MIND does this, coverage...). Today with Merlin it is not possible out of the box to cover theses use cases which is something needed for my team.

karlhigley commented 1 year ago

Since the scores from different candidate generators aren't guaranteed to be comparable, my favorite way to combine them is the "log-rank" trick: take the candidates from each source, compute the log of the items' rank within each candidate set, and then use that as a new score to rank the items when combining the lists. Doing it that way produces scores that are on the same scale and have the same distribution for each candidate source, which makes merging the candidate sets more straightforward.

(Technically, you compute log(max_rank) - log(candidate_item_rank) so that the items at the top of the list have the highest scores and so that summing the scores for an item that appears in multiple candidate sets works the way you'd expect, bumping that item higher up the list.)

zippeurfou commented 1 year ago

Any updates on this one?

karlhigley commented 1 year ago

@gabrielspmoreira This is a pretty big lift, and I think I probably wouldn't implement it exactly as proposed by the issue description. For example, I probably wouldn't try to keep a whole co-visits matrix in memory, I'd likely break it out into a list of most frequently co-visited items for each item and then store that in a key-value store, so that each item's frequent co-visits can be looked up at high-speed/low cost without spending memory on the whole catalog all at once.

I also think "combining candidate sources" could be a whole RMP in and of itself, so we might want to come back to this as a backlog grooming item to refine into smaller chunks of work that are more readily actionable.

zippeurfou commented 1 year ago

@karlhigley I would also mention reciprocal rank fusion or quantile normalization rather than log rank trick. There might need some research there as well to see if there are other strategies.

NVIDIA-Merlin / Merlin