ShopRunner / collie

A library for preparing, training, and evaluating scalable deep learning hybrid recommender systems using PyTorch.
https://collie.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Adding new users and items to an existing model or removing old ones #28

Open sgaseretto opened 3 years ago

sgaseretto commented 3 years ago

Problem Description

Imagine you have trained a model that is pretty good at recommending items to your users. Then new users arrive in your system and you collect their interactions; you have also grown your product catalog and want to recommend the new products too. How can you take your already pretrained model and initialize new embeddings for these new users and items (that is, extend your embedding tables)? The reverse should also be possible: some users delete their accounts and (because of GDPR or some other regulation) request that their data be removed, and some of your products are discontinued, so you no longer need their embedding representations. In other words, the request is for the ability to add new embeddings to an existing, already pretrained model, or to shrink it by deleting unnecessary embeddings.

Ideal Solution

It would be nice to have methods that allow us to add and remove user and item embeddings, and to re-index them, in an already pretrained model.
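To make the request concrete, here is a minimal sketch of the bookkeeping involved, using plain Python lists in place of a real embedding layer. The function names (`add_rows`, `remove_rows`, `reindex_interactions`) are hypothetical and are not part of Collie's API; a real implementation would also have to copy the surviving weights into a resized PyTorch embedding layer.

```python
def add_rows(table, new_rows):
    """Return a new table with new_rows appended; existing ids keep their position."""
    return [list(row) for row in table] + [list(row) for row in new_rows]


def remove_rows(table, ids_to_remove):
    """Drop rows and return (new_table, old_to_new), where old_to_new maps
    each surviving old id to its new, contiguous id."""
    drop = set(ids_to_remove)
    new_table, old_to_new = [], {}
    for old_id, row in enumerate(table):
        if old_id in drop:
            continue
        old_to_new[old_id] = len(new_table)
        new_table.append(list(row))
    return new_table, old_to_new


def reindex_interactions(interactions, user_map, item_map):
    """Keep only (user, item) pairs whose ids survived, translated to the new ids."""
    return [
        (user_map[u], item_map[i])
        for u, i in interactions
        if u in user_map and i in item_map
    ]


# Toy data: three users with 2-dimensional embeddings and two items.
user_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
interactions = [(0, 0), (1, 1), (2, 0)]

# User 1 asks to be deleted: drop the row, then re-index the interactions.
user_table, user_map = remove_rows(user_table, [1])
item_map = {0: 0, 1: 1}  # no items removed in this example
interactions = reindex_interactions(interactions, user_map, item_map)

# A new user arrives: append a freshly initialized row (zeros here for simplicity).
user_table = add_rows(user_table, [[0.0, 0.0]])
```

The point of returning the old-to-new mapping is that the embedding table and the interactions data have to be re-indexed together, otherwise the ids silently drift apart.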

nathancooperjones commented 3 years ago

Hi Sebastian - sorry for the delayed response on my end!

I think this idea sounds like a great addition to the library!

The first part of this idea, expanding a model to new users, is something I made some rough pseudocode for in the blog post here, but I think it sounds like a great idea to formalize that into the BasePipeline class and have this apply to all models. Also, this is something that we can eventually address by building off of the cold start work done here: https://github.com/ShopRunner/collie/blob/0d62ae1e0194a64b7f3841b40ea56a573ba95268/collie/model/cold_start_matrix_factorization.py#L53-L54

For the removing users idea - that seems like it should be pretty easy to incorporate into the model. I think it makes sense to just zero-out those embeddings completely for either the user(s) or item(s) to remove, unless we find a clever and efficient way to do the re-indexing for both the Interactions and model objects together. My only concern with that, as far as GDPR goes, is that technically, we will have still learned from that user's behavior in the model (item embeddings will be influenced by that user's past behavior), so even if we remove the user from the model, their past behavior will technically still impact the model results until a new one is trained. I'm very curious to hear your thoughts on that - do you think this is a big issue?

If you're up for it, it'd be fantastic if you could contribute this into Collie. If not, we can keep this issue open and it'll be something I try to work on when I get some free time.

Cheers!

sgaseretto commented 2 years ago

First of all, sorry for my really late response to this; the GitHub email notification got lost among my other emails.

About the GDPR concern, I think (but I'm not a lawyer) that item embeddings that were affected by the deleted users don't represent the user. Even if there was a product that was affected by only one user, pulling the two very close together, and in a very exceptional case giving them the same representation in the embedding space so that their cosine similarity is exactly 1, if you delete the user embedding and its interactions, there will not be any way to infer that the product representation was influenced by the deleted user.

nathancooperjones commented 2 years ago

I think that makes a lot of sense, but it's also worth doing some more extensive analysis when I have time. I think having this functionality in the library makes a lot of sense and could be really helpful!

ahuds001 commented 2 years ago

@nathancooperjones I don't have as much time as I used to but this looks worth tackling. Having looked at the ColdStart model briefly it appears that the solution for adding a new user or a new item to the BasePipeline is to use cosine similarity to find items/users that are similar in the metadata space and get a rec list based on some combination of those known items/users, is that correct?

Looking around Stack Exchange, I found a couple of other possibilities that may not be implementable given Collie's structure/setup, but I figured I'd throw them in here as well to get your thoughts in case they may also be options:

nathancooperjones commented 2 years ago

@ahuds001 if anyone can take on this work, it is you - thanks for volunteering to look into this!!! 🎉

At my previous company, we used Collie to generate recommendations. The way we generated cold start recommendations was similar to how you mentioned above, but with a slight twist. We trained a normal Collie model with known users and items - when we had a new user/item come in, we looked at its metadata and found the k most similar users/items included in the trained model. From there, instead of just combining the recommendations for the k most similar users/items to get the new user's/item's recommendations, we combined the embeddings for the k most similar users/items to get a new embedding representation for the new user/item.

Say we have a new user and, through some heuristics (e.g. for MovieLens, this could be similar location, similar age, similar favorite movie genre, etc.), we determine that the new user is similar to User A, B, and D in our existing model. If User A had a user embedding of [1, 2, 3], User B had a user embedding of [2, 3, 4], and User D had an embedding of [4, 5, 6], then the new user would have an embedding of the mean (or weighted mean, if you know how similar the new user is to each user) of [2.33, 3.33, 4.33].
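The arithmetic above can be sketched in a few lines of plain Python. The helper name `mean_embedding` and the example weights are made up for illustration and are not part of Collie; in practice the rows would come from the trained model's user embedding table.

```python
def mean_embedding(embeddings, weights=None):
    """Return the (optionally weighted) mean of a list of equal-length embeddings."""
    if not embeddings:
        raise ValueError("need at least one embedding to average")
    if weights is None:
        weights = [1.0] * len(embeddings)  # plain mean
    if len(weights) != len(embeddings):
        raise ValueError("need exactly one weight per embedding")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(embeddings[0])
    return [
        sum(w * e[d] for w, e in zip(weights, embeddings)) / total
        for d in range(dim)
    ]


user_a = [1, 2, 3]
user_b = [2, 3, 4]
user_d = [4, 5, 6]

new_user = mean_embedding([user_a, user_b, user_d])
print([round(x, 2) for x in new_user])  # [2.33, 3.33, 4.33]

# Weighted variant: hypothetical similarity scores of the new user to A, B, and D.
new_user_weighted = mean_embedding([user_a, user_b, user_d], weights=[0.5, 0.3, 0.2])
```

With the weighted variant, users that are more similar to the new user pull the new embedding closer to their own.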

If you don't have additional metadata available to determine similar users/items, I think the technique I used in my initial blog post here seems similar to what I think those StackOverflow posts described - for a new user, optimize the model on a single row keeping the item embeddings fixed. I think this is a good way to add new users/items in the model without requiring a full retrain or additional metadata involved.
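As a rough illustration of "optimize on a single row, keeping the item embeddings fixed", here is a dependency-free sketch. It is not the code from the blog post: the squared-error loss, the learning rate, and the function name `fit_new_user` are all assumptions chosen to keep the example small, and a real Collie model would typically use one of its implicit-feedback losses with a PyTorch optimizer instead.

```python
def fit_new_user(item_embeddings, targets, lr=0.05, epochs=500, l2=0.0):
    """Learn one user vector by gradient descent while item embeddings stay frozen.

    targets maps item id -> desired score for the new user.
    Minimizes 0.5 * sum((u . v_i - target_i)^2) + 0.5 * l2 * ||u||^2 over u only.
    """
    dim = len(item_embeddings[0])
    user = [0.0] * dim  # start the new row at zero
    for _ in range(epochs):
        grad = [l2 * u for u in user]  # gradient of the L2 penalty
        for item_id, target in targets.items():
            item = item_embeddings[item_id]
            error = sum(u * v for u, v in zip(user, item)) - target
            for d in range(dim):
                grad[d] += error * item[d]  # gradient w.r.t. the user row only
        user = [u - lr * g for u, g in zip(user, grad)]
    return user


# Frozen item embeddings from an already trained model (toy values).
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# The new user liked item 0 and did not like item 1.
new_user = fit_new_user(items, {0: 1.0, 1: 0.0})

# Score an item the user has not interacted with yet.
score_item_2 = sum(u * v for u, v in zip(new_user, items[2]))
```

Only the new row receives updates, so existing users' recommendations are unaffected and no full retrain is needed.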

Giant information dump here (sorry) - what do you think about these ideas?

sgaseretto commented 2 years ago

I'm implementing the same approach explained by @nathancooperjones for both cold/warm start users and items, and so far it is one of the best approaches I have found to bootstrap users and embeddings. If you have access to the first searches made by the cold users, you can take the user embeddings that are nearest to the search results the new user has interacted with, average them, and get a new embedding for this new user. You can then use that embedding for the next round of partial training of your model, without needing to randomly initialize this new "warm" user. If using matrix factorization, averaging either the item embeddings or the user embeddings should lead to fairly good results, since they are in the same vector space and both groups of embeddings should be "near" each other (I haven't done any experiments to prove this, but it seems like a fun test to try).
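A minimal sketch of that warm-start idea, assuming a matrix factorization model where user and item embeddings share one vector space. The names (`warm_start_user`, `centroid`), the use of cosine similarity, and the choice of `k` are illustrative assumptions, not the actual implementation described above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def warm_start_user(user_embeddings, item_embeddings, interacted_item_ids, k=2):
    """Average the k existing users nearest to the items the new user touched."""
    query = centroid([item_embeddings[i] for i in interacted_item_ids])
    ranked = sorted(
        range(len(user_embeddings)),
        key=lambda u: cosine(user_embeddings[u], query),
        reverse=True,
    )
    return centroid([user_embeddings[u] for u in ranked[:k]])


users = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]

# The new user clicked on item 0 in their first search results.
new_user = warm_start_user(users, items, interacted_item_ids=[0], k=2)
```

The resulting vector can serve as the initial value of the new user's row before the next round of partial training, in place of a random initialization.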

nathancooperjones commented 2 years ago

> If using matrix factorization, both averaging the item embeddings or the users embeddings should lead to fairly good results since they are in the same vector space and both groups of embeddings should be "near" each other (haven't done any experiments to prove this, but seems like a fun test to try)

I guess I can't share exact numbers or details on this experiment, but at my previous company, both offline metrics and a live A/B test showed that this method for handling cold start users/items led to significantly improved results!

ahuds001 commented 2 years ago

So at a high level this all makes sense, I just need to get my hands dirty. It sounds like we would need different solutions depending on the Collie model type. Would we still want this to live in the BasePipeline, or should it live independently within each model type, as the data available would be different? It sounds like the latter would make more sense and there would be 2 versions, one for the Hybrid models and one for the basic MatrixFactorization model.

Also I am not even considering the removal of users/items just yet, that should probably be a different PR?

nathancooperjones commented 2 years ago

I'm not sure how a cold start solution could work without the additional metadata (given in BasePipeline), but we could just have a method in BasePipeline that takes in existing users/items that are known to be similar to the cold start user/item, then aggregates their embeddings together to get the new embedding value?

What are you thinking for this?

ahuds001 commented 2 years ago

Hmm... I'm missing something, likely just lack of familiarity. Let me start tinkering and I'll come back with some questions in a week or two.