Open sgaseretto opened 3 years ago
Hi Sebastian - sorry for the delayed response on my end!
I think this idea sounds like a great addition to the library!
The first part of this idea, expanding a model to new users, is something I made some rough pseudocode for in the blog post here, but I think it sounds like a great idea to formalize that into the BasePipeline class and have this apply to all models. Also, this is something that we can eventually address by building off of the cold start work done here: https://github.com/ShopRunner/collie/blob/0d62ae1e0194a64b7f3841b40ea56a573ba95268/collie/model/cold_start_matrix_factorization.py#L53-L54
For the removing users idea - that seems like it should be pretty easy to incorporate into the model. I think it makes sense to just zero-out those embeddings completely for either the user(s) or item(s) to remove, unless we find a clever and efficient way to do the re-indexing for both the Interactions and model objects together. My only concern with that, as far as GDPR goes, is that technically, we will have still learned from that user's behavior in the model (item embeddings will be influenced by that user's past behavior), so even if we remove the user from the model, their past behavior will technically still impact the model results until a new one is trained. I'm very curious to hear your thoughts on that - do you think this is a big issue?
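To make the zero-out idea concrete, here is a minimal NumPy sketch. It is not Collie code: `zero_out_rows` and the toy `user_embeddings` matrix are hypothetical, and a real model would likely need the same treatment for any bias terms. It only shows that zeroing rows keeps every other ID stable, so no re-indexing is needed.

```python
import numpy as np


def zero_out_rows(embeddings: np.ndarray, ids_to_remove) -> np.ndarray:
    """Return a copy of `embeddings` with the rows for the removed IDs set to zero.

    Hypothetical helper, not part of Collie. All other rows keep their index.
    """
    cleaned = embeddings.copy()
    cleaned[list(ids_to_remove)] = 0.0
    return cleaned


# Toy user embedding matrix: one row per user ID.
user_embeddings = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
])

# Remove user 1; users 0 and 2 are untouched and keep their IDs.
cleaned = zero_out_rows(user_embeddings, [1])
```

Note that this does nothing about the concern above: the item embeddings were still trained on the removed user's interactions.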
If you're up for it, it'd be fantastic if you could contribute this into Collie. If not, we can keep this issue open and it'll be something I try to work on when I get some free time.
Cheers!
First of all, sorry for my really late response to this, the GitHub email notification got lost among my other emails.
About the GDPR concern, I think (but I'm not a lawyer) that item embeddings that were affected by the deleted users don't represent the user. Even if there were a product that was affected by only one user, placing the two very close together, and in a very exceptional case giving them the same representation in the embedding space so that their cosine similarity is exactly 1, if you delete the user embedding and its interactions, there will not be any way to infer that the product representation was influenced by the deleted user.
I think that makes a lot of sense, but it's also worth doing some more extensive analysis when I have time. I think having this functionality in the library makes a lot of sense and could be really helpful!
@nathancooperjones I don't have as much time as I used to but this looks worth tackling. Having looked at the ColdStart model briefly it appears that the solution for adding a new user or a new item to the BasePipeline is to use cosine similarity to find items/users that are similar in the metadata space and get a rec list based on some combination of those known items/users, is that correct?
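As a rough illustration of the "similar in the metadata space" step, a sketch in plain NumPy might look like the following. The function name `top_k_similar` and the toy metadata vectors are made up for this example, and it assumes the metadata has already been encoded as numeric vectors; it is not how the ColdStart model is implemented.

```python
import numpy as np


def top_k_similar(new_metadata, known_metadata, k=3):
    """Return the indices and cosine similarities of the k known rows
    most similar to `new_metadata`. Hypothetical helper for illustration.
    """
    new_vec = np.asarray(new_metadata, dtype=float)
    known = np.asarray(known_metadata, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(known, axis=1) * np.linalg.norm(new_vec)
    sims = known @ new_vec / np.maximum(norms, 1e-12)
    top = np.argsort(-sims)[:k]
    return top, sims[top]


# Toy metadata: one row per known user, already encoded as numbers.
known_metadata = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
similar_ids, similarities = top_k_similar([1.0, 0.05, 0.0], known_metadata, k=2)
```

The returned similarities could double as weights if the neighbors are later combined with a weighted mean.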
Looking around statsexchange I found a couple of other possibilities that may not be implementable given Collie's structure/setup, but I figured I'd throw them in here as well to get your thoughts, in case they may also be options:
@ahuds001 if anyone can take on this work, it is you - thanks for volunteering to look into this!!! 🎉
At my previous company, we used Collie to generate recommendations. The way we generated cold start recommendations was similar to how you mentioned above, but with a slight twist. We trained a normal Collie model with known users and items - when we had a new user/item come in, we looked at its metadata and found the k most similar users/items included in the trained model. From there, instead of just combining the recommendations for the k most similar users/items to get the new user's/item's recommendations, we combined the embeddings for the k most similar users/items to get a new embedding representation for the new user/item.
Say we have a new user and, through some heuristics (e.g. for MovieLens, this could be similar location, similar age, similar favorite movie genre, etc.), we determine that the new user is similar to User A, B, and D in our existing model. If User A had a user embedding of [1, 2, 3], User B had a user embedding of [2, 3, 4], and User D had an embedding of [4, 5, 6], then the new user would have an embedding of the mean (or weighted mean, if you know how similar the new user is to each user) of [2.33, 3.33, 4.33].
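The arithmetic in that example can be checked with a few lines of NumPy. This is just a sketch: `combine_embeddings` is a hypothetical name, and the weights in the second call are invented to show the weighted-mean variant.

```python
import numpy as np

# Embeddings for Users A, B, and D from the example above.
similar_user_embeddings = np.array([
    [1.0, 2.0, 3.0],  # User A
    [2.0, 3.0, 4.0],  # User B
    [4.0, 5.0, 6.0],  # User D
])


def combine_embeddings(embeddings, weights=None):
    """Mean of the rows, or a weighted mean if `weights` is given."""
    return np.average(embeddings, axis=0, weights=weights)


# Plain mean: approximately [2.33, 3.33, 4.33].
new_user_embedding = combine_embeddings(similar_user_embeddings)

# Weighted mean, with made-up similarity weights for A, B, and D.
weighted_embedding = combine_embeddings(similar_user_embeddings, weights=[0.5, 0.3, 0.2])
```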
If you don't have additional metadata available to determine similar users/items, I think the technique I used in my initial blog post here seems similar to what I think those StackOverflow posts described - for a new user, optimize the model on a single row keeping the item embeddings fixed. I think this is a good way to add new users/items in the model without requiring a full retrain or additional metadata involved.
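For anyone following along, here is a small NumPy sketch of the "optimize a single row with the item embeddings fixed" idea. It is not the code from the blog post: `fold_in_user`, the squared-error loss with L2 regularization, and all hyperparameters are assumptions chosen to keep the example short.

```python
import numpy as np


def fold_in_user(item_embeddings, item_ids, ratings,
                 lr=0.05, epochs=500, l2=0.01, seed=0):
    """Learn one new user embedding by gradient descent.

    Only the new user vector is updated; `item_embeddings` is never modified.
    Hypothetical helper, assuming a dot-product model and squared-error loss.
    """
    rng = np.random.default_rng(seed)
    items = item_embeddings[item_ids]  # fancy indexing returns a copy
    ratings = np.asarray(ratings, dtype=float)
    user = rng.normal(scale=0.1, size=item_embeddings.shape[1])
    for _ in range(epochs):
        errors = items @ user - ratings
        # Gradient of mean squared error plus the L2 penalty on the user vector.
        grad = 2 * items.T @ errors / len(ratings) + 2 * l2 * user
        user -= lr * grad
    return user


# Toy, already-trained item embeddings (kept fixed).
item_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
frozen_copy = item_embeddings.copy()

# The new user rated item 0 with 4.0 and item 1 with 2.0.
new_user = fold_in_user(item_embeddings, [0, 1], [4.0, 2.0])

# Score for the unseen item 2 under the dot-product model.
predicted_item_2 = float(item_embeddings[2] @ new_user)
```

In a PyTorch model the equivalent would presumably be freezing the item parameters and running the optimizer over the new row only.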
Giant information dump here (sorry) - what do you think about these ideas?
I'm implementing the same approach explained by @nathancooperjones both for cold/warm start users and items, and so far it is one of the best approaches I have found to bootstrap users and embeddings. If you have access to the first searches made by the cold users, you can get the user embeddings that are nearest to the search results with which the new user has interacted, average them, and have a new embedding for this new user, which you can then use for the next round of partial training of your model, without needing to randomly initialize this new "warm" user. If using matrix factorization, averaging either the item embeddings or the user embeddings should lead to fairly good results, since they are in the same vector space and both groups of embeddings should be "near" each other (I haven't done any experiments to prove this, but it seems like a fun test to try).
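A rough sketch of that warm-start idea, assuming a matrix factorization model where users and items share one vector space. The helper name `warm_start_user`, the toy matrices, and the use of Euclidean distance to pick the nearest users are all choices made for this example.

```python
import numpy as np


def warm_start_user(item_embeddings, user_embeddings, interacted_item_ids, k=2):
    """Build an initial embedding for a new user from their first interactions.

    Averages the interacted items into a query point, finds the k existing
    users closest to it, and averages those users' embeddings.
    """
    query = item_embeddings[interacted_item_ids].mean(axis=0)
    distances = np.linalg.norm(user_embeddings - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return user_embeddings[nearest].mean(axis=0), nearest


item_embeddings = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])
user_embeddings = np.array([[0.9, 0.1], [1.0, 0.0], [0.0, 1.0]])

# The new user clicked on items 0 and 1 in their first searches.
warm_embedding, nearest_users = warm_start_user(item_embeddings, user_embeddings, [0, 1])
```

The simpler variant mentioned above, using the averaged item embeddings directly, is just the `query` vector inside the function.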
I guess I can't share exact numbers or details on this experiment, but at my previous company, both offline metrics and a live A/B test showed this method for handling cold start users/items led to significantly improved results!
So at a high level this all makes sense, I just need to get my hands dirty. It sounds like we would need to have different solutions depending on the Collie model type. Would we still want this to live in the BasePipeline, or should it live independently within each model type, as the data available would be different? It sounds like the latter would make more sense and there would be 2 versions, one for the Hybrid models and one for the basic MatrixFactorization model.
Also I am not even considering the removal of users/items just yet, that should probably be a different PR?
I'm not sure how a cold start solution could work without the additional metadata (given in BasePipeline), but we could just have a method in BasePipeline that takes in existing users/items that are known to be similar with the cold start user/item, then aggregate them together to get the embedding value?
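One possible shape for such a method, sketched on a stand-in class rather than on BasePipeline itself. `EmbeddingTable` and `add_from_similar` are hypothetical names, and appending a row to a NumPy array stands in for whatever resizing a real embedding layer would need.

```python
import numpy as np


class EmbeddingTable:
    """Hypothetical stand-in for a model's user (or item) embedding layer."""

    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def add_from_similar(self, similar_ids, similarities=None):
        """Append a row built from known-similar rows and return its new ID.

        Uses a plain mean, or a weighted mean if `similarities` is given.
        """
        new_row = np.average(self.weights[similar_ids], axis=0, weights=similarities)
        self.weights = np.vstack([self.weights, new_row])
        return len(self.weights) - 1


# Rows 0, 1, and 3 play the role of Users A, B, and D; row 2 is an unrelated user.
table = EmbeddingTable([
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [9.0, 9.0, 9.0],
    [4.0, 5.0, 6.0],
])
new_user_id = table.add_from_similar([0, 1, 3])
```

Keeping the method on the embedding container means the caller only has to supply the similar IDs, however they were found.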
What are you thinking for this?
Hmm... I'm missing something, likely just lack of familiarity. Let me start tinkering and I'll come back with some questions in a week or two.