david-cortes / contextualbandits

Python implementations of contextual bandits algorithms
http://contextual-bandits.readthedocs.io
BSD 2-Clause "Simplified" License

Basic example with smallest dataset/random rewards (onboarding) #47

Closed. qathom closed this issue 2 years ago.

qathom commented 2 years ago

Hello @david-cortes, thanks for making contextualbandits! I'm currently exploring solutions for recommending products based on user preferences, and contextual bandits look like an interesting approach. In this small gist (105 lines), I built a simulation of conversation turns in which item scores are updated as user preferences change. If the user likes a recommendation (say the system presents at most 3 items), the recommender system (RS) also takes the liked items into account by finding similar items via cosine similarity.

I would love to use contextualbandits to run experiments, but I'm not sure how to use it with a very simple dataset like the one in my gist: basically, a vector of features per item. I would be more than grateful for any help or advice.

david-cortes commented 2 years ago

I'm not sure if I understand it correctly.

From what I gather, you have both user features and item features and you observe both; at each turn you make a recommendation for a different user, and each potential recommendation has a continuous score. That is not the kind of scenario this library deals with.

If the item embeddings are supposed to be invisible / not available to the algorithm, and there is a threshold on the obtained scores to make them binary (reward vs. no reward), then it does sound like the kind of problem this library handles: you could treat the items as arms (you'll need to enumerate them) and the user vector as features (you'll need to convert it to a matrix with 1 row and pass it as a numpy array).
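As a minimal sketch of that mapping (the sizes and data here are hypothetical placeholders; `EpsilonGreedy` is just one of the library's online policies, and the others follow the same `fit`/`predict` pattern):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from contextualbandits.online import EpsilonGreedy

n_items = 10           # arms = the enumerated items (hypothetical count)
n_user_features = 5    # dimensionality of the user vector (hypothetical)

# One exploration policy from contextualbandits.online; the base
# algorithm can be any scikit-learn-style binary classifier
policy = EpsilonGreedy(LogisticRegression(), nchoices=n_items)

# Simulated interaction history: contexts, chosen arms, binary rewards
X = np.random.random((100, n_user_features))
a = np.random.randint(n_items, size=100)
r = np.random.randint(2, size=100)
policy.fit(X, a, r)

# The single user's vector, reshaped to a 1-row matrix as described above
user_vec = np.random.random(n_user_features).reshape(1, -1)
chosen_item = policy.predict(user_vec)  # -> array with one arm index
```

Here `predict` returns the arm (item) the policy would recommend for that user context.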

But be aware that: (a) by default this software will fall back to a non-contextual MAB while the number of observed data points is small, so if you explicitly want it to be contextual while running for only a few rounds, you'll have to check the specific parameters you are using; (b) if you know the specific reward-generating function and the algorithm is supposed to exploit that knowledge, you might want to select a classifier and its hyperparameters accordingly instead of following the example notebooks.
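For point (a), a hedged sketch: in this library the small-sample fallback behavior is governed by the `beta_prior` argument of the online policies (with `smoothing` as a related option), so passing `beta_prior=None` should make the policy rely on the base classifier from the very first rounds:

```python
from sklearn.linear_model import LogisticRegression
from contextualbandits.online import EpsilonGreedy

# beta_prior=None disables the initial Beta-prior (non-contextual) phase
# that the policies otherwise use while data is scarce
policy = EpsilonGreedy(LogisticRegression(), nchoices=10, beta_prior=None)
```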

qathom commented 2 years ago

Thanks for your prompt reply @david-cortes, I really appreciate it! The simulation is for a single user (user_embeddings). The turns simulate a conversation in which the user gives their preferences step by step (see the comments at lines 83+ of the gist).

Indeed, the current gist returns a continuous value between 0 and 1, where 1 means a 100% match between user preferences and item features, but I can try to define a threshold to turn it into 1 or 0 (reward / no reward).
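A hedged sketch of that thresholding (the cutoff value is arbitrary and would need tuning):

```python
THRESHOLD = 0.8  # hypothetical cutoff on the [0, 1] match score

def binarize_reward(score: float) -> int:
    """Turn the gist's continuous user-item match score into a binary reward."""
    return int(score >= THRESHOLD)
```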

I think a contextual MAB makes sense for my project (a chatbot that asks questions about user preferences) because the idea is to use a hybrid approach when recommending items (user-item and item-item filtering). The gist tries to illustrate both concepts. One idea: the goal of the algorithm is to find the best "end result" by maximizing two reward functions (user-item and item-item); a sketch of such a loop follows below.
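To tie the pieces together, here is a hedged sketch of such a conversation loop under the two-reward idea. Everything here is a stand-in for the gist: `user_item_score` and `item_item_score` mimic its cosine similarities, the two signals are averaged and thresholded into one binary reward, and the policy is simply refit on the accumulated history each turn (wasteful, but it keeps the sketch short):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from contextualbandits.online import EpsilonGreedy

rng = np.random.default_rng(0)
n_items, n_features, n_turns = 10, 5, 50   # hypothetical sizes
THRESHOLD = 0.8                            # hypothetical reward cutoff

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

item_embeddings = rng.random((n_items, n_features))  # stand-in for the gist's items
user_vec = rng.random(n_features)                    # stand-in for user_embeddings
liked = []                                           # items the user liked so far

policy = EpsilonGreedy(LogisticRegression(), nchoices=n_items, beta_prior=None)
X_hist, a_hist, r_hist = [], [], []

for turn in range(n_turns):
    if turn < 2 * n_items:
        arm = int(rng.integers(n_items))             # random warm-up rounds
    else:
        arm = int(policy.predict(user_vec.reshape(1, -1))[0])

    # Reward 1: user-item match; Reward 2: similarity to already-liked items
    user_item_score = cosine(user_vec, item_embeddings[arm])
    item_item_score = max((cosine(item_embeddings[j], item_embeddings[arm])
                           for j in liked), default=0.0)
    reward = int(0.5 * (user_item_score + item_item_score) >= THRESHOLD)
    if reward:
        liked.append(arm)

    # Refit on the full history of (context, arm, reward) triplets
    X_hist.append(user_vec); a_hist.append(arm); r_hist.append(reward)
    policy.fit(np.asarray(X_hist), np.asarray(a_hist), np.asarray(r_hist))
```

How to weight the two reward signals (here a plain average) is exactly the design decision the question raises, and nothing in the library forces this particular choice.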