SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

Scaling Multi-Armed Bandits #666

Closed: skonto closed this issue 1 year ago

skonto commented 5 years ago

Hi guys!

I know Seldon supports custom routers but how about adding this nice work to the core: https://dbis.ipd.kit.edu/download/S-MAB_FOUCHE_KDD19.pdf https://github.com/edouardfouche/S-MAB

It would be ideal when you are running parallel tests on a large number of models and, at the end of the day, want to quickly eliminate the poor ones in production.

jklaise commented 5 years ago

Thanks for the interesting reference @skonto!

If I understand correctly, this is a way of dynamically choosing 0<=L<=K arms at each round to maximise the reward. This is quite different from the usual case, where the choice consists of picking 1 of the K arms at each round. What I'm wondering is, how would choosing >1 arms at each round correspond to the ML routing use case? Given predictions from more than one model, one still has to somehow make a single prediction/recommendation in production (i.e. down the line there is only one source of reward for each instance the prediction is made for).

skonto commented 5 years ago

Hi @jklaise! My idea is as follows. In the multi-armed bandit problem (ε-greedy strategy) you explore with probability ε and exploit with probability 1-ε. You can make a better choice by estimating which action is optimal (Thompson Sampling). To do this dynamically as stream data arrives, you can use Algorithm 4 in the paper. You still select one arm to get a prediction, but you keep an ADWIN instance per arm (model), with the time window being the smallest across all instances, since some instances will not be updated every round (only the selected arm yields a reward). The idea is to use the KL-S policy to dynamically adjust the set of arms you explore. At some point the bad models should stabilise at a low reward probability and you end up with a few candidates. That is my intuition here. Btw, the multi-armed bandit approach may be a slow process, as indicated here; I am wondering how fast it converges. You probably also need a null hypothesis test on the average performance of the ideal model vs the challenger you picked.
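To make the idea concrete, here is a rough sketch (not Seldon code, and not the paper's algorithm verbatim): one ADWIN detector per model, updated only when that model is routed to, with plain Thompson Sampling standing in for the KL-S policy. The ADWIN API (`add_element`/`detected_change`/`estimation`) is assumed from scikit-multiflow.

```python
import numpy as np
from skmultiflow.drift_detection import ADWIN

class AdaptiveThompsonRouter:
    """Sketch: per-arm ADWIN change detection + Thompson Sampling routing."""

    def __init__(self, n_models, delta=0.002, seed=None):
        self.rng = np.random.default_rng(seed)
        self.detectors = [ADWIN(delta=delta) for _ in range(n_models)]
        # Beta(1, 1) prior per model; rewards are assumed to be 0/1.
        self.successes = np.ones(n_models)
        self.failures = np.ones(n_models)

    def select(self):
        # Sample a reward probability per arm and route to the best draw.
        samples = self.rng.beta(self.successes, self.failures)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # Only the selected arm observes a reward on a given round.
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward
        self.detectors[arm].add_element(float(reward))
        if self.detectors[arm].detected_change():
            # The arm's reward distribution shifted: reset its posterior.
            self.successes[arm] = 1.0
            self.failures[arm] = 1.0

    def surviving_arms(self, threshold=0.5):
        # Arms whose ADWIN-estimated mean reward stays above a cutoff;
        # everything below it is a candidate for elimination.
        return [i for i, d in enumerate(self.detectors) if d.estimation >= threshold]
```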

jklaise commented 5 years ago

@skonto, very interesting, thanks for the detailed response. I'm still not quite sure how this would work in the model selection setting, as the bandits discussed here are MP-MABs (multiple-play bandits, i.e. at every round you select L>=1 arms, not just one, and consequently you receive a reward for each of the L arms pulled). Unless we stick to L=1 with ADWIN to have an adaptive MAB.

Re. ADWIN, we're actually thinking of implementing this as a simple concept drift method in Alibi, so this could align well for the future.

Finally, I find that the whole debate of A/B testing vs MABs is very application dependent. If you want estimates of which variant is better, by all means use A/B testing (and definitely go Bayesian! Frequentist A/B testing will not give you the answers you want...), but my opinion is that for a lot of business use cases this is not what is actually needed, so an online MAB algorithm would be preferable. In the end, A/B testing is just a MAB with a fixed pure-exploration phase followed by pure exploitation.
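To illustrate that last point, a toy sketch (illustrative only, no Seldon routing API assumed) of "A/B testing as explore-then-commit" next to a constantly exploring ε-greedy policy:

```python
import random

def ab_test_policy(t, n_arms, explore_steps, best_arm_so_far):
    """A/B testing viewed as a bandit: a fixed pure-exploration phase, then pure exploitation."""
    if t < explore_steps:
        return random.randrange(n_arms)   # uniform traffic split (the A/B phase)
    return best_arm_so_far                # commit to the measured winner

def epsilon_greedy_policy(n_arms, best_arm_so_far, epsilon=0.1):
    """An online MAB instead keeps a small, constant exploration rate forever."""
    if random.random() < epsilon:
        return random.randrange(n_arms)
    return best_arm_so_far
```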

skonto commented 5 years ago

> Unless we stick to L=1 with ADWIN to have an adaptive MAB.

@jklaise Thanks for sharing your thoughts! That is one option: use ADWIN to compare the models' average performance in parallel (usually 2, champion and challenger) and decide after some time which one is best, right? Or use ADWIN to adapt which subset of models you pick for the exploration phase.

The other option I was thinking about is to either: a) propagate the previous reward values to the next round, since this is the current info you have for every arm, and algorithm MP-TS(Lt) (line 14) only requires an observation reward vector, it does not check what has been updated; or b) run the rounds, say, at the end of a window (t increases) where you have collected enough rewards, assuming ADWIN didn't detect any change in the average value of an arm. If a change is detected, we could fall back to a) for the arms that didn't have rewards.

My goal is to find the optimal number of arms I need to keep for the next round (scale up, or don't scale down). I understand that the model selection problem is different from the standard problem where you get multiple arms/rewards (e.g. multiple bets in a casino), but my goal is to emulate the latter so I end up with a subset of models (when you have many of them) as good alternatives to the champion faster. Otherwise, if you have many arms (as stated in the paper for data monitoring), you will need to update their rank at every step, O(n) where n is the total number of arms, and finally select the best. With this approach I start with n' < n and try to end up with only a subset dynamically. My intuition is that starting from n' models I could still leave out some good ones (if I start with a small subset of n), but in the end my objective is satisfied; I don't need all the models, e.g. I may have resource restrictions and can't load them all. The problem is how you pick n', or form that subset dynamically.
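For option a), something along these lines (illustrative names only, not taken from the S-MAB code): keep each arm's last observed reward and hand a full observation vector to the multiple-play round, even though only the played arms produced fresh feedback.

```python
import numpy as np

def build_observation_vector(last_rewards, played, fresh_rewards):
    """last_rewards: length-K array with each arm's most recent reward.
    played: indices of the L arms pulled this round.
    fresh_rewards: rewards actually observed for those L arms."""
    obs = np.asarray(last_rewards, dtype=float).copy()  # stale values for idle arms
    obs[np.asarray(played)] = fresh_rewards             # overwrite with this round's feedback
    return obs
```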

skonto commented 5 years ago

> Re. ADWIN, we're actually thinking of implementing this as a simple concept drift method in Alibi, so this could align well for the future.

We could collaborate on that.

jklaise commented 5 years ago

@skonto I need to think a bit more about the MAB use case. I agree that in practice the "rounds" do not correspond to the usual action->immediate reward, but are more likely to have either delayed rewards or aggregated rewards (e.g. it is common for A/B testing use cases to evaluate "rewards" at fixed time intervals). In this case the method of choosing a subset of L<K arms to play during a given period before receiving rewards would make sense.

About ADWIN and Alibi, that would be great! We are quite early on, so we have not decided what the API will look like (suggestions welcome), but there is some inspiration from skmultiflow and also creme. I think the key will be making it quite general, e.g. the ability to track multiple summary statistics of either the data x, the labels y, or the conditional distribution p(y|x) (which is the real concept drift definition).
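As a strawman for discussion (purely speculative, not an existing Alibi or creme interface), the kind of generality I have in mind could look like one change detector per tracked statistic, so the same object can monitor statistics of x, of y, or of the model's p(y|x) scores; ADWIN here is again assumed from scikit-multiflow.

```python
from skmultiflow.drift_detection import ADWIN

class StatDriftDetector:
    """Sketch: wrap one ADWIN instance per user-supplied summary statistic."""

    def __init__(self, statistics, delta=0.002):
        # statistics: dict mapping a name to a callable (x, y, p) -> float, e.g.
        # {"x_mean": lambda x, y, p: float(x.mean()),
        #  "p_max":  lambda x, y, p: float(p.max())}
        self.statistics = statistics
        self.detectors = {name: ADWIN(delta=delta) for name in statistics}

    def update(self, x=None, y=None, p=None):
        drifted = []
        for name, stat in self.statistics.items():
            self.detectors[name].add_element(stat(x, y, p))
            if self.detectors[name].detected_change():
                drifted.append(name)
        return drifted  # names of the statistics whose distribution has shifted
```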

ukclivecox commented 5 years ago

We'd like to return to MABs in the spec we are evaluating: https://github.com/SeldonIO/mlgraph. If you would be interested in collaborating on this, please contact us.

skonto commented 4 years ago

@cliveseldon Sure, I am interested in collaborating on this. What is the best way to communicate with you? Slack?

ukclivecox commented 4 years ago

Hi @skonto. Yes, a Slack conversation initially, and then we can take it from there?

seldondev commented 4 years ago

Issues go stale after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale

ukclivecox commented 2 years ago

We will want to revisit MABs with the V2 and Iter8 integration.

ukclivecox commented 1 year ago

We should revisit this in the v2 roadmap; closing for now.