fidelity / mabwiser

[IJAIT 2021] MABWiser: Contextual Multi-Armed Bandits Library
https://fidelity.github.io/mabwiser/
Apache License 2.0

Thompson Sampling for Gaussian priors? #49

Closed jaywonchung closed 2 years ago

jaywonchung commented 2 years ago

Hi!

I learned that mabwiser does not implement Thompson Sampling for Gaussian priors. As far as I know (please note that I'm quite new to multi-armed bandits), the Gaussian distribution is a conjugate prior for a Gaussian likelihood with known variance, so it should be possible to apply Thompson Sampling with Gaussian priors. Would the maintainers be open to adding that as a learning policy? Or is it the case that the optimality of Thompson Sampling is only proved for Beta priors?

Thanks a lot!

bkleyn commented 2 years ago

@jaywonchung, apologies for the slow response.

I think adding the option to use Gaussian priors for Thompson Sampling would be a useful addition.

From an implementation point of view, I don't think it necessarily has to be a new policy as we could also add this as an option to the current Thompson Sampling implementation.

In terms of the algorithm, I think Algorithm 2 (page 4) in these lecture notes should be straightforward to implement and would work well in our current design.
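For reference, the core of that algorithm is only a few lines. Here is a rough sketch, assuming a Normal(0, 1) prior on each arm's mean and unit reward variance; the class and method names below are just placeholders, not MABWiser's actual API:

```python
import numpy as np

# Sketch of Gaussian Thompson Sampling with a Normal(0, 1) prior per arm and
# unit reward variance, along the lines of Algorithm 2 in the lecture notes.
# Placeholder names only -- not MABWiser's actual interface.
class GaussianTS:
    def __init__(self, n_arms, seed=None):
        self.n = np.zeros(n_arms)            # rewards observed per arm
        self.reward_sum = np.zeros(n_arms)   # sum of rewards per arm
        self.rng = np.random.default_rng(seed)

    def predict(self):
        # Posterior of arm i's mean is Normal(sum_i / (n_i + 1), 1 / (n_i + 1))
        post_mean = self.reward_sum / (self.n + 1)
        post_var = 1.0 / (self.n + 1)
        samples = self.rng.normal(post_mean, np.sqrt(post_var))
        return int(np.argmax(samples))

    def partial_fit(self, arm, reward):
        self.n[arm] += 1
        self.reward_sum[arm] += reward
```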

Let us know what you think.

jaywonchung commented 2 years ago

No worries. Thanks for providing info on implementing this!

Actually, in the meantime I implemented something from scratch on my own and am not using mabwiser at the moment. I found it a bit difficult to extend mabwiser "from the outside", i.e. without modifying its source code.

I took a quick look at the lecture notes. It looks like they assume unit variance for each arm's reward. My current implementation lets the user specify a different variance for each arm, and a Normal distribution of the mean reward \mu is updated from observations using the conjugate-prior updating rule I found on Wikipedia. The prior distribution is set to be uninformative (infinite variance). I think fixing the reward variance to some constant would make it susceptible to scale problems in online scenarios where you have no idea about the scale of the reward before observing samples. What do you think?
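Roughly, the update I have in mind looks like the sketch below (not my exact code; `posterior` is just an illustrative helper):

```python
import numpy as np

# Conjugate Normal-Normal update with a known, user-specified reward variance
# sigma2 and a (possibly improper) Normal(prior_mean, prior_var) prior on mu.
# Illustrative helper only, not the exact implementation.
def posterior(n, reward_sum, sigma2, prior_mean=0.0, prior_var=np.inf):
    """Return the posterior Normal(mean, var) of mu after n rewards summing to reward_sum."""
    if np.isinf(prior_var):
        if n == 0:
            return None  # prior is still improper: the arm has to be explored first
        return reward_sum / n, sigma2 / n
    precision = 1.0 / prior_var + n / sigma2
    mean = (prior_mean / prior_var + reward_sum / sigma2) / precision
    return mean, 1.0 / precision
```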

I can create a PR for this when I find free time. I have a paper deadline (April 20th), and I can't really guarantee that I'll be able to do it before then, since I also need to document it well and write tests.

bkleyn commented 2 years ago

You're right that the algorithm in the lecture notes uses an uninformative Normal(0, 1) prior (i.e., zero mean and unit variance for all arms). However, the posterior variance for each arm will depend on the number of decisions (samples) observed. I think this is consistent with the current Thompson Sampling implementation in MABWiser, which starts with a Beta(1, 1) prior that is updated so that the posterior for each arm ends up with a different mean and variance, even though every arm starts from the same distribution.
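For comparison, this is what that looks like in the Beta case for binary rewards (just an illustration of the point, not MABWiser's internals):

```python
# Beta(1, 1) prior plus observed successes/failures: both the posterior mean
# and the posterior variance depend on how many samples the arm has seen.
def beta_posterior(successes, failures):
    a, b = 1 + successes, 1 + failures
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var
```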

No rush on this. Happy to pick it up later if/when you have more time.

skadio commented 2 years ago

@jaywonchung good luck with your submission! That of course takes priority, with or without mabwiser.

I am enjoying the technical brainstorming on this thread, and others might also find it useful. If it works out, especially if it works well in your own experiments/paper, I would encourage introducing a new bandit as suggested.

We have a detailed write-up with a step-by-step intro to adding a new bandit to mabwiser here: https://fidelity.github.io/mabwiser/new_bandit.html. If you go that route, I can see us collaborating on the same PR, which might be fun! :)

Good luck with your paper!

Serdar

jaywonchung commented 2 years ago

@skadio Thanks!! 👍

@bkleyn

I thought the counterpart of the uninformative Beta(1, 1) prior (which assigns identical density to all possible p values) for our case, where the likelihood is Gaussian with unknown mean and known variance, should be Normal(0, inf). Reference: Section 2.7 of https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf. Though this is an improper prior, its posterior becomes proper given at least one observation (with mean equal to the observation and variance equal to the reward distribution's variance). Assuming this makes sense, I think we could still allow the user to specify the mean and variance of the prior distribution of mu, but default to 0 and inf if not given. Initial calls to predict while there are still arms with improper priors should be treated as part of an initial exploration stage, where we must try one random cold arm.
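Concretely, the cold-start handling I have in mind is something like this (rough sketch with a hypothetical `predict` helper, building on the `posterior` sketch above):

```python
import numpy as np

# Hypothetical predict step: arms whose prior is still improper (no observations
# yet) are explored first, one picked at random; otherwise sample from each arm's
# posterior Normal(reward_sum/n, sigma2/n) and take the argmax.
def predict(arms, n, reward_sum, sigma2, rng=None):
    rng = rng or np.random.default_rng()
    cold = [a for a in arms if n[a] == 0]
    if cold:
        return rng.choice(cold)  # forced exploration of a cold arm
    samples = {a: rng.normal(reward_sum[a] / n[a], np.sqrt(sigma2[a] / n[a]))
               for a in arms}
    return max(samples, key=samples.get)
```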

About the unit variance I mentioned in the previous comment, I was actually talking about our reward distribution (Gaussian with unknown mean (mu) and known variance), not the prior distribution for mu. Say, for instance, one use case of this bandit observes rewards [1, 2, 3, 4, 5] (variance 2) and another observes [100, 200, 300, 400, 500] (variance 20000), and this difference in "scale" might not be known when the bandit is constructed (a purely online case where there is no training set). Given this scale problem, forcing the user to use unit variance for the reward distribution seemed suboptimal to me, so we should allow the user to choose the variance of the reward distribution.
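A tiny numeric illustration of the scale issue, using the second set of rewards above with the improper prior but the reward variance hard-coded to 1:

```python
# With the reward variance fixed at 1, an arm whose rewards are in the hundreds
# gets a hugely over-confident posterior after only five pulls.
rewards = [100, 200, 300, 400, 500]
n, s = len(rewards), sum(rewards)
print(s / n, (1 / n) ** 0.5)  # posterior mean 300.0, posterior std ~0.45
# The actual spread of these rewards is ~141, so exploration of this arm
# effectively stops right away.
```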

skadio commented 2 years ago

Hope all is well and your submission went well! Let's archive this issue now for future reference, in case it comes in handy for others. Thank you again for the positive feedback!