VowpalWabbit / coba

Contextual bandit benchmarking
https://coba-docs.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

IPS convergence #35

Closed jonastim closed 1 year ago

jonastim commented 1 year ago

Checking out the latest code after merging this PR, I realized that the IPS reward estimates for the random policy weren't converging to the expected value.

Concerned that some of the recent changes had introduced an issue, I checked out the commit for which proper convergence was observed, as discussed here (this PR is branched off of that commit).

However, the behavior shown in these screenshots couldn't be replicated, and the issue observed with 5,000 samples remained even with 10-40x the data, as shown in this notebook.

[Screenshot: 2023-02-23 10:34 AM]  [Screenshot: 2023-02-23 11:00 AM]

jonastim commented 1 year ago

@mrucker, I am a bit puzzled by this. The IPS reward was converging in a bunch of runs a couple of days ago but now I can't even replicate that with the commit from back then, tested up to 200k samples. Do you have any idea what might be going wrong here?

mrucker commented 1 year ago

Hmmm, that is concerning. I'm not sure. I just made a small test environment of my own and it seems to be working...

I'll check out your code tomorrow or Thursday and see if any problems jump out at me.

mrucker commented 1 year ago

[image]

This is with 40 actions and binary rewards so random should be 1/40 in the limit.
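
Roughly, the check looks like this (a minimal sketch along the lines of the snippets later in this thread, not the exact script behind the plot):

import coba as cb

# 40 actions with binary rewards, logged by a uniform-random policy and
# replayed offline with IPS reward estimates; random should settle near 1/40
env = cb.Environments.from_linear_synthetic(5000, n_actions=40, n_action_features=0).binary()
env = env.logged(cb.RandomLearner(), rewards="IPS")

cb.Experiment(env, cb.RandomLearner()).run().plot_learners()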

mrucker commented 1 year ago

(I'm assuming the action index thing didn't fix this problem?)

mrucker commented 1 year ago

Alright, I've been playing with this a lot today... I'm 99% sure the problem here has to do with the random number generation...

[image]

Notice that the red line (offline evaluation using IPS) matches the blue line (the online logged learner) when I use a new seed, which is correct: we want the offline estimate to equal the online performance for a random policy. When we use the same seed as the logging policy and replay the logged data in the same order, we get a huge over-estimate (orange). The orange line will never converge because the shared seed makes the replayed learner pick the same random action as the logging policy forever. I think this is what you're seeing now.
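
Here's a rough back-of-the-envelope illustration of the mechanism in plain Python (not Coba's internals): with IPS reward estimates of the form r * 1[replayed action == logged action] / p, a replayed policy that always matches the log receives r / p every single time, which is the inflated orange line.

import random

random.seed(1)

K = 10           # number of actions; uniform logging policy, so p = 1/K
p = 1 / K
n = 100_000

matched, independent = [], []

for _ in range(n):
    logged_action = random.randrange(K)
    reward = 1 if logged_action == 0 else 0   # one "good" action, so the true value of random is 1/K

    # shared seed: the replayed action is always identical to the logged action
    matched.append(reward / p)

    # fresh seed: the replayed action only matches the log about 1/K of the time
    replayed_action = random.randrange(K)
    independent.append(reward * (1 if replayed_action == logged_action else 0) / p)

print(sum(matched) / n)      # ~1.0: the over-estimate (the orange situation)
print(sum(independent) / n)  # ~0.1: the true value of the random policy (blue/red)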

The green line uses the same seed as the logging policy but shuffles the order of replayed interactions. I thought shuffling would be enough to break the correlation due to seeding, but it doesn't look like it. Even with shuffling we still get a fairly over-optimistic IPS estimate over this short time frame. Given enough examples I'm fairly certain the green line would eventually converge to the blue and red lines, which I suspect is what was happening previously when it would converge by 100,000 or 500,000 examples.

So, I need to make a moderately fundamental change to how CobaRandom interacts with learners in order for IPS estimation on coba-generated log data to be unbiased. For what it's worth, this shouldn't impact your production data, because that data was generated by a completely independent stochastic process (in case you wanted to use IPS there). I should have time this weekend to fix it.

jonastim commented 1 year ago

Oh wow, super interesting!

I tried different random seeds for the data generation

rng = cb.CobaRandom(2)  # so the simulation is repeatable
for _ in range(self._n_interactions):
    features = rng.randoms(3)  # three random context features per interaction

and for the learner arguments, e.g. cb.VowpalOffPolicyLearner([1, 'x', 'a', 'ax', 'axx'], seed=5).

[Screenshot: 2023-03-01 3:08 PM]

The results were fairly similar, suggesting that something deeper is going wrong in the random data generation, as you alluded to. Glad to hear this shouldn't affect the production data replay, which unblocks me from sharing it with some partner teams.

mrucker commented 1 year ago

tldr;

Yes, there is a deeper random selection process beyond the seeds in learner init. To fix this I'm going to make the init seeds modify that deeper process in addition to the surface process. Then this problem will be easily solved by changing the seed in the learners.


Exactly, there is a deeper process here. VW doesn't actually pick actions to play. Instead it picks a probability for each action. Changing the seed for VW simply changes the probabilities returned by VW for each action, not the actual action that is played (changing the probabilities may change which action is picked, but it may not). In fact, no out-of-the-box learner in Coba picks which action to play. Every out-of-the-box learner simply generates probabilities for each action each time you call predict.

Once we have the probabilities, the same seed is then used to pick an action. So, with some abuse of notation (the names below are just illustrative, not exact Coba signatures), it's like this:
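
# abuse-of-notation sketch; the names are illustrative, not exact signatures
pmf    = learner.predict(context, actions)   # every built-in learner just returns a probability for each action
action = shared_rng.choice(actions, pmf)     # one shared CobaRandom then turns that PMF into the played action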

In theory each learner is going to produce a different PMF over the actions, so different actions will be chosen, but... this setup leaves a very high likelihood of correlation between learners because, internally, choice works by determining the CDF and then generating a random number in [0,1]. So this deeper process of choosing from the PMFs is what connects all the different learners, even when you change the learners' seeds.

This deeper similarity is actually good when comparing learners online because it produces much lower variance between learners' performance. What we're seeing now, though, is that it becomes a problem offline. So I just have to pull the action selection into each learner so that the seed also picks the action rather than just determining the action probabilities. Then we can easily break this correlation in offline analysis by changing seeds in __init__.
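
Roughly, the shape of that change is something like the sketch below (illustrative only; the wrapper class and its return values are assumptions, not Coba's actual code):

import coba as cb

class SelfSelectingLearner:
    # illustrative sketch: the learner owns its own seeded RNG and picks the action itself
    def __init__(self, learner, seed=1):
        self._learner = learner
        self._rng     = cb.CobaRandom(seed)

    def predict(self, context, actions):
        pmf = self._learner.predict(context, actions)   # probabilities from the wrapped learner

        # invert the CDF with this learner's own draw (the step that used to be shared)
        draw, cdf = self._rng.randoms(1)[0], 0.0
        for action, prob in zip(actions, pmf):
            cdf += prob
            if draw <= cdf:
                return action, prob
        return actions[-1], pmf[-1]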

jonastim commented 1 year ago

Interesting insights! Have you already had a chance to look at a fix for it?

mrucker commented 1 year ago

Yeah, I actually just pushed the fixes. I included a collection of off-policy evaluation improvements with this.

I'll be releasing these fixes to pypi tomorrow as version 6.4.0.

In short, you can fix the old behavior in one of two ways. First, here's the problematic situation. If you run this you will get rwd ~ 1, even though the true value of the random policy here (binary rewards, 10 actions) is about 1/10: the shared seed makes the replayed learner always pick exactly the logged action, so its IPS estimates come out inflated by 1/p = 10.

import coba as cb

shared_seed = 1

env = cb.Environments.from_linear_synthetic(1000, n_actions=10, n_action_features=0).binary()
env = env.logged(cb.RandomLearner(),seed=shared_seed,rewards="IPS") # turn it into logged interactions

cb.Experiment(env,cb.RandomLearner()).run(seed=shared_seed).plot_learners()

To fix the above code you can either explicitly use different seeds (you'll want to be sure to do this if you're making logs from an Experiment):

import coba as cb

seed1 = 1
seed2 = 2

env = cb.Environments.from_linear_synthetic(1000, n_actions=10, n_action_features=0).binary()
env = env.logged(cb.RandomLearner(),seed=seed1,rewards="IPS")

cb.Experiment(env,cb.RandomLearner()).run(seed=seed2).plot_learners()

Or you can shuffle (I fixed the bug where shuffling previously didn't break the correlation), in which case the seed doesn't matter:

import coba as cb

shared_seed = 1

env = cb.Environments.from_linear_synthetic(10_000, n_actions=10, n_action_features=0).binary()
env = env.logged(cb.RandomLearner(),seed=shared_seed,rewards="IPS").shuffle(shared_seed)

cb.Experiment(env,cb.RandomLearner()).run(seed=shared_seed).plot_learners()

mrucker commented 1 year ago

(sorry for the bit of delay on this, I've been working on wrapping up another paper for grad school.)

(Also, Experiment().run() and the Logged filter didn't previously accept a seed, so this wasn't possible before.)

jonastim commented 1 year ago

Awesome, thank you very much! Just confirmed the different seeds approach in my experiment.

[Screenshot: 2023-03-14 8:46 AM]

mrucker commented 1 year ago

Yay! The moving window view is interesting, isn't it? It can really show a ton of hidden performance variation. Also, it looks like you're using out=None to jam two plots together. You might already know this, but if you don't: you can manually specify a list of colors or a starting color index so that the two separate plots have different line colors.
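
Something along these lines, for example (a hypothetical sketch of that pattern; the colors keyword name is an assumption based on the description above, so check the plot_learners signature):

# hypothetical sketch: overlay two result sets on one figure
result_a.plot_learners(out=None, colors=0)   # out=None is what jams the two plots together
result_b.plot_learners(colors=3)             # a different starting color index keeps the lines distinct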