VowpalWabbit / coba

Contextual bandit benchmarking
https://coba-docs.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

add context to default interaction terms #39

Closed jonastim closed 1 year ago

jonastim commented 1 year ago

It's fairly risky to change the default behavior of learners, but I've been pulling my hair out wondering why some of my models weren't learning anything with the synthetic data generation, only to realize it was because I had used the default interaction terms instead of my manually provided list.

Is there a reason for omitting the context term?

[screenshot]
mrucker commented 1 year ago

You know I'd noticed the same behavior... I don't have a good explanation for why it helps so much with your test environment...

The reason we exclude it is that, in theory, we don't need it... Some of my collaborators on the VW team have recommended excluding it on some other projects. Internally, VW just does linear regression, so within a given context every action we evaluate shares the same "x", which makes the term irrelevant when VW predicts the expected reward for each action.
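To make the "same x for every action" point concrete, here is a toy sketch (made-up numbers, not VW or coba internals): a pure context term adds the same amount to every action's score, so it can't change which action looks best.

```python
# Toy illustration (not VW/coba code): a context-only term shifts every
# action's score by the same constant, so the argmax over actions is unchanged.
action_scores = {"a": 0.2, "b": 0.7}  # contribution of action / context-action features
context_only = 1.3                    # contribution of a pure 'x' term, identical for all actions

best_without = max(action_scores, key=action_scores.get)
best_with = max(action_scores, key=lambda k: action_scores[k] + context_only)
assert best_without == best_with == "b"
```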

I'm currently running a large experiment on 208 datasets to compare the current default to what you are proposing (I've never actually done this before). We'll know by tomorrow morning which default does best on these datasets. I'm also running each dataset with 50 shuffles so we'll be able to put a pretty tight CI on the difference as well.

Here is the result so far. The green dots are datasets where the current default does better; the blue dots are where the proposed new default does better. Each dot is a dataset, and its position shows the difference in total average reward between the two learners on that dataset.

[image]

mrucker commented 1 year ago

It's been bugging me that your test environment needs 'x'. I figured out why this morning. Your actions should be tuples.

Change: [code screenshot]

To: [code screenshot]
With that change you won't need 'x'. I'm happy to explain what's going on if it doesn't make sense.
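As a rough sketch of the kind of change being suggested (the original screenshots aren't reproduced here, so the surrounding environment code is assumed):

```python
# Hypothetical sketch of the action definitions only, not the full environment code.

# Before: scalar actions. With only action/context-action interactions, every
# feature is multiplied by the action value, so action 0 always produces an
# all-zero feature vector and a predicted reward of 0.
actions = [0, 1]

# After: one-hot tuple actions. Each action gets its own block of
# coefficients, so the 'x' interaction term is no longer needed.
actions = [(1, 0), (0, 1)]
```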

Also, for what it is worth, here are the final results of my experiments last night:

By Dataset: [image]

By Training Example: [image]

jonastim commented 1 year ago

Oh, very interesting! Thanks for doing the analysis! Looks like this change is unnecessary.

I don't quite understand why actions=[0, 1] doesn't work. I could reproduce the learners working without needing the x term both for tuples and for actions=["a", "b"]. I suspect it has something to do with 0/1 being processed in an ordinal way, but I couldn't spot where.

mrucker commented 1 year ago

Yes,

The way the action values are processed is internal to VW so you wouldn't see it anywhere in COBA.

In short, the features of the linear regressor in the old way would look like this (assuming we only have xa):

  1. if action = 0 then features are [0*f1, 0*f2, 0*f3] = [0,0,0]
  2. if action = 1 then features are [1*f1, 1*f2, 1*f3] = [f1,f2,f3]

This means that VW will always predict a reward of 0 for action=0 because [0,0,0] @ [w1,w2,w3] = 0 for any coefficients.
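A quick numeric check of this (standalone Python with made-up feature values and weights, not VW internals):

```python
# 'xa' features with a scalar action: every context feature is scaled by the action value.
context = [0.5, -1.2, 3.0]   # f1, f2, f3
weights = [0.7, 0.1, -0.4]   # any learned coefficients w1, w2, w3

def xa_features(action, context):
    return [action * f for f in context]

for a in (0, 1):
    feats = xa_features(a, context)
    pred = sum(f * w for f, w in zip(feats, weights))
    print(a, feats, pred)  # action 0 -> [0, 0, 0] -> prediction 0 for any weights
```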

When we make the action a tuple, it looks like this to VW:

  1. if action = (1,0) then features are [1*f1, 1*f2, 1*f3, 0*f1, 0*f2, 0*f3] = [f1,f2,f3,0,0,0]
  2. if action = (0,1) then features are [0*f1, 0*f2, 0*f3, 1*f1, 1*f2, 1*f3] = [0,0,0,f1,f2,f3]

With these features it is possible for VW to learn a reward for the first action using the first three coefficients.
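The tuple case in the same toy form (again just an illustration of the feature layout, not VW's actual code):

```python
# 'xa' features with a one-hot tuple action: the flattened outer product of
# action and context gives each action its own block of context features.
context = [0.5, -1.2, 3.0]

def xa_features(action, context):
    return [a_i * f for a_i in action for f in context]

print(xa_features((1, 0), context))  # [f1, f2, f3, 0, 0, 0]
print(xa_features((0, 1), context))  # [0, 0, 0, f1, f2, f3]
```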

It is interesting that simply adding x (i.e., [x, ax]) works, because that means this is what VW saw:

  1. if action = 0 then features are [f1, f2, f3, 0*f1, 0*f2, 0*f3] = [f1,f2,f3, 0, 0, 0]
  2. if action = 1 then features are [f1, f2, f3, 1*f1, 1*f2, 1*f3] = [f1,f2,f3,f1,f2,f3]

In this case the optimal solution could be achieved if VW learned to use the first three coefficients to predict reward for action = 0 and then used the second three coefficients to learn to predict the difference between action 0 and action 1. This seems to be what VW does given that it learns a good policy.
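The same toy layout for [x, ax] with scalar actions (illustrative only):

```python
# [x, ax] features: the shared 'x' block is always present, and the
# action-gated block only appears for action 1, so the model can use the
# first block as a baseline for action 0 and the second block as the
# difference between action 1 and action 0.
context = [0.5, -1.2, 3.0]

def x_ax_features(action, context):
    return context + [action * f for f in context]

print(x_ax_features(0, context))  # [f1, f2, f3, 0, 0, 0]
print(x_ax_features(1, context))  # [f1, f2, f3, f1, f2, f3]
```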

Does that make sense?

mrucker commented 1 year ago

(Oh, and when you use string values for the actions, such as "a" and "b", VW automatically one-hot encodes the actions internally.)

jonastim commented 1 year ago

That makes sense (does it? 😅), thanks for the explanation! I guess we have to be careful to cast actions as strings if their numeric values don't carry particular meaning.