VowpalWabbit / coba

Contextual bandit benchmarking
https://coba-docs.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

bandit replay environment #32

Closed by jonastim 1 year ago

jonastim commented 1 year ago

Capability to replay bandit observations from a pandas data-frame.

jonastim commented 1 year ago

This is still a work in progress, and I'll make the interface more consistent with the other environments.

Raised this to discuss a couple of questions:

d = pd.read_csv("lambda_logs.csv")  # 5000 interactions exported from the lambda simulation

environments = Environments([BanditReplay(d, take=2000, actions=[0,1])]).shuffle(n=4)
learners = [
    VowpalEpsilonLearner(features=[1, 'x', 'a', 'ax']),
    VowpalSoftmaxLearner(features=[1, 'x', 'a', 'ax']),
    VowpalBagLearner(features=[1, 'x', 'a', 'ax']),
]
learners.append(RandomLearner())

result = Experiment(
    environments,
    learners,
    evaluation_task=SimpleEvaluation(record=['reward','probability','action','context', 'ope_loss'])
).run()

2023-02-07 12:00:58 -- pid-50779  -- Processing chunk...
2023-02-07 12:00:58 -- pid-50779  --   * Recording Learner 0 parameters... (0.0 seconds) (completed)
2023-02-07 12:00:58 -- pid-50779  --   * Recording Learner 1 parameters... (0.0 seconds) (completed)
2023-02-07 12:00:58 -- pid-50779  --   * Recording Learner 2 parameters... (0.0 seconds) (completed)
2023-02-07 12:00:58 -- pid-50779  --   * Recording Learner 3 parameters... (0.0 seconds) (completed)
2023-02-07 12:00:58 -- pid-50779  --   * Recording Environment 0 statistics... (0.0 seconds) (completed)
/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_utils.py:606: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for item in s.iteritems():
2023-02-07 12:02:24 -- pid-50779  --   * Peeking at Environment 0... (86.54 seconds) (completed)
2023-02-07 12:02:25 -- pid-50779  --   * Evaluating Learner 0 on Environment 0... (0.42 seconds) (completed)
2023-02-07 12:03:05 -- pid-50779  --   * Peeking at Environment 0... (40.45 seconds) (completed)
2023-02-07 12:03:05 -- pid-50779  --   * Environment 0 has nothing to evaluate (this is likely due to having too few interactions).
2023-02-07 12:03:05 -- pid-50779  --   * Recording Environment 1 statistics... (0.0 seconds) (completed)
2023-02-07 12:03:08 -- pid-50779  --   * Peeking at Environment 1... (2.64 seconds) (completed)
2023-02-07 12:03:08 -- pid-50779  --   * Environment 1 has nothing to evaluate (this is likely due to having too few interactions).
Screenshot 2023-02-07 at 12 21 28 PM

@mrucker

mrucker commented 1 year ago

Yo, I'm making a ton of assumptions looking at your stub. Just in case: if the idea is to "replay" logged data from an experiment, there's similar existing functionality in the logged filter for environments. It may not be what is needed, but I just wanted to make sure you knew about it. It's only like 1 month old so it isn't documented anywhere.

Here's a quick example.

import coba as cb

env = cb.from_openml(150).logged(cb.RandomLearner()).save("save_to_disk_so_we_only_need_to_create_once.zip")
lrn = cb.VowpalEpsilonLearner()

off_policy_result = cb.Experiment(env,lrn).run()
on_policy_result = cb.Result.from_logged_envs(env) #In this example this is the on_policy result for RandomLearner
mrucker commented 1 year ago

ha ha jinx

jonastim commented 1 year ago

if the idea is to "replay" logged data from an experiment

Cool, I'll have a look at those. I am starting with replaying the interactions of an experiment for simplicity but the main goal is to replay interactions from a contextual bandit that's running in production to improve upon its performance with testing out different hyper-parameters and algos.

mrucker commented 1 year ago

Yeah, the old LoggedEnvironment is now the logged filter. I need to update the read-the-docs documentation.

Looking at your code I think the problem is that the iterator returned by iterrows() is not re-iterable.

So the first learner iterates over them, and when the next three learners go to read there's nothing left in the iterator.

What about something like this?

class BanditReplay(Environment):
    def __init__(self,
                 df: DataFrame,
                 take: Optional[int] = None,
                 actions: Optional[List[Any]] = None):
        self._df = df
        self._actions = actions
        self._take = take

    def read(self) -> Iterable[LoggedInteraction]:
        for _index, row in (self._df[:self._take] if self._take is not None else self._df).iterrows():
            yield LoggedInteraction(
                context=row['context'],
                action=row['action'],
                reward=row.get('reward'),
                probability=row.get('probability'),
                actions=row.get('actions', self._actions)
            )
mrucker commented 1 year ago

Maybe even take it one step further:

class BanditReplay(Environment):
    def __init__(self,
                 df: DataFrame,
                 take: Optional[int] = None,
                 actions: Optional[List[Any]] = None):
        self._df = df
        self._actions = actions
        self._take = take

    @property
    def params(self):
        #These will be written to the environments table so you can know which environment a result is from
        return {'replay_default_actions': self._actions, 'take': self._take}

    def read(self) -> Iterable[LoggedInteraction]:
        for interaction in self._df[:self._take].to_dict(orient='records'):
            if 'actions' not in interaction:
                interaction['actions'] = self._actions
            yield interaction

LoggedInteraction is nothing special. If you go look at it, it's just a dict. It's only there for documentation purposes. Nothing checks for the LoggedInteraction type, so you can simply turn your rows into dicts and pass them through. Internally coba knows something is a LoggedInteraction based on whether it has action/reward entries. Oh, that's not quite right; I forgot that LoggedInteraction turns rewards into IPS rewards for off-policy evaluation.
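
For illustration, something like this (a rough sketch, not coba's exact internals; the helper name is made up) is all a logged interaction really needs, with the 'rewards' list holding per-action IPS estimates built from the logged action/reward/probability:

def row_to_logged_interaction(row, default_actions):
    # Sketch only: build a dict-style logged interaction whose 'rewards'
    # entry holds per-action IPS estimates of the reward.
    actions = row.get('actions', default_actions)
    action, reward, prob = row['action'], row['reward'], row['probability']
    ips_rewards = [float(a == action) * reward / prob for a in actions]
    return {
        'context': row['context'],
        'action': action,
        'reward': reward,         # used for off-policy learning
        'probability': prob,
        'actions': actions,
        'rewards': ips_rewards,   # used for evaluation
    }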

jonastim commented 1 year ago

I'll tinker around a bit more with the iterator for the class to also support data sources other than data-frames, e.g. large files, but for now your suggestions unblocked experimenting with the replay feature.
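
Something like this lazily streaming variant is what I have in mind for large files (just a sketch; the class name and parsing details are placeholders):

import csv
from typing import Iterable, Optional

class CsvBanditReplay:
    def __init__(self, path: str, take: Optional[int] = None):
        self._path = path
        self._take = take

    def read(self) -> Iterable[dict]:
        # Stream rows lazily instead of loading a full data-frame into memory.
        with open(self._path, newline='') as f:
            for i, row in enumerate(csv.DictReader(f)):
                if self._take is not None and i >= self._take:
                    break
                yield row  # numeric/list columns still need parsing, e.g. with ast.literal_eval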

Unfortunately, the performance for the off-policy learned algos is quite a bit worse than the ones learned on the lambda simulation that generated the data (while I hoped for the opposite).

Lambda sim that generated the CSV / data-frame:

Screenshot 2023-02-07 at 3 56 11 PM

Off-policy evaluation:

Screenshot 2023-02-07 at 3 56 21 PM

I was wondering if the unnormalized rewards were affecting VW's learning (which I've seen before)

Screenshot 2023-02-07 at 3 56 41 PM

but clamping the IPS rewards with kwargs['rewards'] = [min(int(a==action)*reward/probability, 2) for a in actions] made things worse. Maybe a more sophisticated scaling would still be worth a try.
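
For reference, the "more sophisticated scaling" I have in mind would be something like capping the IPS estimates at a data-driven quantile instead of a hard constant (sketch only; the helper names are made up):

import numpy as np
import pandas as pd

def ips_cap(df: pd.DataFrame, q: float = 0.95) -> float:
    # One cap computed over the whole log, e.g. the 95th percentile of reward/probability.
    return float(np.quantile(df['reward'] / df['probability'], q))

def capped_ips_rewards(actions, action, reward, probability, cap):
    return [min(float(a == action) * reward / probability, cap) for a in actions]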

Do you have a suspicion about what might be going wrong here?

mrucker commented 1 year ago

Interesting. I'm not sure... For what it is worth, 'rewards' in a LoggedInteraction is only used for evaluation... VW is learning from 'reward'.

That is, in SimpleEvaluation off-policy learning looks something like this (note there is no 'rewards'):

learner.learn(interaction['context'],interaction['action'], interaction['actions'], interaction['reward'], interaction['probability'])

While evaluation is something like (note there is 'rewards' here):

action = learner.predict(interaction['context'], interaction['actions'])
out['reward'] = interaction['rewards'].eval(action) #In your example this is the IPS reward you clamped

This means that if you want a more accurate estimate of how the off-policy learners are doing you can also record all the rewards from the original lambda sim. Then your policy evaluation estimate will have much much lower variance than IPS. I know some people might say that in production we don't often have all the rewards to do that, but when we're working in simulation land to understand the learners I'd argue it is a fair thing to do.

One last idea, if what you really want is to evaluate VW as an off-policy learner then I suggest using the VowpalOffPolicyLearner. In your example I see bag, softmax and epsilon-greedy. The problem with those is that they're choosing exploration actions during evaluation but there is no reason to explore in off-policy learning because exploring doesn't change your training example. The VowpalOffPolicyLearner doesn't do any exploration, it always picks greedy, which is fine since our choice doesn't determine the training example we see. Internally VW doesn't use IPS, we pass it both the reward and the probability and it does this https://arxiv.org/abs/1011.1576 (though it is possible to force it to do IPS if you really want it to). So, I don't think it'd be a normalization problem.
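
For example, in your earlier experiment it could just sit alongside the exploring learners (a sketch with default constructor arguments):

learners = [
    VowpalOffPolicyLearner(),   # greedy, no exploration; fine for off-policy learning
    VowpalEpsilonLearner(features=[1, 'x', 'a', 'ax']),
    RandomLearner(),
]
result = Experiment(environments, learners).run()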

Even with all that said, I still wouldn't expect your off policy vw learners to do worse than random... I'm not sure what is going on there.

One final note, you can scale the rewards of an environment with something like this. The example below will scale the rewards so all reward values are in [0,1].

Environments([my environments]).scale(shift="min",scale="minmax",targets='rewards')

The above example only scales the rewards not the features so the correct function coefficients will change. You can also target the context as well if you'd like to scale the context features in a similar manner. Or you could get crazy and chain this to get rewards in [0,1/2]. Etc. Etc.

Environments([envs]).scale(shift="min",scale="minmax",targets='rewards').scale(shift=0,scale=1/2,targets='rewards')
jonastim commented 1 year ago

Thank you for the detailed explanations! I realized that the last run was using observations that were heavily biased towards one action. Rerunning with a more balanced data-set changed the order of model performance a bit. The replay learners were still significantly worse than the model that generated the observations and the VowpalOffPolicyLearner wasn't better than the ones with exploration.

I also tried logging and replaying the rewards. Something in the code is buggy, which causes reward and rewards to be interleaved under the same key. Changing if record_rewards: out['rewards'] = rewards._values to if record_rewards: out['xxx'] = rewards._values resolved it. There seems to be some kind of string matching going on.

rewards_interleaving_2

reward_interleaved

Performance was actually a bit lower when the rewards were included in the logged interactions even though rewards are now 0/1 as expected.

Screenshot 2023-02-08 at 3 29 08 PM Screenshot 2023-02-08 at 4 19 02 PM Screenshot 2023-02-08 at 4 14 21 PM Screenshot 2023-02-08 at 3 29 57 PM

One thing that stands out is how the average action for many learners is around 0.2, showing a clear bias towards action 0 even though the expected value is around 0.5 and the source data-set has an even distribution.

Screenshot 2023-02-08 at 4 59 00 PM

Mainly wanted to give an update. I'll keep digging and appreciate any advice you might have. I also pushed the latest code in case you want to poke around yourself.

mrucker commented 1 year ago

This is great! Yeah I'll dig in today too. The idea that things might be off is hugely concerning... I've written hundreds of unit tests trying to guard against bugs like this but here we are. I sometimes describe writing statistical software as trying to squeeze a slippery bar of soap.

One quick thought off the top of my head: about a month ago I made a pretty fundamental change to how discrete actions work. I'd completely forgotten about it until now. I don't think it is the cause here, but... just in case, the action that is logged by SimpleEvaluation is the played action's index and not its features. In this case, where the action features are [0,1], it isn't noticeable. The reason for this change is that in some environments people want the actions to have features. When actions have features we may want to scale or change those features while keeping the reward for the action unchanged. Referring to discrete actions by index in SimpleEvaluation means that we're agnostic to any potential transformations to an action's feature representation.
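
So if you ever need the played action's features when replaying, something like this should recover them (assuming the logged 'action' really is an index into 'actions'):

def played_action_features(row):
    # row['action'] is the index SimpleEvaluation logged, not the action's features
    return row['actions'][row['action']]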

mrucker commented 1 year ago

Ok, just finished looking through the code and have a couple of things here.

First, why the interleaving:

  1. A long, long time ago we called "reward" "rewards". Eventually we realized we wanted to have logged data, and so "rewards" became "reward" and "rewards" became the logged rewards for replay. Confusing enough?
  2. In order to support backwards compatibility we treat both "rewards" and "reward" as the same column in results. It hasn't been a problem because we've never tried to write 'rewards' in results. I think at this point that backwards compatibility has likely outlived its usefulness. You could remove lines 400-404 in result.py and the interleaving should go away.

Now for the ['rewards'] handling. There are a few complications here:

  1. Coba represents ['rewards'] as functions. The reason for this is that some researchers use coba to do research with continuous actions and use reward functions like l1-loss and l2-loss. In the discrete case rewards is almost always an instance of 'SequenceReward', which is why it has that _values.
  2. In the "logged" filter we actually pickle the entire interaction. So, we take each interaction, add ['action','reward','probability'], and then pickle the dictionary which preserves 'rewards' even when working with continuous actions.
  3. Your code looks like it should be handling the rewards correctly though both when writing and reloading them in your environment.

Maybe a useful sanity check? You could use the example code below but replace the linear synthetic environment with your lambda environment (and keep all the other environment filters like shuffle and logged). Does the off-policy beat the on-policy in that case?

import coba as cb

environments = cb.Environments.from_linear_synthetic(1000, n_action_features=0).shuffle([1,2,3]).logged(cb.VowpalEpsilonLearner())

on_policy_result = cb.Result.from_logged_envs(environments)
off_policy_result = cb.Experiment(environments, cb.VowpalOffPolicyLearner()).run()

on_policy_result.plot_learners(colors=[0], labels=['on-policy'],out=None)
off_policy_result.plot_learners(colors=[1], labels=['off-policy'])
mrucker commented 1 year ago

Ah, I found your custom environment in the notebooks.

Here is what I get using my "sanity check" code above and your custom environment.

image

jonastim commented 1 year ago

I was able to reproduce your sanity check results but when replacing the linear synthetic with my replay environment the same bad results from before are returned.

Screenshot 2023-02-09 at 3 30 56 PM

When saving the linear synthetic results as CSV and replaying it the performance is also bad:

Screenshot 2023-02-09 at 3 55 25 PM Screenshot 2023-02-09 at 3 57 35 PM

Seems like off-policy learning is working in general but there's some issue with the data replay.

mrucker commented 1 year ago

Yeah, something is definitely off. For example, random should be .33.

That is, E[R|a1] = .5, E[R|a2] = .13 and we know the random policy plays a1,a2 evenly so we should expect:

1/2 E[R|a1] + 1/2 E[R|a2] ≈ .32 is the approximate average reward of the random policy.

Alternatively, we can ask ourselves what is the best we can expect from a perfect policy.

Here's the calculation for mean expected reward from 10_000 perfect actions:

np.mean(np.clip(np.random.rand(10_000) - .5*np.random.rand(10_000) -.25,a_min=.5,a_max=None))

The expression being clipped is the calculation for E[R|a2]. The a_min is .5 because if a2's expected reward is ever below .5 we should play a1.

This gives an average reward of .505 if we play a perfect policy for 10_000 actions, which is more or less what VW is getting. So VW getting .5 is more or less perfect. That is, correctly identifying when to play a2 only increases our expected reward by .005.

I played around a little bit and if we reduce E[R|a1] down to .1 then the problem becomes a little more interesting. Best constant policy (i.e., only play a2) has .13 expected reward, random policy has .115 expected reward and best contextual policy has expected reward of .19.

All that to say, the problem with your plots seems to mostly be that random is way above .32 and not that VW is at .5 because that is more or less optimal. Here's my full code using coba's built-in replay functionality where we get the values the math suggests we should. That is, a perfect policy should get .19 while a random policy should get around .115 because I defined E[R|a1] := 1/10.

import matplotlib.pyplot as plt
import coba as cb
import numpy as np

class CustomEnvironment(cb.LambdaSimulation):
    def __init__(self, n_interactions):
        super().__init__(n_interactions, self.context, self.actions, self.rewards)
        self.r = cb.CobaRandom(1)

    def actions(self, index, context):
        return [0, 1]

    def context(self, index):
        return {
            "feature_1": self.r.randoms(1)[0],
            "feature_2": self.r.randoms(1)[0],
        }

    def rewards(self, index, context, action) -> float:

        reward_probabilities_for_actions = np.clip([
            0.1,
            context["feature_1"]-.5*context["feature_2"]-0.25, # max of .75, min of -.75, Avg of 0
        ],a_min=0, a_max=1)

        return np.random.binomial(1, reward_probabilities_for_actions[action])

    def __reduce__(self):
        return (CustomEnvironment, (self._n_interactions,))

if __name__ == '__main__':
    environments = cb.Environments(CustomEnvironment(2000)).shuffle(n=30).logged(cb.VowpalEpsilonLearner())

    off_policy_result = cb.Experiment(environments, [cb.VowpalOffPolicyLearner(),cb.RandomLearner()]).config(processes=2).run()
    on_policy_result  = cb.Result.from_logged_envs(environments)

    on_policy_result .plot_learners(xlim=(10,None), colors=[0], labels=['on-policy']          , err='se', out = None)
    off_policy_result.plot_learners(xlim=(10,None), colors=[1], labels=['off-policy','random'], err='se', out = None)
    plt.legend(loc='lower right')
    plt.show()
mrucker commented 1 year ago

I never figured out what the problem was, and my changes to BanditReplay made it less generalized, but at least things seem to be working now. You should be able to add generalization back from here, and this pattern should additionally work with any data you might already have saved outside of coba.

The working example is in bandit_replay_mr.ipynb

jonastim commented 1 year ago

This is fantastic 🎉 (albeit a bit puzzling). Thanks a lot, Mark! I'll play around with it some more and try it on some logged production data.

One thing that's not quite clear to me is why the offline Softmax does worse than the online:

Screenshot 2023-02-10 at 12 27 13 PM

and more generally how we should go about model evaluation.

My goal is to compare different models and their hyper-parameters on the replayed log-data and then replace the current production model with the best one. Would you recommend evaluating model parameters like interaction terms and learning rate with the VowpalOffPolicyLearner and then using those settings for models with exploration? In the next step, for finding the model with the best exploration strategy, would it be reasonable to then run the exploring models against the same replayed data to find the best strategy and hyper-parameters (epsilon, lambda, etc.)?

mrucker commented 1 year ago

Yeah, I didn't see any obvious bugs so I just started simplifying/cleaning the code. Eventually it just started working.

Regarding the difference in online and offline performance, it's actually just an artifact of the online learner being evaluated on five environments in this experiment while the offline learners are only being evaluated on a single environment. If you run the plotting code below you'll see that the offline and online softmax have identical performance (i.e., the two lines are on top of each other) when we only plot the online learner's performance for the first environment.

#Here we filter to just the first environment and plot it
online_result.filter_env(environment_id=0).plot_learners(labels=['online'],colors=[0],out=None)
#Here we filter down to just the offline softmax learner so the other learners don't distract
offline_result.filter_lrn(learner_id=2).plot_learners(labels=['offline softmax'],colors=[1])

Alternatively you could remove the .shuffle(n=5) from the online experiment or turn all 5 environments into BanditReplay environments so the offline experiment uses all 5 environments from the online logged data instead of just the first one.

mrucker commented 1 year ago

And as far as evaluation on something like logged production data, where we actually don't know all the rewards, I don't have as much experience there. I know all the standard techniques and estimator theory but haven't really worked on off-policy estimation a ton yet myself. I guess if you're using VW you have the OPE. VW works with loss, which is why all your OPE values are negative. If you multiply them by -1 that'll give you the off-policy reward estimate (i.e., they'll align better with the coba plots).

mrucker commented 1 year ago

My goal is to compare different models and their hyper-parameters on the replayed log-data and then replace the current production model with the best one.

For hyperparameter search I'm sure you've seen that coba stores all learner hyperparameters in result along with the rest of the performance data. So, it's easy to do something like a grid search and create a list of a few hundred learners and throw them all into a single experiment. That might take a while to finish so you can also tell it to write to disk as it goes (to back up the work) and use multiple cores (to speed things up).
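
A sketch of what that could look like (the epsilon keyword and the result-file argument to run() are from memory, so double check them against the current API):

from itertools import product

learners = [
    VowpalEpsilonLearner(epsilon=e, features=f)
    for e, f in product([0.025, 0.05, 0.1], [[1, 'x', 'a', 'ax'], [1, 'x', 'a']])
]
result = Experiment(environments, learners).config(processes=4).run('grid_search.log')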

Would you recommend to evaluate the model parameters like interaction terms and learning rate with the VowpalOffPolicyLearner learner and then use those settings for models with exploration?

Yeah, I think that is a great idea. I'm sure you know, but most CB algorithms are premised on the assumption of realizability. So I often start designing new models by first confirming my functional form can realize a good policy given the data before doing anything else. Nothing worse than spending days trying to figure out why you aren't learning only to realize your function can't approximate the concept you want.
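
In practice that might look like comparing a couple of functional forms with only the greedy off-policy learner before adding any exploration (a sketch; the features argument mirrors your earlier learners, assuming VowpalOffPolicyLearner accepts it):

learners = [
    VowpalOffPolicyLearner(features=[1, 'x', 'a']),        # linear in context and action
    VowpalOffPolicyLearner(features=[1, 'x', 'a', 'ax']),  # adds context-action interactions
]
Experiment(environments, learners).run().plot_learners(err='se')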

In the next step, for finding the model with the best exploration strategy would it be reasonable to run them then against the same replayed data to find the best strategy and hyper-parameters (epsilon, lambda, etc)?

Yeah, I think that'd be fine. You should be fairly protected against too-optimistic a performance estimate if you're using progressive validation, which is what coba/VW uses by default (i.e., https://dl.acm.org/doi/pdf/10.1145/307400.307439). It still probably wouldn't be a bad idea to keep a hold-out test set, I guess. Also, I think it is fairly standard to not shuffle production replay data as a whole. This decreases potential data re-use, but what you gain is not erasing any covariate shift or concept drift which might be present in the data.
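
Concretely, that might mean splitting the logged data-frame in time order rather than shuffling it (the 80/20 split and variable names are arbitrary):

split = int(len(d) * 0.8)
train_envs = Environments([BanditReplay(d.iloc[:split], actions=[0, 1])])  # note: no .shuffle()
test_envs  = Environments([BanditReplay(d.iloc[split:], actions=[0, 1])])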

jonastim commented 1 year ago

Thanks a lot for the helpful advice! 🙏

I'm sure you know, but most CB algorithms are premised on the assumption of realizability. So I often start designing new models by first confirming my functional form can realize a good policy given the data before doing anything else. Nothing worse than spending days trying to figure out why you aren't learning only to realize your function can't approximate the concept you want.

What's the best way to test for that? We've looked at VW's OPE loss metrics, feature importance and convergence per context before, but I'm wondering if there are better methods.

For the PR, I am planning to clean it up to get it into a mergeable state.

mrucker commented 1 year ago

I think your plan of making sure VowpalOffPolicyLearner works before considering any exploration will show whether you have realizability. My comment was simply a long-winded way of me saying I think your plan is good :). You're basically turning CB into a supervised learning problem and making sure it works before worrying about exploration.

And I think you're good using VW's OPE. VW uses a doubly robust loss estimate, which is more or less considered the best you can do. If you have any interest, here is the call in VW and here is the implementation in VW.

Finally, I'm not sure what your needs are but here are two new CB algorithms that I published fairly recently with the VW team:

  1. https://github.com/mrucker/emt_experiments
  2. https://github.com/mrucker/onoff_experiments (this one is not even out yet, I'll be putting it on arxiv later this week)

Both of the algorithms are written using the Coba interface, so they should be simple drop-in implementations.

Oh and finally finally, I'm also looking for a summer internship (this is my final summer before graduating). If your team or any other team you know of is looking for a summer intern or has a summer project I'd love to talk with them. I looked on the official website and it doesn't look like any internships are listed that I'd be appropriate for.

jonastim commented 1 year ago

Finally getting back to this!

I think the previous issues with reading from CSV might have been related to some data structures having been loaded as strings rather than their native data types.

When adding CSV serialization to your notebook and this converter

import ast
import pandas as pd

df_csv = pd.read_csv(file_name, converters={column: ast.literal_eval for column in
                                            ['context', 'action', 'actions', 'probability', 'reward', 'rewards']})

results were identical.

I am trying to understand one last thing before I should be able to wrap this up. In bandit_replay_mr_ips.ipynb (previous run) removing the rewards column tests the IPS estimate of LoggedInteractions. The reward for the random policy operating on the IPS estimates looks pretty off. At the bottom of the notebook I summarized my findings, and I am wondering what's the best way to address this. Also, the VW policies look fine, but I am worried that the accumulated reward for them is also unreliable. Maybe the OPE loss addresses that concern, since VW uses a different estimator, but for the random policy we have to rely on the average reward.

Screenshot 2023-02-23 at 8 19 23 AM Screenshot 2023-02-23 at 8 06 01 AM
jonastim commented 1 year ago

I think the issue above is mostly related to a limited sample size. See this screenshot for a 10x larger sample.

Screenshot 2023-02-23 at 10 34 22 AM
mrucker commented 1 year ago

Nice, that makes sense. IPS is what is known as a high-variance unbiased estimator. So, given a few samples it will jump around wildly between 0 and +infinity (high-variance). However, as you get more and more data it will converge to the true value (unbiased). Alternatively, you could also do fewer samples but with a whole bunch of shuffles and see the same effect. So, something like take=1000 with shuffle(n=100).
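
Here's a tiny self-contained illustration of that point (plain numpy, not coba): evaluating an "always play arm 0" policy with IPS from data logged by a uniform-random policy.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000
true_means = np.array([0.5, 0.13])              # E[R|a0], E[R|a1], as in the earlier example
logged_a = rng.integers(0, 2, n)                # uniform logging policy, probability .5 each
r = rng.binomial(1, true_means[logged_a])
ips = (logged_a == 0) / 0.5 * r                 # per-interaction IPS estimates: either 0 or 2
print(ips.mean())                               # unbiased for 0.5, but noisy for small n
print(ips.std(ddof=1) / np.sqrt(n))             # the standard error shrinks like 1/sqrt(n)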

When you manually record rewards what you get is a zero-variance unbiased estimator. That is, we know exactly what reward we would have gotten had we played those actions without needing to estimate anything (zero-variance).

If you have a small amount of data there are a couple of ways to lower the variance of the off-policy estimator. As I mentioned, VW is using an estimator known as doubly robust estimation that has much lower variance while still being unbiased. So, VW estimates should converge to the true off-policy value a lot more quickly (you could test this using coba just to make sure; I've never actually confirmed it myself). Currently coba doesn't really make any effort to reduce the variance of its OPE estimates because it'd require an interface change and a little work and no one has had a need for it.

(if you wanted to add doubly robust estimation to coba I could give you some guidance. It isn't too hard and might be useful if you guys wanted to use learners not implemented in VW. However, if you only ever think you'll be using VW it probably wouldn't be worth it.)
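
For reference, the doubly robust estimate for a single logged interaction looks roughly like this (f_hat is a learned reward regressor and pi gives the target policy's action probabilities; both are placeholders):

def dr_estimate(context, actions, logged_action, logged_reward, logged_prob, f_hat, pi):
    # direct-method term: expected regressor reward under the target policy
    dm = sum(pi(context, a) * f_hat(context, a) for a in actions)
    # IPS-weighted correction at the logged action keeps the estimate unbiased
    residual = logged_reward - f_hat(context, logged_action)
    return dm + pi(context, logged_action) / logged_prob * residual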

mrucker commented 1 year ago

(oh and sorry if I broke your custom Result.... the old code used this whole "_packed" thing which was super confusing and not as fast as it could be. I updated the Table class this last week to work much more like a traditional table and in the process also made it quite a bit faster when working with large amounts of results.)

jonastim commented 1 year ago

No worries! Can you have a look at the dataframe loading code added today? I haven't fully wrapped my head around pipe, filter and how they relate to environments, but in the dataframe_source notebook example it seems to be working fine.

mrucker commented 1 year ago

Looks good. I made one biggish change, improved some comments while I was looking around, and added a few unit tests.

The big change I made was to move the DataFrametoInteraction filter into the environments module and rename it to SimpleEnvironment... Feel free to change the name back. The really important thing was moving it. I've tried to keep the pipes module separate from the "business logic" in coba so that pipes could potentially be pulled out in the future for other projects. By business logic I mean specific implementations that are important to contextual bandit research but not necessarily a job pipeline like the pipes module.

Once you have a chance to make sure things still look good to you and still work for your use case, let me know and I'll merge it in.

mrucker commented 1 year ago

After thinking about it all day yesterday I reverted the name back to your original name and moved it into coba.environments.filters. The idea of calling it something like SimpleEnvironment had more to do with a long-term goal of simplifying and unifying the growing collection of methods for creating environments. That's not what this pull request is about though.

jonastim commented 1 year ago

Thanks for the great improvements and adding tests! I wanted to run the design by you first before getting to that, and now I don't have to 😄 In the last commits I limited rewards recording to discrete actions and cleaned up the data source examples.

Feel free to merge the PR if things look good to you. It would be great if you could cut a release with these changes, so that I can give it to some data scientists to play with.

jonastim commented 1 year ago

I would also be curious about your take on some behavior I've observed on some logged real-world data. The VowpalOffPolicyLearner has fairly consistently performed worse than the exploring learners or even random. Would you attribute this to noise / there not being anything to learn in the data, given that the average reward is within 1% of random and that reward and OPE loss don't agree on the best model?

Screenshot 2023-02-24 at 11 41 32 AM Screenshot 2023-02-24 at 11 41 59 AM

I am still trying to figure out the best way to determine the "learnability" of a problem from its logged data and from there optimize the hyper-parameters.

mrucker commented 1 year ago

hmmm.... I assume this is using VW's OPE reward?

My best guess is that it has something to do with bias in the doubly robust estimator.

The doubly robust estimator combines IPS with a learned regressor. So, VW is kind of double dipping. When you pass it back a context-action-reward-probability example to learn from, it uses that example both to train the bandit policy and to learn a regressor f_hat(x,a) = r which is used in OPE.

If for some reason f_hat(x,a) consistently differed from the greedy policy learned by the bandit learner, you'd see an improvement with exploration... I can't say for sure that is happening, but given that both are being trained using the exact same data there is certainly a higher probability that f_hat(x,a) would have some correlation with the bandit learner. I should also say the f_hat(x,a) used by the doubly robust estimator isn't used by the bandit learner in any way, so this isn't "hurting" the learner performance; it just might explain why the off-policy learner appears to be doing worse (this again is just a hypothesis).

Given how much data it looks like you have, it could also be the case that the off-policy learner is converging early and then the underlying data has some kind of domain or concept shift which makes the converged greedy policy bad. I think by default VW decays its learning rate so that by the time you reach something like 200,000 examples it barely updates anymore. It might be easier to see if this is a problem by using something like .plot_learners(span=<n>), which rather than calculating the average performance from the beginning of time only reports the moving average of the last <n> interactions. Domain/concept shifts become a lot more noticeable with smaller windows.
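
For example (the window size is arbitrary):

off_policy_result.plot_learners(span=5_000)   # moving average over the last 5,000 interactions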

mrucker commented 1 year ago

(Also, I just released. They want version 6.2.12.)

jonastim commented 1 year ago

Great insights, thank you!

The reward is from the actual reward function for a deployed (real world) VW model.

I'll look at the performance for different time periods and play around with the learning rate and its decay (-l and --power_t) to see if that helps with potential drift over the one-month period.

Do I understand correctly that the DR estimator is only used for the OPE loss calculation but not the actual learning of the model, so potentially the model is better than its performance estimate? For an offline analysis of different models on the logged data you would still need to rely on the IPS reward estimate or VW's OPE loss (or a combination of both) to decide which is the most promising model to ship, right?

mrucker commented 1 year ago

Do I understand correctly that the DR estimator is only used for the OPE loss calculation but not the actual learning of the model, so potentially the model is better than its performance estimate?

Yes, so if, in theory, there is something "wrong" with the DR estimator, the off-policy learner might be optimizing well but the DR estimator makes it look worse. (Or vice versa: if there is something wrong with the learner, such as a learning rate that is too small, and not with DR, you'd again see issues with performance.) It's kind of like optimizing for MSE in linear regression but then evaluating using maximum likelihood. In theory, if everything is working correctly, optimizing for one should optimize for the other, but things happen.

VW actually makes the distinction between training and evaluation estimators a little more confusing because it allows you to choose to use DR both to train a bandit learner and to evaluate the bandit learner's performance (see --cb_type in VW). If you were to choose to use DR for both training and evaluation I'm not sure what it would do, but by default it doesn't do that. By default, it uses what it calls MTR (i.e., multi-task regression). You could try using DR instead of MTR... but on the whole MTR really is what you want to be using. It has a lot of theoretical justifications and in general is much, much simpler than DR.

All that said, I'd start with plotting a moving average instead of the progressive average and see if you see big changes in performance over time. If a moving average makes non-stationarity performance more clear then I'd definitely tweak the -l and --power_t parameters. You could also try --coin and no -l or --power_t. The coin flag I think should handle tuning -l and --power_t automatically. Sometimes it works well for me and sometimes it doesn't.
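
If it helps, here's a sketch of comparing those options side by side. This assumes coba's generic VowpalLearner accepts a raw VW argument string (double check its constructor); the flags themselves are standard VW ones:

learners = [
    VowpalOffPolicyLearner(),                                    # coba defaults
    VowpalLearner("--cb_adf --cb_type mtr -l 0.5 --power_t 0"),  # fixed learning rate, no decay
    VowpalLearner("--cb_adf --cb_type dr"),                      # train with DR instead of MTR
    VowpalLearner("--cb_adf --coin"),                            # let --coin tune the step size
]
result = Experiment(environments, learners).run()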