Yeah, that's not super great even when you've got a window span of 50,000.
I've been meaning to add the double robust estimate for a while and this gave me some motivation.
I just implemented and pushed all three of the most popular OPE evaluation methods: IPS, Double Robust, and Direct Method. You can choose which one to use via a new Rewards environment filter (now exposed on the Environments API as rewards). As long as you're applying it to environments with LoggedInteractions the filter should work no problem.
Here is some sample code of how they would work:
import coba as cb
#Next we create an environment we'd like to evaluate against
environments = cb.Environments.from_openml(150).take(10_000).logged(cb.RandomLearner()).materialize()
#We then create and run our experiment from our environments and learners
cb.Experiment(environments ,cb.VowpalEpsilonLearner()).run(seed=2).plot_learners(out=None,labels=['True'],colors=0,xlim=(100,None))
cb.Experiment(environments.rewards("IPS"),cb.VowpalEpsilonLearner()).run(seed=2).plot_learners(out=None,labels=['IPS' ],colors=1,xlim=(100,None))
cb.Experiment(environments.rewards("DR" ),cb.VowpalEpsilonLearner()).run(seed=2).plot_learners(out=None,labels=['DR' ],colors=2,xlim=(100,None))
cb.Experiment(environments.rewards("DM" ),cb.VowpalEpsilonLearner()).run(seed=2).plot_learners(out=None,labels=['DM' ],colors=3,xlim=(100,None))
from matplotlib import pyplot as plt
plt.show()
The above code creates this plot.
You can see that the double robust method on the covertype data set converges much more quickly to the true value.
Test out the new code for yourself and let me know what you think. If it still is not working as well as you'd like we can continue to optimize the DR implementation. I'll just need to expose some more DR hyperparameters so you can tune it for your specific case.
Oh, and I'll wait to release until you have a chance to check it out.
Once we know it's working on your dataset I'm happy to release so your teammates can get the latest more easily.
(oh also, there is one more piece of low-hanging fruit which would greatly improve these estimator accuracies. I've just been quasi avoiding it because there isn't really a great way to implement it in a model agnostic way. If you're interested in it though I can tell you how to do it with a very small modification to SimpleEvaluation.)
Oh, one final idea. If the off-policy evaluation continues to be of questionable reliability there's also a trick we could do to emulate on-policy evaluation. On-policy emulation would give a better sense of how a learner using exploration would perform over time.
(this could also be done via a very small change to SimpleEvaluation)
Fantastic, thanks a lot!
I gave it a try with the synthetic data gen benchmark by comparing the rewards accumulation of the random policy and was a bit surprised that DM performed the best, with DR showing similarly high variance to IPS.
import ast
from datetime import datetime

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

import coba as cb
from coba import Environments, SimulatedInteraction

class CustomEnvironment:
    def __init__(self, n_interactions):
        self._n_interactions = n_interactions

    def read(self):
        rng = cb.CobaRandom(1) #so the simulation is repeatable
        for _ in range(self._n_interactions):
            features = rng.randoms(3)
            context = dict(zip(['feature_1','feature_2','feature_3'], features))
            rewards = np.random.binomial(1,
                np.clip([
                    features[0] - 0.5 * features[1] + 0.25,
                    0.5
                ], 0, 1)).tolist()
            yield SimulatedInteraction(context=context, actions=[0, 1], rewards=rewards)
online_learners = cb.VowpalSoftmaxLearner(features=[1, 'x', 'a', 'ax', 'axx'])
online_environments = cb.Environments(CustomEnvironment(300_000))#.shuffle(n=5)
online_logged = cb.SimpleEvaluation(record=['context','actions','rewards','action','reward','probability','ope_loss'])
online_result = cb.Experiment(online_environments, online_learners, evaluation_task=online_logged).run(quiet=True)
df = online_result.interactions.to_pandas()
offline_learners = [
cb.VowpalOffPolicyLearner([1, 'x', 'a', 'ax', 'axx']),
cb.RandomLearner(),
]
offline_environments = Environments.from_dataframe(df)
offline_result = cb.Experiment(offline_environments, offline_learners, evaluation_task=online_logged).run(quiet=True)
# CSV serialization (optional)
file_name = f"bandit_replay_{datetime.now()}.csv"
df.to_csv(file_name)
df = pd.read_csv(file_name, converters={column: ast.literal_eval for column in
['context', 'action', 'actions', 'probability', 'reward', 'rewards']})
# test IPS estimation in place of the simulated rewards
df = df.drop(columns=['rewards'])
offline_learners_ips = [
cb.VowpalOffPolicyLearner([1, 'x', 'a', 'ax', 'axx']),
cb.RandomLearner()
]
offline_environments_ips = Environments.from_dataframe(df).rewards("IPS")
offline_result_ips = cb.Experiment(offline_environments_ips, offline_learners_ips, evaluation_task=online_logged).run(quiet=True, seed=2)
offline_environments_dr = Environments.from_dataframe(df).rewards("DR")
offline_result_dr = cb.Experiment(offline_environments_dr, cb.RandomLearner(), evaluation_task=online_logged).run(quiet=True, seed=2)
offline_environments_dm = Environments.from_dataframe(df).rewards("DM")
offline_result_dm = cb.Experiment(offline_environments_dm, cb.RandomLearner(), evaluation_task=online_logged).run(quiet=True, seed=2)
for span in [None, 10_000, 1_000]:
    plt.figure(figsize=(12,10))
    online_result.plot_learners(labels=['online'], span=span, colors=[0], out=None)
    offline_result.plot_learners(labels=['offline', 'random'], span=span, colors=[1], out=None)
    offline_result_ips.plot_learners(labels=['offline_ips', 'random IPS'], span=span, colors=[4], out=None)
    offline_result_dr.plot_learners(labels=['random DR'], span=span, colors=[6], out=None)
    offline_result_dm.plot_learners(labels=['random DM'], span=span, colors=[8], out=None)
    # theoretical best is the average of the expected values for each action (0.75 + 0.5) / 2 = 0.625
    plt.axhline(y=0.63, color='y', linestyle='--', label='theoretical best')
    plt.legend()
    plt.ylim(0.45, 0.65)
    plt.show()
These are the variance metrics:
'DR': 20.254517158484497,
'DM': 0.046198184826941664,
'IPS': 52.16605634282907,
'random without estimators': 0.2500005698241216 # this is the one utilizing the known rewards of the simulation
And these are the deviations from the true random policy:
# error sum
'DR': 4835.63365234987,
'DM': -445.0599635839462,
'IPS': 1442.9938073088433,
'random without estimators': 0
# absolute error sum
'DR': 157218.92738468215,
'DM': 136947.93430993892,
'IPS': 162670.99380730884,
'random without estimators': 0
I also tried it on the log data replay.
For this one we don't know the underlying distribution, so I just looked at the variance:
'DR': 7.537642716511698,
'DM': 0.0033165068818281334,
'IPS': 16.221096010162526,
'production policy': 0.0002453324706672597
Does my experiment setup look ok to you? Why do you think DM worked well for my problem (compared to yours) and DR didn't?
The paper sounds super relevant to what I am trying to do. I'll give it a closer look.
I think your setup is correct. (I love that we can just drop code in here and share the experiments).
What I think you're observing is due to your data and your simulated experiment having very small probabilities.
To explain why, I'm going to write out the three methods real quick (no need to look closely, I point out relevant points below):
$\text{IPS}(\hat{a}) = \frac{r_a}{p_a}\ \text{if}\ \hat{a}==a\ \text{else}\ 0$
$\text{DM}(\hat{a}) = \hat{f}(x,\hat{a})$
$\text{DR}(\hat{a}) = (\frac{r_a-\hat{f}(x,\hat{a})}{p_a}\ \text{if}\ \hat{a}==a\ \text{else}\ 0) + \hat{f}(x,\hat{a})$
where $r_a$ is the logged reward for an interaction, $p_a$ is the logged probability, $x$ is the logged context, $a$ is the logged action, $\hat{a}$ is the action an off-policy learner chooses to play and $\hat{f}$ is a regressor that predicts how much reward we will receive for playing an action in context $x$.
Now, regarding variance, note that both $\text{IPS}$ and $\text{DR}$ have a $p_a$ in the denominator. This means that when $p_a$ gets very small the variance of our estimators is going to get very large. In theory it is possible for $\text{DR}$ to be less sensitive to small $p_a$ because its numerator has $r_a-\hat{f}(x,\hat{a})$. That is, when $\hat{f}(x,\hat{a})$ is close to $r_a$ the numerator will be close to 0 making a small value of $p_a$ less of a concern.
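To make that concrete, here is a tiny self-contained sketch (plain Python, not coba; the numbers are made up) comparing the spread of IPS-style estimates of a single action's value when the logger plays that action with $p_a = 1/2$ versus $p_a = 1/1000$:

import random

def ips_estimates(p_log, true_reward=1.0, n=10_000, seed=1):
    # Simulate n IPS estimates of one action's value when the logging policy
    # plays that action with probability p_log and the reward is always true_reward.
    rng = random.Random(seed)
    return [true_reward/p_log if rng.random() < p_log else 0.0 for _ in range(n)]

for p in [0.5, 0.001]:
    ests = ips_estimates(p)
    mean = sum(ests)/len(ests)
    var  = sum((e-mean)**2 for e in ests)/len(ests)
    print(f"p_a={p}: mean~{mean:.2f}, variance~{var:.0f}")

# Both means hover around the true value of 1.0 (IPS stays unbiased),
# but the variance jumps from roughly 1 to roughly 1000 as p_a shrinks.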
So, why do theoretical people like $\text{IPS}$ and $\text{DR}$ despite their sensitivity to $p_a$? It's because the reliability of the $\text{DM}$ estimate is directly tied to our ability to correctly learn the regressor $\hat{f}$. If we can learn that regressor then the CB problem is solved. So, the harder the CB problem the more suspect the $\text{DM}$ estimate becomes.
Your simulated experiment has both of the hallmarks that suggest $\text{DM}$ should do better than $\text{IPS}$ and $\text{DR}$: it has small values for $p_a$ (I saw 1/1000 at the smallest) and we are able to very effectively solve the CB problem suggesting that learning a good $\hat{f}$ should be possible. My simulated experiment on the other hand had all the hallmarks to suggest that $\text{DR}$ would do well: my $p_a$ never got smaller than 1/7 and I intentionally chose a real world data set where learning $\hat{f}$ was very hard.
We can even go so far as to say that $\text{DM}$ will always have a smaller variance than $\text{DR}$ due to the way the two are calculated. The real question is which one is closer to the true value. When you look at the original experiment I shared, even though $\text{DM}$ has lower variance than $\text{DR}$ it is much further from the true value. However, in your experiment it is much closer. The million dollar question is what is true for your production dataset. I suspect your production data set has very small $p_a$ while also being very hard to learn $\hat{f}$ so I don't know what would be best.
Given the above situation, and the amount of data you seem to have, the second paper I shared might be your best bet. It would give you an unbiased estimator (like IPS/DR) without being sensitive to $p_a$ in the estimator. Instead it will be sensitive to $p_a$ in how much data you need. That is, you would throw out data with probability $1-p_a$. What you would really gain even more though is not a more reliable estimator but rather the ability to evaluate online exploration, which the current OPE methods can't really do.
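For intuition, here is a minimal sketch of that rejection idea under the simplest assumption (a uniformly random logging policy): keep an interaction only when the policy being evaluated picks the logged action, then average the kept rewards. The data shapes and names are illustrative, not coba's API:

import random

rng = random.Random(1)
n_actions = 2

# Pretend logged data: (context, logged_action, reward), actions logged uniformly at random.
logged = []
for _ in range(10_000):
    x = rng.random()
    a = rng.randrange(n_actions)
    r = int(rng.random() < (0.75 if a == 0 else 0.5))  # arbitrary per-action reward rates
    logged.append((x, a, r))

def policy(x):
    return 0  # the (fixed) policy we want to evaluate

kept = [r for (x, a, r) in logged if policy(x) == a]  # throw out interactions the policy disagrees with
print(len(kept), sum(kept)/len(kept))                 # ~5000 kept, estimate ~0.75 with no 1/p_a scaling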
Thank you, the explanations are super helpful! So, there's no silver bullet?! ;-)
Besides the actual problem, the type of learner used for logging also seems to strongly influence the outcome, as the (average) probability varies wildly between the different implementations.
When re-running my synthetic data gen experiment with an Epsilon-Greedy learner instead of the Softmax one, DR performs better. (It's interesting how well the offline models learn from the pretty weak online one.)
I re-ran your simulation with different learners in place of the random policy but not too much changed.
My logged production data has indeed lots of low probability values. So, it sounds like IPS and DR might struggle with it (as well as DM for other reasons).
You mentioned a low-hanging fruit that could possibly improve performance?
In the meantime, I'll check out the paper. Someone implemented the approach in Python here. Do you know how related it is to the VW explore eval component?
The potential low-hanging fruit is grabbing the probability for every action and then recording as the reward for an interaction the average of all rewards. In effect you would be directly calculating the expectation we would converge to if we played that interaction over and over and over again. This would remove variance due to randomization in the policy being evaluated.
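In other words, rather than logging the reward of the one sampled action, you would log the policy's expected reward for the interaction. A minimal sketch of that calculation (the function name is mine, not a coba hook):

def expected_reward(action_probs, action_rewards):
    # Probability-weighted average of the per-action rewards, i.e. the value we would
    # converge to if this interaction were played over and over under the same policy.
    return sum(p*r for p, r in zip(action_probs, action_rewards))

print(expected_reward([0.7, 0.3], [1, 0]))  # 0.7, with no variance from sampling an action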
And here is a slightly modified SimpleEvaluation that should do what that paper proposes. The only change is that you have to provide it the minimum probability across all logged interactions as well as the number of actions there are. The magic sauce happens on the two lines that I comment with #reject and sample. That's where you are going to "lose" a lot of data due to rejection. Hopefully though it isn't so much that results become unusable.
class SpecialEvaluation(EvaluationTask):

    def __init__(self,
        min_p,
        n_actions,
        record = ['reward'],
        learn: bool = True,
        predict: bool = True,
        seed: float = None) -> None:

        self._f = 1/n_actions
        self._M = 1/(n_actions*min_p)

        self._record  = [record] if isinstance(record,str) else record
        self._learn   = learn
        self._predict = predict
        self._seed    = seed

        if 'ope_loss' in self._record:
            # OPE loss metric is only available for VW models
            # Divide by the number of samples for the average loss metric and see this article for more info
            # https://vowpalwabbit.org/docs/vowpal_wabbit/python/latest/tutorials/off_policy_evaluation.html
            PackageChecker.vowpalwabbit('SimpleEvaluation.__init__')

    def process(self, learner: Learner, interactions: Iterable[Interaction]) -> Iterable[Mapping[Any,Any]]:

        rng = cb.CobaRandom(1)

        learner = SafeLearner(learner, self._seed if self._seed is not None else CobaContext.store.get("experiment_seed"))

        _, interactions = peek_first(interactions)

        predict = learner.predict
        learn   = learner.learn

        record_prob     = 'probability' in self._record
        record_action   = 'action'      in self._record
        record_context  = 'context'     in self._record
        record_ope_loss = 'ope_loss'    in self._record
        record_actions  = 'actions'     in self._record
        record_reward   = 'reward'      in self._record

        for interaction in interactions:

            interaction = interaction.copy()

            #reject and sample again to remove bias
            if rng.random() > self._f/(self._M*interaction['probability']): continue

            context = interaction.pop('context')
            actions = interaction.pop('actions')
            reward  = interaction['reward']

            action,prob,info = predict(context, actions)

            #reject and sample again because a different action was chosen
            if action != interaction['action']: continue

            learn(context, actions, action, reward, prob, **info)

            out = {}

            if record_context : out['context']     = context
            if record_prob    : out['probability'] = prob
            if record_action  : out['action']      = action
            if record_actions : out['actions']     = actions
            if record_reward  : out['reward']      = reward

            if record_ope_loss:
                # OPE loss metric is only available for VW models
                try:
                    out['ope_loss'] = learner._learner._vw._vw.get_sum_loss()
                except AttributeError:
                    out['ope_loss'] = float("nan")

            yield out
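For reference, here is a hedged sketch of how it would slot in, mirroring the SimpleEvaluation calls earlier in the thread (offline_environments and offline_learners are the objects from that snippet, and the min_p value is illustrative):

# min_p is the smallest logged probability and n_actions the number of actions.
special = SpecialEvaluation(min_p=0.5, n_actions=2, record=['reward','action','probability'])
result  = cb.Experiment(offline_environments, offline_learners, evaluation_task=special).run(quiet=True)

Note that the first rejection step accepts with probability f/(M*p) = min_p/p, so a softmax logger with a min_p around 0.0007 will reject almost everything there, while a uniform random logger over two actions (min_p = p = 0.5) never rejects at that step.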
Oh, and I just went and read about the explore eval in VW. That is exactly what the above is doing (and the paper proposes).
The VW version is probably better. I wrote the above code pretty quickly without a ton of thought regarding easy efficiency gains. It's funny, I had been told this functionality didn't exist in VW. Also, it sounds like VW directly performs rejection sampling to sample according to the exploration policy, which might be more sample efficient.
Actually, the VW version is slightly different. They say they're calculating IPS reward on all examples, which I assume includes the rejected ones. What the paper proposes and I do above in SpecialEvaluation only uses reward from non-rejected examples so we don't have to do IPS. Oh I see, they allow you to tune the rejection rate so you might accept more than you otherwise would to be more data efficient at the cost of introducing bias. In that case you would need to use IPS...
So, the VW version seems like it is kind of a continuous parameterized version of the above/paper. On one end you have full OPE and on the other you have the unbiased explore eval above/in-paper and the VW version lets you move between those two ends giving something that is a bit of both so that you might need IPS but can still quasi-evaluate exploration algorithms.
If you put your data into VW format and only want to use VW algo's I think you could just use the VW functionality.
Here are some observations from playing around with the SpecialEvaluation.
When running the synthetic data generation example with a softmax explorer the rejection sampling results are pretty erratic. Of the 300k samples all but 600 get rejected and the model only takes action 0. The learners as well as the random policy are way off the expected values. Min_p was 0.000697 and these are the rejection counts for bias or the wrong action respectively {'bias': 298543.0, 'action': 752.0}.
When using a random learner in the online evaluation the results look much better. Min_p is now 0.5 and the rejection counts are now {'bias': 0.0, 'action': 78956.25}. The graph shows the RS-based model converging a bit slower but reaching similar average reward as the estimate-based methods.
When running the RS-based approach against the production data logs it rejected all but 4k samples out of 600k due to a similarly low min_p. The graphs and metrics look fine with much lower variance but I am unsure how much they can be trusted given how poorly the approach performed on the synthetic data with a similar rejection rate.
I haven’t made it through the paper yet all the way but I saw it mentions “The only requirement of this method is that the log data is generated i.i.d. with arms chosen by an (ideally uniformly) random policy” / “A related question is how to make use of non-random data for reliable offline evaluation, for which a recent progress has been made [24]” which makes me wonder if it’s actually applicable to log data of an existing model.
I also tried running the --explore_eval VW CLI test but it failed without explanation. I'll dig some more and try it on a different machine.
❯ vw --explore_eval --softmax -q CA -d 'synthetic_data_vw_2023-04-13 23:56:32.379581.txt'
creating quadratic features for pairs: CA
using no cache
Reading datafile = synthetic_data_vw_2023-04-13 23:56:32.379581.txt
num sources = 1
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
cb_type = mtr
Enabled learners: gd, scorer-identity, csoaa_ldf-rank, cb_adf, cb_explore_adf_softmax, explore_eval, shared_feature_merger
Input label = CB
Output pred = ACTION_PROBS
average since example example current current current
loss last counter weight label predict features
libc++abi: terminating
zsh: abort vw --explore_eval --softmax -q CA -d
❯ head synthetic_data_vw_2023-04-13\ 23:56:32.379581.txt
shared |Context feature_1:0.10863548330962658 feature_2:0.798008649609983 feature_3:0.440324354916811
0:0:0.5 |Action action=0
|Action action=1
shared |Context feature_1:0.17055912036448717 feature_2:0.41731750406324863 feature_3:0.6237910492345691
|Action action=0
0:0:0.5 |Action action=1
shared |Context feature_1:0.3063608556985855 feature_2:0.15528484527021646 feature_3:0.40542458556592464
0:-1:0.5 |Action action=0
Thanks for the info. I don't know if you noticed but I pushed several changes tonight to make it possible to directly query for action probabilities from learners. This'll make it possible to (1) fully take advantage of that low-hanging fruit I'd mentioned to you (2) implement this paper and (3) more efficiently implement the original unbiased estimator paper.
Hopefully by the end of this week I'll have it all done and extensively tested.
In other news, I'm starting to look at submitting a paper about COBA to an open source software journal. Would you have any interest in being a co-author on that? Outside of myself you've probably contributed second-most to the project at this point.
Awesome! I had a look at your changes and left a comment. Feel free to tag me on PRs if you need a second set of eyes. Getting the offline evaluation right is super valuable in my opinion.
I am flattered and happy to contribute to the paper. I'll send you an email for easier coordination.
Alright,
I just made a major push. This push has two key pieces relevant to this issue: the new ope_rewards environment filter and the new ExplorationEvaluation evaluation task.
Here's example code showing how you can test it:
import coba as cb
#This will cause a weird logging policy to be learned.
#A good way to stress test the offpolicy exploration.
class CycledLearner:

    def __init__(self, learner, cycle=True):
        self._learner = learner
        self._cycle = cycle

    def request(self, *args):
        return self._learner.request(*args)

    def predict(self, *args):
        return self._learner.predict(*args)

    def learn(self, context, actions, action, reward: float, probability: float):
        if self._cycle: action = actions[(actions.index(action)+1) % len(actions)]
        self._learner.learn(context, actions, action, reward, probability)

class CustomEnvironment:

    def __init__(self, n_interactions, seed=1):
        self._n_interactions = n_interactions
        self._seed = seed

    def read(self):
        rng = cb.CobaRandom(self._seed) #so the simulation is repeatable
        for _ in range(self._n_interactions):
            features  = rng.randoms(3)
            context   = dict(zip(['feature_1','feature_2','feature_3'], features))
            prob_of_1 = [ features[0] - 0.5 * features[1] + 0.25, 0.5 ]
            rewards   = [ int(x<p) for x,p in zip(rng.randoms(2),prob_of_1) ]
            yield cb.SimulatedInteraction(context=context, actions=[(1,0), (0,1)], rewards=rewards)

if __name__ == "__main__":

    n_processes = 8

    #env = cb.Environments.from_openml(150) #Covertype
    env = cb.Environments(CustomEnvironment(30_000))

    sims = env.reservoir(30_000).cache().shuffle(n=20).materialize()
    logs = sims.logged(CycledLearner(cb.VowpalEpsilonLearner(),False),None).shuffle().materialize()

    tr = logs                    # Use the true rewards for OPE
    dm = logs.ope_rewards("DM")  # Use DM rewards for OPE
    dr = logs.ope_rewards("DR")  # Use DR rewards for OPE
    no = logs.ope_rewards("NO")  # Do not use any rewards for OPE

    cb.Experiment(sims.take(4000), [cb.VowpalSoftmaxLearner(),cb.VowpalEpsilonLearner()], evaluation_task=cb.SimpleEvaluation(record=['reward','action','probability'])).run('out14a.log.gz',processes=8,seed=None)
    cb.Experiment(tr, [cb.VowpalSoftmaxLearner(),cb.VowpalEpsilonLearner()], evaluation_task=cb.ExplorationEvaluation(record=['reward','action','probability'])).run('out14b.log.gz',processes=8,seed=None)
    cb.Experiment(dm, [cb.VowpalSoftmaxLearner(),cb.VowpalEpsilonLearner()], evaluation_task=cb.ExplorationEvaluation(record=['reward','action','probability'])).run('out14c.log.gz',processes=8,seed=None)
    cb.Experiment(dr, [cb.VowpalSoftmaxLearner(),cb.VowpalEpsilonLearner()], evaluation_task=cb.ExplorationEvaluation(record=['reward','action','probability'])).run('out14d.log.gz',processes=8,seed=None)
    cb.Experiment(no, [cb.VowpalSoftmaxLearner(),cb.VowpalEpsilonLearner()], evaluation_task=cb.ExplorationEvaluation(record=['reward','action','probability'])).run('out14e.log.gz',processes=8,seed=None)
#This uses the true rewards for OPE (not fair but good for testing)
cb.Result.from_file("out14a.log.gz").plot_learners(out=None)
cb.Result.from_file("out14b.log.gz").plot_learners(colors=2)
#This compares using DM for OPE with exploration evaluation
cb.Result.from_file("out14a.log.gz").plot_learners(out=None)
cb.Result.from_file("out14c.log.gz").plot_learners(colors=2)
#This compares using DR for OPE with exploration evaluation
cb.Result.from_file("out14a.log.gz").plot_learners(out=None)
cb.Result.from_file("out14d.log.gz").plot_learners(colors=2)
#This doesn't do any OPE and only uses the results from rejection sample
#This is similar to the previous code I sent you but much more sample efficient
cb.Result.from_file("out14a.log.gz").plot_learners(out=None)
cb.Result.from_file("out14e.log.gz").plot_learners(colors=2)
I played around with it and adapted the code a bit for my notebook.
from lyftlearnrl.evaluation.benchmark.coba.new_ope import CustomEnvironment
import coba as cb
from datetime import datetime
n_processes = 10
#env = cb.Environments.from_openml(150) #Covertype
env = cb.Environments(CustomEnvironment(30_000))
sims = env.reservoir(30_000).cache().shuffle(n=20).materialize()
logs = sims.logged(CycledLearner(cb.VowpalEpsilonLearner(),False),None).shuffle().materialize()
tr = logs # Use the true rewards for OPE
dm = logs.ope_rewards("DM") # Use DM rewards for OPE
dr = logs.ope_rewards("DR") # Use DR rewards for OPE
no = logs.ope_rewards("NO") # Do not use any rewards for OPE
record_metrics = ['reward','action','probability']
learners = [cb.VowpalSoftmaxLearner(),cb.VowpalEpsilonLearner()]
experiments = {
'sims': sims.take(4000),
'tr': tr,
'dm': dm,
'dr': dr,
'no': no
}
timestamp = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
for name, experiment in experiments.items():
    cb.Experiment(experiment, learners, evaluation_task=cb.SimpleEvaluation(record=record_metrics)).run(f'out_{timestamp}_{name}.log.gz', processes=n_processes, seed=None, quiet=True)

from matplotlib import pyplot as plt

xlim = (None, None)
for span in [None, 10_000, 1_000]:
    plt.figure(figsize=(12,10))
    color_count = 0
    for name, experiment in experiments.items():
        color_count += len(learners)
        cb.Result.from_file(f"out_{timestamp}_{name}.log.gz").plot_learners(out=None, span=span, colors=color_count, xlim=xlim, labels=[f'{name} {learner.__class__.__name__}' for learner in learners])
    plt.ylim(0.55,0.7)
    plt.show()
Btw, as we talked about Copilot the other day, the grey code is what it proposed. Pretty impressive from that limited context:
The ope_rewards("NO")
ran for me into issues
Unexpected exception:
File "/Users/jonast/src/coba/coba/experiments/process.py", line 196, in filter
row = list(item.task.process(lrn, finalizer.filter(interactions)))
File "/Users/jonast/src/coba/coba/experiments/tasks.py", line 100, in process
yield from OffPolicyEvaluation(self._record, self._learn, self._predict, self._seed).process(learner, interactions)
File "/Users/jonast/src/coba/coba/experiments/tasks.py", line 291, in process
ope_reward = sum(on_p*log_rewards.eval(a) for on_p,a in zip(on_probs,log_actions))
File "/Users/jonast/src/coba/coba/experiments/tasks.py", line 291, in <genexpr>
ope_reward = sum(on_p*log_rewards.eval(a) for on_p,a in zip(on_probs,log_actions))
AttributeError: 'NoneType' object has no attribute 'eval'
which leads to a log file without interactions.
I don't quite understand the purpose of the CycledLearner and why we use it without the cycling enabled.
What's the point of taking 4k samples for the simulation and running everything else with 30k samples?
Also, do we not want to use the new ExplorationEvaluation instead of SimpleEvaluation?
Ah, I just realized that in your code snippet the evaluation task changes between sims and the evaluators, and therefore I changed the code to:
experiments = {
# 'sims': sims.take(4000),
'tr': tr,
'dm': dm,
'dr': dr,
'no': no
}
timestamp = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
cb.Experiment(sims.take(4000), learners, evaluation_task=cb.SimpleEvaluation(record=record_metrics)).run(f'out_{timestamp}_sims.log.gz',processes=n_processes,seed=None, quiet=True)
for name, experiment in experiments.items():
    cb.Experiment(experiment, learners, evaluation_task=cb.ExplorationEvaluation(record=record_metrics)).run(f'out_{timestamp}_{name}.log.gz', processes=n_processes, seed=None, quiet=True)
With this or when running your script, I still get this error message
Traceback (most recent call last):
File "/Users/jonast/src/python-lyft-lyftlearn-rl/lyftlearnrl/evaluation/benchmark/coba/mark.py", line 50, in <module>
cb.Experiment(tr , [cb.VowpalSoftmaxLearner(),cb.VowpalEpsilonLearner()], evaluation_task=cb.ExplorationEvaluation(record=['reward','action','probability'])).run('out14b.log.gz',processes=8,seed=None)
File "/Users/jonast/src/coba/coba/experiments/core.py", line 179, in run
return sink.read()
File "/Users/jonast/src/coba/coba/experiments/results.py", line 553, in read
return self._transactionIO.read()
File "/Users/jonast/src/coba/coba/experiments/results.py", line 436, in read
return Result(*old_to_new(env_rows, lrn_rows, int_rows), exp_dict)
File "/Users/jonast/src/coba/coba/experiments/results.py", line 81, in old_to_new
int_table.insert(cols=index_columns+ordered_data_cols)
File "/Users/jonast/src/coba/coba/experiments/results.py", line 211, in insert
assert len(set(map(len,cols))) == 1, "Different sized column entries were provided."
AssertionError: Different sized column entries were provided.
with one of the columns being off by one element
Thanks for your latest changes in Coba, the simulation is now working fine for me.
This is the notebook
PROCESS_COUNT = 10
SAMPLE_COUNT = 30_000
# env = cb.Environments.from_openml(150).take(SAMPLE_COUNT)
env = cb.Environments(CustomEnvironment(SAMPLE_COUNT))
sims = env.reservoir(SAMPLE_COUNT).cache().shuffle(n=20).materialize()
# not using cycled epsilon learner
logs = sims.logged(cb.RandomLearner(),None).shuffle().materialize()
tr = logs # Use the true rewards for OPE
dm = logs.ope_rewards("DM") # Use DM rewards for OPE
dr = logs.ope_rewards("DR") # Use DR rewards for OPE
no = logs.ope_rewards("NO") # Do not use any rewards for OPE
record_metrics = ['reward','action','probability']
learners = [cb.VowpalOffPolicyLearner()]
# learners = [cb.VowpalSoftmaxLearner(),cb.VowpalEpsilonLearner()]
def get_learners(rewards_type):
    # Use non-exploring learner for ground truth
    return [cb.VowpalOffPolicyLearner()] if rewards_type == 'tr' else learners
experiments = {
'sims': sims.take(SAMPLE_COUNT//2),
'tr': tr,
'dm': dm,
'dr': dr,
'no': no
}
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
for name, experiment in experiments.items():
    evaluation_task = cb.SimpleEvaluation(record=record_metrics) if name == 'sims' else cb.ExplorationEvaluation(record=record_metrics)
    cb.Experiment(experiment, get_learners(name), evaluation_task=evaluation_task).run(f'out_{timestamp}_{name}.log.gz', processes=PROCESS_COUNT, seed=None, quiet=True)
from matplotlib import pyplot as plt
xlim = (None, None)
for span in [None, 10_000, 1_000]:
    plt.figure(figsize=(12,10))
    color_count = 0
    for name, experiment in experiments.items():
        color_count += len(learners)
        cb.Result.from_file(f"out_{timestamp}_{name}.log.gz").plot_learners(out=None, span=span, colors=color_count, xlim=xlim, labels=[f'{name} {learner.__class__.__name__}' for learner in get_learners(name)])
    plt.ylim(0.55,0.625)
    plt.show()
We can consider tr as the ground truth with the offline learner, right?
In my simulation with a random learner, no performed the best, closely followed by dr.
I also ran it with the Softmax and Epsilon learners (learners = [cb.VowpalSoftmaxLearner(),cb.VowpalEpsilonLearner()]): no performed the best again but dr was one of the worst.
no also worked best, followed by dr, for env = cb.Environments.from_openml(150).take(SAMPLE_COUNT), but this time about 87% of the samples were lost.
When running the evaluation for a learner other than random, there's a significantly higher loss of samples.
For logs = sims.logged(cb.VowpalEpsilonLearner(),None).shuffle().materialize() about 1% of the data was left, for some reason only from one environment (9).
For Softmax, there are fewer dropped samples but no is way off while dr seems ok.
Can you help me interpret these results a bit better? I haven't fully wrapped my head around the logic in ExplorationEvaluation. With a friendly logging policy no seems to do great, but with logs from an actual model things are less clear.
Yeah, sorry I kind of disappeared. I had two papers due this week for conferences. I've dug out now and will respond tomorrow.
Glad to hear you're back from your time off. It sounds like you got good news about your employment situation.
No worries! Yes, I survived (unlike half of my team). Hope everything went well with the papers!
Let me give you some updates on the latest developments on my end. I've run a couple more experiments with the synthetic data generation (50% rejected) and the OpenML 150 data-set (85% rejected) and they all looked fine with the random explorer. Our production data collected from an RND learner could also be properly evaluated (89-95% rejected).
A couple of open questions that came up:
- For the synthetic case, can tr in combination with the off-policy learner be considered the ground-truth and you evaluate other learners in how much they differ from it?
- In all of the experiments dr and dm were close to identical. Is that expected and in which scenarios would they differ?
- How does the rejection sampling impact the actual performance of the learners (not just the evaluation metrics)? As we control which samples the model learns from here, should we employ similar techniques for the actual continuous (daily) learning of the model in production or would you always want to learn from every sample there? Further, can we employ the DR-ns technique in VW's internal learning rather than the default DR (or maybe use MTR)?
- Can we cut a Coba release after merging https://github.com/VowpalWabbit/coba/pull/40?
from matplotlib import pyplot as plt
SAMPLE_COUNT = 20_000
xlim = (500, SAMPLE_COUNT)
for span in [None, SAMPLE_COUNT // 5, SAMPLE_COUNT // 20]:
    plt.figure(figsize=(12,10))
    color_count = 0
    for name, experiment in experiments.items():
        color_count += len(learners)
        result = results[name]
        result.filter_fin("min").plot_learners(out=None, span=span, colors=color_count, xlim=xlim, labels=[f'{name} {learner.__class__.__name__}' for learner in get_learners(name)])
    plt.ylim(0.58,0.62)
    plt.show()
This plot makes it look like no is closer to tr than dr.
I thought it might have been an artifact from averaging across the different environments and therefore plotted the environments sequentially.
# Plot progressive reward by environment sequentially (rather than averaged as above)
plt.figure(figsize=(12,10))
ylim = (0.61,0.64) # clipped some environments with lower reward, extend to see them
for estimator in ['tr', 'dm', 'dr', 'no']:
    df = results[estimator].interactions.to_pandas()
    reward = results[estimator].interactions.to_pandas()\
        .groupby('environment_id')\
        .apply(lambda row: (row.reward.cumsum() / row['index']).rename('progressive_reward'))\
        .reset_index(drop=True)
    reward.where((ylim[0] < reward) & (reward < ylim[1]))\
        .dropna()\
        .plot(label=estimator)
plt.ylim(ylim)
plt.legend()
plt.show()
Again, it looks like no is closer to tr.
When calculating the MAE, dr is better, though.
# MAE from ground truth
ground_truth_reward = results['tr'].interactions.to_pandas().reward
{name: abs(result.interactions.to_pandas().reward - ground_truth_reward).mean() for name, result in results.items()}
{'sims': 0.4779387433591132,
'tr': 0.0,
'dm': 0.09576565393824185,
'dr': 0.09576565393824185,
'no': 0.14568421434931814}
# Calculate MAE per environment and learner
ground_truth = results['tr'].interactions.to_pandas()
mae_df = results_df.groupby(['environment_id', 'learner_id'])\
.apply(lambda row: abs(row.reward - ground_truth[ground_truth.environment_id == row.environment_id.iloc[0]].reward).dropna().mean())
mae_df
plot_df = mae_df.unstack(level=0).T
plot_df.columns = results.keys()
plot_df
plot_df.plot(ylim=(0.05, 0.2), xticks=plot_df.index)
Some insights into the MAE calculation (plots: the DR estimates, the NO estimates, and the absolute diff):
Any idea what might be up here? The same behavior has been observed on the synthetic data gen and the OpenML data-set:
Bonus question, how do you measure the convergence for your models? I've looked at the entropy of the models' probability outputs but as they vary wildly across learners (e.g. softmax and RND), I wonder if there's a better metric. Maybe incremental OPE loss plateauing?
I also sent you an email about the Coba paper.
Alright,
I've looked through your questions and there's a lot to unpack... Unfortunately, changes I just pushed might invalidate a lot of your results. After doing a lot more testing on a lot more datasets I determined that the VW flags I was using for DM and DR weren't working right. When logging probabilities got small VW could begin estimating rewards way outside of the range of anything we'd seen so far. So, I changed how I was calling VW and the results I'm seeing now look way way better. This could very well explain why you were seeing DM and DR doing so much worse than NO.
It's great to hear that the ExplorationEvaluation method seems to be working for you... Yeah, I found that I had to make qpct very small for it to work well which means a lot was rejected but the estimates looked good compared to the true online. In the plot below on-policy is the "True" exploration performance while all other lines are attempts to estimate the blue line from logged data. It looks like the best line is using logged data with ExplorationEvaluation(qpct=.005,ope=True).
The script that I used to generate that plot is now in the repo in ./examples/scripts/exploration_eval.py
Now your other questions:
- For the synthetic case, can tr in combination with the off-policy learner be considered the ground-truth and you evaluate other learners in how much they differ from it?
- In all of the experiments dr and dm were close to identical. Is that expected and in which scenarios would they differ? How does the rejection sampling impact the actual performance of the learners (not just the evaluation metrics)? As we control which samples the model learns from here, should we employ similar techniques for the actual continuous (daily) learning of the model in production or would you always want to learn from every sample there? Further, can we employ the DR-ns technique in VW's internal learning rather than the default DR (or maybe use MTR)?
- Can we cut a Coba release after merging https://github.com/VowpalWabbit/coba/pull/40?
- Something in my evaluation plots has puzzled me. The progressive reward graphs don't line up with the MAE calculations.
Great, I'll give the new estimators a try!
I hope the questions are clearer with some more details.
Now your other questions:
- For the synthetic case, can tr in combination with the off-policy learner be considered the ground-truth and you evaluate other learners in how much they differ from it?
- I'm not sure I understand what you're asking here...
My question is if we can consider the policy learned by the VowpalOffPolicyLearner on the tr data as the best possible solution that maximizes the accumulated rewards. My understanding is that tr contains the true (non-estimated) rewards of the simulation and the learner always takes the best action (without exploration), which should result in the best possible solution / lowest regret, and therefore we should compare all candidate learners against this one.
- In all of the experiments dr and dm were close to identical. Is that expected and in which scenarios would they differ? How does the rejection sampling impact the actual performance of the learners (not just the evaluation metrics)? As we control which samples the model learns from here, should we employ similar techniques for the actual continuous (daily) learning of the model in production or would you always want to learn from every sample there? Further, can we employ the DR-ns technique in VW's internal learning rather than the default DR (or maybe use MTR)?
- So I'm not sure I completely follow... For your first question, yeah DR and DM will likely be very similar. DR is DM with a few extra bits added to it. So, if we think of it as a function, $f_{\text{DR}}(x,a) = f_{\text{DM}}(x,a) + \text{some stuff}$. Or in OOP terms you could say DR is a subclass of DM. It does everything DM does and then adds a tiny bit extra. The rest I'm not sure if I'm following...
In previous experiments, learners using DR and DM based estimates differed quite significantly (plenty of examples in this thread, for example), so I am curious why they are virtually identical here.
The second part of the question is about how the performance of the learners is affected by rejection sampling. RS effectively creates a curated training set for the model to learn on (model doesn't learn on rejected samples). I am wondering how the model performance is expected to differ between a model that learned on the full data-set vs. one that learned on the curated data.
I've been in the process of moving to Seattle to intern with Microsoft Research. I'm actually working on this while at the airport on layover.
Awesome, hope it's an interesting and fun project!
I was unable to run your script because the JSON file is missing from the repo.
Suspecting that the 208 refers to the OpenML dataset, I tried envs = cb.Environments.from_openml(208).take(n_take) instead, but that fails with:
File "/Users/jonast/src/coba/coba/learners/safety.py", line 186, in predict
self._pred_kwargs = isinstance(pred[-1] if self._pred_batch != 'row' else pred[0][-1],abc.Mapping)
IndexError: list index out of range
The new Coba changes seem to perform slightly worse on my evaluation of the OpenML 150 dataset with default ExplorationEvaluation settings.
Reward averaged across environments: old vs. new plots.
Reward sequential for all environments: old vs. new plots.
MAE per environment: old vs. new plots.
MAE across all environments, old:
{'sims': 0.3840010133684225,
'tr': 0.0,
'dm': 0.11914597603804751,
'dr': 0.11914945464423198,
'no': 0.2727451437070426}
New:
{'sims': 0.38447523427167957,
'tr': 0.0,
'dm': 0.14228195264850155,
'dr': 0.14228767702940992,
'no': 0.2735560106010034}
Hmm...
It looks like that file is in the repo?
The 208 refers to a total of 208 OpenML data sets in the experiment. This is a collection of datasets I've just curated over time. It's hard to tune a lot of this stuff for everyone so I like to look at a whole bunch of data sets to set default values, though it's certainly not perfect.
Also, could you share your exact experiment code with me? I'm still not sure why your DM/DR are identical. I didn't realize how similar they were for you.
Here is more of what you should be seeing with 150. The DM and DR follow a similar path with the DR having some big shifts.
Maybe we should hop on a call together again at some point? I bet we could clear a lot of this up.
Oh, and yes you are completely right about rejection sampling. In this case though we are "curating" to make the data look like the exploration policy we want to evaluate.
That is, imagine we want to answer the question "how would a VW CB learner using RND exploration do in a production environment?" One way to answer this is to release it in production and watch but that carries some risk. It'd be nice if we could find this out without doing that. So, what we can do is use rejection sampling on logged data that we already have to try and simulate what RND would do on a production system. The RS is making the logged data look like what would happen if RND were released in the production system. Does that make sense?
Maybe to put it more concretely, say we previously released VW epsilon learner in production and logged a bunch of data. Then after the fact we want to know what would have happened if we had released VW RND instead. We can apply rejection sampling to the VW epsilon data so that it looks like VW RND data to answer that question...
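A stripped-down sketch of that acceptance test (just the core rejection step, with a fixed target policy rather than one that keeps learning as coba's evaluator does; all names here are illustrative):

import random

rng = random.Random(1)

def make_look_like(logged, target_prob, max_ratio):
    # Keep a logged interaction with probability target_prob/(max_ratio*logged_prob),
    # where max_ratio bounds target_prob/logged_prob, so the surviving interactions are
    # distributed as if the target policy had been the one choosing actions.
    for context, actions, action, logged_prob, reward in logged:
        if rng.random() < target_prob(context, actions, action) / (max_ratio * logged_prob):
            yield context, actions, action, reward

# Toy logs from a 95/5 epsilon-greedy-style logger, filtered to look uniformly random:
logged = []
for _ in range(10_000):
    action = 0 if rng.random() < 0.95 else 1
    logged.append(("ctx", [0, 1], action, 0.95 if action == 0 else 0.05, 1.0))

uniform = lambda context, actions, action: 1/len(actions)
kept = list(make_look_like(logged, uniform, max_ratio=0.5/0.05))
print(len(kept), sum(1 for k in kept if k[2] == 0))  # ~1000 kept, roughly half playing each action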
Along those lines, the importance of "tr" in these experiments is not that it has the true rewards but that it is the "true" performance of an online learner. In that regard 'no' really isn't a very fair comparison because I suspect the logged data in your experiments is identical to the tr learner... Doing it that way isn't very realistic because in practice our logged data likely doesn't look anything like the exploration policy we want to evaluate (i.e., your logged data probably doesn't look like data that would have been produced had VW RND been running).
A lot of this is really subtle... I hope it makes sense...
It could also be the case that I misunderstood what you're trying to get at. If all you want to do is learn the best possible policy from the logged data then the rejection sampling doesn't help at all. In that case all the data should be used for learning. The only reason to do RS is because we want to know how an exploration policy would have performed which is different from what is the absolute best policy we can learn from this already logged data...
Sorry, I was blind! Running the script worked after changing the path to '../templates/208_multiclass.json'
Here's a colab notebook that shows the same behavior for DR and DM. The compute is pretty weak and running the evaluation took about half an hour.
Thanks for the RS explanation, that makes things much clearer. It answers my question of whether there's any point in doing it during recurrent daily retraining - it doesn't, because it's the same learner / RS should barely have any effect.
There are a couple of different things I am trying to achieve.
First, I want to gain confidence in the estimators' performance since previous experiments with IPS/DR showed very high variance and RS got rid of 99% of samples.
For this, we use data-sets / synthetic data generation for which we know the underlying distribution. I've mostly used a random learner as the logging policy and I was hoping to use tr with the VowpalOffPolicyLearner as the best solution that all other estimators have to measure up against.
My understanding is that for these simulations we know the rewards of all actions for each interaction and the VowpalOffPolicyLearner is gonna take the best one. Is that incorrect, or is there a better way to get the best policy to which the other models are to be compared?
Going by your last comment about not using RS for it, would a SimpleEvaluation on the tr data with VowpalOffPolicyLearner be a better performance benchmark?
After identifying the most robust estimator the goal is to use it on the logging data of our production RND model and evaluate other exploration techniques as well as HPs against it. It seems like the RS component will be helpful for evaluating other algos like softmax but it's not really necessary when evaluating other HPs for RND learners. According to the paper it shouldn't have much of a negative effect either, though, if the learners are pretty similar (idempotent self-evaluation).
I also sent you a meeting invite for tomorrow to discuss some of the subtleties. Hope that works! Feel free to move it around, my schedule is pretty flexible (East Coast).
Alright, after doing some testing, most of the weirdness we've seen is due to using cb.RandomLearner for the logging policy:
My understanding is that for these simulations we know the rewards of all actions for each interaction and the VowpalOffPolicyLearner is gonna take the best one.
VowpalOffPolicyLearner always takes what it believes is the best action... That doesn't mean its belief is correct... If you have a good understanding of epsilon-greedy then another way to think of VowpalOffPolicyLearner is as an epsilon-greedy learner with epsilon=0. That is, it still learns like all other VW learners, there's just no reason to pick an action that you think is bad if you won't get any new information from it. VowpalOffPolicyLearner is the best VW baseline for off policy because it never takes an action that it believes is bad. It doesn't mean its belief is correct because it is learning like every other learner. In comparison, VowpalEpsilonLearner will take an action that it believes is bad with probability epsilon even though there's no benefit for doing so in the off-policy setting.
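A toy illustration of that point outside of VW/coba: epsilon-greedy action selection collapses to pure greedy behavior (what the off-policy learner does) when epsilon is 0, while the estimates it acts on can still be wrong:

import random

def epsilon_greedy(estimated_values, epsilon, rng):
    # With probability epsilon pick uniformly at random, otherwise pick the action
    # with the highest *estimated* value (which may well be mistaken).
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_values))
    return max(range(len(estimated_values)), key=estimated_values.__getitem__)

rng = random.Random(1)
beliefs = [0.4, 0.6]  # the learner's current estimates, not the ground truth

print(epsilon_greedy(beliefs, epsilon=0.0, rng=rng))  # always 1: greedy w.r.t. its (possibly wrong) beliefs
print(epsilon_greedy(beliefs, epsilon=0.1, rng=rng))  # sometimes 0: exploration with no off-policy benefit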
Going by your last comment about not using RS for it, would a SimpleEvaluation on the tr data with VowpalOffPolicyLearner be a better performance benchmark?
The short answer is I think so. Slightly longer answer is OffPolicyEvaluation is what you actually want but that is what SimpleEvaluation will do internally so there's no difference between the two. The longest answer is that there are actually a few choices here. What you suggest above I think is the best learner and training method but doesn't answer the question of how to do reward estimation... You need all three to get a good base line: (1) correct learner (2) correct training method (3) correct reward estimator. Maybe we could talk about specific experiments you could run to feel confident in (3)?
Also I want to say that I'm sorry, we've been at this for a long time. I'm trying to think of a path that gets you directly to results that are useful to you. You are learning a lot and I can clearly see your understanding increasing. The RS stuff may have been a bit of a diversion that will pay off eventually. In the immediate term it sounds like talking about experiments you could run to feel confident in correct reward estimators gets you almost to immediate pay out. Thoughts?
Thanks for the great explanation! I just re-ran the experiment with an RND logging policy and the results are much more in line with expectations. About 93% of samples are rejected from the OpenML 150 data-set with a VowpalOffPolicyLearner.
VowpalOffPolicyLearner always takes what it believes is the best action... That doesn't mean its belief is correct... If you have a good understanding of epsilon-greedy then another way to think of VowpalOffPolicyLearner is as an epsilon-greedy learner with epsilon=0. That is, it still learns like all other VW learners, there's just no reason to pick an action that you think is bad if you won't get any new information from it.
In this case, the model's beliefs represent the actual truth, right? It sees the true rewards (that we know because it's a simulation) for each action and greedily takes the best one, which results in the optimal policy, no?
What you suggest above I think is the best learner and training method but doesn't answer the question of how to do reward estimation... You need all three to get a good base line: (1) correct learner (2) correct training method (3) correct reward estimator. Maybe we could talk about specific experiments you could run to feel confident in (3)?
In my mind, I don't care about the reward estimation for the optimal performance baseline (or rather benchmark). With this learner being allowed to look at the actual ground truth rewards (tr) and greedily taking the best option, it wouldn't be much of a reward estimate but the realization of the maximum reward accumulation, right?
The focus shifts to the estimators when evaluating how close they (dm, dr, no) come to the performance benchmark established above.
Also I want to say that I'm sorry, we've been at this for a long time. I'm trying to think of a path that gets you directly to results that are useful to you. You are learning a lot and I can clearly see your understanding increasing. The RS stuff may have been a bit of a diversion that will pay off eventually. In the immediate term it sounds like talking about experiments you could run to feel confident in correct reward estimators gets you almost to immediate pay out. Thoughts?
Not at all, I very much appreciate your patience and great explanations that have been more helpful than the VW docs or any other resource for that matter. I think we are very close and I just want to double check that my understanding is correct and we have explanations for odd behaviors. The offline experiments for improvements to our production use-case look promising and we are going to validate them with a real-world experiment soon.
Looking forward to chatting with you about some of the details on Monday.
Sorry for the delay. Updating the plotting code is always way more complex than I expect.
There's one more update I need to make to incorporate every experiment we've done but what I've pushed now is a start.
Here's the experiment code:
import coba as cb
class CycledLearner:

    def __init__(self, learner: cb.Learner) -> None:
        self._learner = learner

    @property
    def params(self):
        return self._learner.params

    def request(self, context, actions, request):
        return self._learner.request(context, actions, request)

    def predict(self, context, actions):
        return self._learner.predict(context, actions)

    def learn(self, context, actions, action, reward, probability):
        action = actions[(actions.index(action)+1)%len(actions)]
        self._learner.learn(context, actions, action, reward, probability)

if __name__ == "__main__":

    n_processes = 8
    n_take = 4_000

    envs = cb.Environments.cache_dir('.coba_cache').from_template('./examples/templates/208_multiclass.json', n_take=n_take)
    logs = envs.logged(CycledLearner(cb.VowpalEpsilonLearner())).shuffle().chunk().ope_rewards([None,'IPS','DM','DR'])

    result = cb.Experiment(logs, cb.VowpalEpsilonLearner()).run(processes=n_processes)

    result.filter_fin(4000).plot_learners(l='ope_reward')
    result.plot_contrast('None',['IPS','DM','DR'],labels=None,c='ope_reward',x='ope_reward',err='sd',boundary=False,legend=False)
You can see that I'm still using CycledLearner. The only reason is to make sure the simulated data isn't too easy.
Pasted below is the second plot from the experiment, which shows the difference between the true value and the estimators across all 208 datasets.
Thanks a lot, Mark! That's one big diff that I'll study some more tomorrow.
The cycled learner is still a bit confusing to me. Doesn't this
def learn(self, context, actions, action, reward, probability):
    action = actions[(actions.index(action)+1)%len(actions)]
    self._learner.learn(context, actions, action, reward, probability)
mean that the model learns the reward that was received for action a_i for a_i+1?
So, for a binary model it would learn all the rewards for action 0 for action 1 and vice versa?
I suspect I am missing something here, but if the intention is to cycle through the actions wouldn't we also take the corresponding reward and probability for the model to learn the right relationships? Or if this is intended to introduce noise, we would only do it for some percentage of observations.
No, you're exactly right. The logging policy is going to do horribly. It's going to learn to play all the wrong actions. It is just to introduce noise. Maybe ShiftedLearner would have been a better name. You'd never want to do it in practice.
The greater the difference between the logging policy and the policy you're evaluating the harder it is to do OPE. So, I was just trying to emulate a quasi-worst case scenario for off-policy evaluation. As you noticed the logged policy is going to learn to play a_i+1, which we know is wrong (but that doesn't matter), while the policy we are evaluating will learn to play a_i. So the two policies are going to look very different.
Another perspective here. When we're doing this OPE stuff we're using mathematical techniques known as Importance Sampling. Importance Sampling is actually very similar to the Rejection Sampling you already saw in the ExplorationEvaluator. Both of the sampling techniques are trying to learn something about a different distribution. Rejection Sampling does this by throwing out data while Importance Sampling does this by scaling data (e.g., this scaling is exactly why IPS can have very high variance). When the logging policy equals the evaluation policy the value we scale by is 1. This means techniques like IPS have no extra variance when logging and evaluation policies match (see the second plot below) but this isn't what we see in practice because we don't normally have perfectly simulated data.
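For concreteness, the importance weight attached to a logged interaction is just the evaluated policy's probability for the logged action divided by the logging policy's probability for it (the numbers below are illustrative):

def importance_weight(p_eval, p_log):
    # The factor IPS multiplies a logged reward by.
    return p_eval / p_log

print(importance_weight(0.5, 0.5))    # matching policies -> weight 1, no added variance
print(importance_weight(0.9, 0.001))  # very different policies -> weight 900, huge variance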
CycledLearner may not be necessary to get meaningful results. Shuffling may be enough. Here is the result of the above experiment with CycledLearner removed. You can see the results are similar to above but IPS looks slightly better here (that is IPS is different from the true value by -.05 to .05 while with the CycledLearner we see the IPS difference from true can get as big as -.15).
And here is the result of the experiment with both CycledLearner and shuffle removed. See that with those removed IPS and DR have almost zero variance because the logging policy and evaluation policy are nearly the same so we scale by 1 almost always.
Thanks for your explanations and sorry for the on-call induced delay! I've had some trouble re-creating your results with the latest Coba commit. Here's the colab notebook (it doesn't actually execute / I just uploaded my locally run notebook for sharing the graphs).
Just to clarify, None includes the actual simulation rewards (rather than estimated values) and we compare the estimators on how closely the rewards of the learners that use them match the learner that uses the actual rewards, right?
The ylim arg seems to not be respected for plot_contrast, which makes it a bit trickier to compare the results, but I am surprised that DM seems to perform best - reward delta closest to 0 with similar spread as the other estimators.
It's doing better in some relevant scenarios like logged random exploration and VowpalOffPolicyLearner, or logged VowpalRndLearner and VowpalSoftmaxLearner offline learning.
An unrelated question, what has worked best for you in processing / normalizing rewards? We've observed different techniques having significant results on the learners' arm selection. I asked in the VW channel but the thread died.
Hey sorry, I'll respond tomorrow. I've been a little busy with my internship. I've set aside a few hours tomorrow to work on this. I'm going to present Coba at the upcoming ICML conference (end of July) so I'd like to get all this fixed before then.
Quick answers:
- I still need to get plot_learners and plot_contrast working with the new changes.
I'll write more tomorrow.
Alright,
I just pushed the final completed notebook showing all the new 'logged' functionality. The experiments in the notebook are the blurbs I've been sending you but now everything is unit tested and fairly stable so hopefully you can reproduce without any issues.
I also looked at the ylim problem and it is going to take a little bit to fix. I'll need to do a lot of testing to make sure any fix I make doesn't break anything else. The plotting stuff is super super fragile. There is a simple work around in the meantime. If you tell it not to output anything by setting out=None you can then use matplotlib's declarative interface to change anything you want manually before plotting. Here's an example:
import matplotlib.pyplot as plt
result1.filter_fin(4000).plot_contrast('None',['IPS','DM','DR'],x='ope_reward',l='ope_reward',p='openml_task',out=None)
plt.xticks(['IPS-None','DM-None','DR-None'],['IPS-GT','DM-GT','DR-GT'])
plt.ylim(-0.01,0.01)
plt.show()
Thanks, Mark!
I ran the notebook and some variations of it on my end with very similar results.
It's a bit surprising how well DM does after seeing how far off it seemed on single experiments.
When replacing the MisguidedLearner with an RND one, DR looks a bit better.
I am wondering what's the best GT definition. The notebook uses an OnPolicyEvaluator for Epsilon Greedy on the environment. How does that compare to using an OffPolicyLearner with OffPolicyEvaluator? The GT shouldn't need to explore but greedily maximize the known rewards, no?
It would be great if you could add an explanation for the implications of the large number of rejected samples in the last experiment of the Logged notebook.
For evaluating different candidate models, which estimator metric are you looking at?
The diff between GT and estimator is smallest for the ExploreEval option but the OffPolicyEvaluator reaches about the same progressive reward.
What's the takeaway for which evaluation technique you should use for a real-world problem for which you don't know the GT (and does it depend on how many samples you have in relation to the complexity of the problem)?
When trying to run the second (EvalExplore) experiment with an RND learner instead of the misguided one there were many instances of these errors (running on a 30 core machine with the same number of processes):
2023-07-06 15:34:19 -- pid-1173 -- Unexpected exception:
File "/mnt/user-home/git/coba/coba/experiments/process.py", line 155, in filter
yield ["T1", env_id, SafeEnvironment(env).params]
File "/mnt/user-home/git/coba/coba/environments/primitives.py", line 165, in params
params = self.environment.params
File "/mnt/user-home/git/coba/coba/pipes/primitives.py", line 77, in params
return resolve_params(list(self))
File "/mnt/user-home/git/coba/coba/pipes/primitives.py", line 52, in resolve_params
params = [p.params for p in pipes if hasattr(p,'params')]
File "/mnt/user-home/git/coba/coba/pipes/primitives.py", line 52, in <listcomp>
params = [p.params for p in pipes if hasattr(p,'params')]
File "/mnt/user-home/git/coba/coba/environments/filters.py", line 1168, in params
return {"learner": SafeLearner(self._learner).params, "logged":True, "log_seed":self._seed}
File "/mnt/user-home/git/coba/coba/learners/safety.py", line 132, in params
params = params if isinstance(params,dict) else params()
TypeError: 'property' object is not callable
2023-07-06 15:34:19 -- pid-1173 -- Unexpected exception:
File "/mnt/user-home/git/coba/coba/experiments/process.py", line 168, in filter
interactions = peek_first(env.read())[1]
File "/mnt/user-home/git/coba/coba/utilities.py", line 136, in peek_first
first = list(islice(items,n))
File "/mnt/user-home/git/coba/coba/environments/filters.py", line 83, in filter
yield from map(methodcaller('copy'), super().filter(items))
File "/mnt/user-home/git/coba/coba/pipes/filters.py", line 437, in filter
current = list(islice(items,n_slice))
File "/mnt/user-home/git/coba/coba/environments/filters.py", line 1247, in filter
interactions = list(interactions)
File "/mnt/user-home/git/coba/coba/environments/filters.py", line 1191, in filter
for interaction, log in zip(interactions,evaluator.evaluate(env,lrn)):
File "/mnt/user-home/git/coba/coba/evaluators/online.py", line 106, in evaluate
action,prob,kwargs = predict(context, actions)
File "/mnt/user-home/git/coba/coba/learners/safety.py", line 174, in predict
pred = self._safe_call('predict', self.learner.predict, (context,actions))
File "/mnt/user-home/git/coba/coba/learners/safety.py", line 154, in _safe_call
return self._safe_call(key, method, args, kwargs)
File "/mnt/user-home/git/coba/coba/learners/safety.py", line 147, in _safe_call
return method(*args,**(kwargs or {}))
TypeError: predict() missing 1 required positional argument: 'actions'
In my latest experiment the ExploreEval (with cinit=0.2 and ope_rewards("DR")) has shown some odd behavior.
At some point the reward just shoots up without a significant change in the dataset's reward, the model's action distribution or its OPE loss.
It affected Softmax and the RND explorer but not other learners such as Epsilon Greedy or SquareCB.
Do you have any idea of what's happening there and how to further troubleshoot it?
Thanks for letting me know about the bug! Easy fix. I just pushed the fix.
Second, do you know if you are running ExploreEval with ope_rewards? That is the only thing I can even imagine would cause that strange behavior.
Regarding the DM experiment in the notebook, I was also super surprised that the DM mean was so close to 0. Nothing about the DM theory suggests that should be the case. I think it was just random chance. On the other hand IPS and DR are theoretically unbiased implying they should approach 0 in the limit.
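For reference, the standard one-line argument (generic notation, nothing coba-specific) for why IPS and DR are unbiased while DM generally isn't: for a context $x$, logged action $a \sim \pi_{\log}(\cdot\mid x)$ and reward $r(x,a)$,

$$\mathbb{E}_{a\sim\pi_{\log}}\!\left[\frac{\pi_{\mathrm{eval}}(a\mid x)}{\pi_{\log}(a\mid x)}\,r(x,a)\right]=\sum_{a}\pi_{\log}(a\mid x)\,\frac{\pi_{\mathrm{eval}}(a\mid x)}{\pi_{\log}(a\mid x)}\,r(x,a)=\sum_{a}\pi_{\mathrm{eval}}(a\mid x)\,r(x,a)=V^{\pi_{\mathrm{eval}}}(x).$$

DR adds a reward model $\hat r$ plus an importance-weighted correction,

$$\hat V_{\mathrm{DR}}(x)=\mathbb{E}_{a'\sim\pi_{\mathrm{eval}}}\!\left[\hat r(x,a')\right]+\frac{\pi_{\mathrm{eval}}(a\mid x)}{\pi_{\log}(a\mid x)}\bigl(r(x,a)-\hat r(x,a)\bigr),$$

and the correction term's expectation exactly cancels the model's bias, so DR stays unbiased for any $\hat r$ (a good $\hat r$ only lowers the variance). DM keeps just the first term, so its bias is whatever error $\hat r$ happens to have.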
Hey also, I think I'm finally going to cut a release in the next week. A lot of the changes I made to make these off-policy experiments simpler are not backwards compatible. I was trying to get all the non-backwards compatible changes done before releasing and that is more or less the case now. Let me know if you have any objections to that. It'll be version 7.0.0 given the breaking changes.
Second, do you know if you are running ExploreEval with ope_rewards? That is the only thing I can even imagine would cause that strange behavior.
That's how the experiment is executed:
offline_environments = cb.Environments.from_dataframe(df_benchmark).ope_rewards("DR")
evaluation = cb.ExplorationEvaluator(cinit=0.2,record=['context','actions','rewards','action','reward','probability','ope_loss'])
offline_result = Experiment(offline_environments, offline_learners, evaluation)\
.config(processes=PROCESS_COUNT)\
.run(
result_file=f"new_features_ope_{REGION}_{datetime.now()}.log",
quiet=True
)
One more note on why cinit needed to be set manually - as the logging data wasn't from a CB and lacked probability information I set the probability as the relative frequency of the action for a given context. That led to the second list ([(1-i['probability'])/(len(i['actions'])-1) for i in first_100]) to contain 0-value elements as the probability for some actions was 100%. Without cinit, c would be initialized to 0 and every sample would be rejected.
first_probs = [i['probability'] for i in first_100] + [(1-i['probability'])/(len(i['actions'])-1) for i in first_100]
c = self._cinit or min(first_probs+[self._cmax])
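For reference, this is roughly how those probabilities get computed on our side (a simplified sketch with made-up column names, not our actual pipeline):
import pandas as pd

#Hypothetical logged data without propensities (column names are illustrative)
df = pd.DataFrame({'context': ['A','A','A','B','B'], 'action': [0, 0, 1, 1, 1]})

#Estimate the logging probability as the relative frequency of the chosen action within its context slice.
#Context slices where a single action was always chosen get probability 1, which is what left the
#complementary-action probabilities at 0 and forced cinit to be set manually.
action_counts  = df.groupby(['context','action'])['action'].transform('size')
context_counts = df.groupby('context')['action'].transform('size')
df['probability'] = action_counts / context_counts

print(df)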
Yay for the new release 🙌 I already saw some notebooks break when checking out the latest source code but I can fix that fairly easily on our end. I'll raise a PR later today to fix some metrics logging that seems to have gotten lost in the shuffle.
I am wondering what's the best GT definition. The notebook uses an OnPolicyEvaluator for Epsilon Greedy on the environment. How does that compare to using an OffPolicyLearner with OffPolicyEvaluator? The GT shouldn't need to explore but greedily maximize the known rewards, no? It would be great if you could add an explanation for the implications of the large number of rejected samples in the last experiment of the Logged notebook. For evaluating different candidate models, which estimator metric are you looking at? The diff between GT and estimator is smallest for the ExploreEval option but the OffPolicyEvaluator reaches about the same progressive reward. What's the takeaway for which evaluation technique you should use for a real-world problem for which you don't know the GT (and does it depend on how many samples you have in relation to the complexity of the problem)?
I was also wondering if you have any guidance on what's the best Ground Truth definition and evaluator based on these experiments?
I remember we talked about how currently there's no support for adding reward labels for multiple actions which would likely accelerate finding the best policy. In the absence of fully annotated examples would you recommend using Epsilon Greedy with OnPolicyEvaluator or does the GT learner depend on which candidate learners you want to evaluate?
Is the ExplorationEvaluator the right choice when evaluating candidate policies that differ from the logging policy or are there scenarios in which you should rather use the OffPolicyEvaluator?
I think there might also be a performance issue with the latest code. Running the same experiment with two learners and 100k observations takes about a minute with the latest release. When pip installing the latest source code and changing
evaluation = cb.ExplorationEvaluation(cinit=0.2,record=['context','actions','rewards','action','reward','probability','ope_loss'])
Experiment(offline_environments, offline_learners, evaluation_task=evaluation)
to
evaluation = cb.ExplorationEvaluator(cinit=0.2,record=['context','actions','rewards','action','reward','probability','ope_loss'])
Experiment(offline_environments, offline_learners, evaluation)
the experiment has been running for over two hours with 100% CPU utilization but its log file is less than 5MB in size.
I'm 95% sure I found the performance problem. I pushed the patch. Please let me know if it doesn't work for you.
One more note on why cinit needed to be set manually - as the logging data wasn't from a CB and lacked probability information I set the probability as the relative frequency of the action for a given context. That led to the second list ([(1-i['probability'])/(len(i['actions'])-1) for i in first_100]) to contain 0-value elements as the probability for some actions was 100%. Without cinit, c would be initialized to 0 and every sample would be rejected.
Your workaround with non-bandit data is interesting. Is the data actually not bandit or do you just not know the probability? I know you weren't asking but I think what you are doing seems appropriate since you're seeing repeated contexts and actions. I just pushed another patch that should make it so you don't have to initialize cinit.
I was also wondering if you have any guidance on what's the best Ground Truth definition and evaluator based on these experiments?
I'm not sure I completely understand this question... You mean what is the best learner to compare against? I'm assuming you are working with logged data. In that case I'd try to beat the logged data. Once you have a learner that can do that then I'd start comparing to the learner that you know beats logged data. I always try to keep one learner that I think of as the current best and do everything I can to beat it. Once I do I retire it and start trying to beat the new best. This is where the VowpalOffPolicyLearner should shine. If you're doing off-policy learning then it should be really really hard to beat the VowpalOffPolicyLearner if you're comparing to VW learners (if not impossible in theory unless they are playing from a different model as in different features).
I remember we talked about how currently there's no support for adding reward labels for multiple actions which would likely accelerate finding the best policy. In the absence of fully annotated examples would you recommend using Epsilon Greedy with OnPolicyEvaluator or does the GT learner depend on which candidate learners you want to evaluate?
I've thought a little more about this. I don't think it'd be that hard. I guess I'm still trying to understand the motivation. What's the goal? To have a really good comparison policy?
Is the ExplorationEvaluator the right choice when evaluating candidate policies that differ from the logging policy or are there scenarios in which you should rather use the OffPolicyEvaluator?
They answer different questions. If you just want to know how good a policy is then OffPolicyEvaluator is best because you don't lose any data. If you want to know how well a policy does when learning online, then ExplorationEvaluator is the one to use. Here's another way to think of it: I noticed you were passing offline_learners to ExplorationEvaluator. It doesn't make sense to pass offline learners to ExplorationEvaluator; you should only pass online learners to it. ExplorationEvaluator is like creating a simulation using logged data. Or, maybe here's an even better way to say it: these two experiments below would answer the same question.
#On-policy: the learner interacts with the simulated environment directly
env = cb.Environments.from_openml(150,take=1000)
lrn = cb.VowpalEpsilonLearner()
result1 = cb.Experiment(env,lrn).run()

#Off-policy emulation: the same learner replays random-policy logs through ExplorationEvaluator
logs = cb.Environments.from_openml(150,take=1000).logged(cb.RandomLearner())
lrn = cb.VowpalEpsilonLearner()
result2 = cb.Experiment(logs,lrn,cb.ExplorationEvaluator()).run()
The first experiment is way way way more data efficient if you're able to actually interact with the environment. However, sometimes we can't do that, and instead can only get logged data from the environment we want to interact with. In that case we can use ExplorationEvaluator with logged data. On average, given enough logged data, result2 should look the same as result1, with some caveats, regardless of what policy was used to create the logs (i.e., we could replace cb.RandomLearner() with any logging policy).
Oh, and you probably noticed but the main change with this release (beyond the plot upgrades) is that I moved all the coba.experiments.tasks out into a new coba.evaluators module. I also added an evaluators Table to results and the interactions table now has an evaluator_id column. This means we can now run an experiment with several different evaluators. This is useful if you want to compare different evaluators in a single experiment (e.g., we could now run an experiment with multiple parameter settings in ExplorationEvaluator to compare those parameters).
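For example, something roughly like this should now be possible (just a sketch; I'm assuming Experiment accepts a list of evaluators the same way it accepts lists of environments and learners, so the exact call may differ):
#Sketch only: compares two ExplorationEvaluator parameter settings in one experiment.
#Assumes Experiment accepts a list of evaluators; the exact signature may differ.
evaluators = [cb.ExplorationEvaluator(cinit=0.1), cb.ExplorationEvaluator(cinit=0.3)]
result = cb.Experiment(offline_environments, offline_learners, evaluators).run()

#Results from both settings land in the same result object; the new evaluators table
#and the evaluator_id column on interactions are what tell them apart.
result.plot_learners()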
I'm wrapping up some of my own experiments for another paper tomorrow and will likely cut the 7.0.0 coba release this weekend, assuming I don't hit any more bugs in the next few days with my own testing and experiments.
Great! The performance issues for ExploreEval are fixed and I removed the manual cinit.
Your workaround with non-bandit data is interesting. Is the data actually not bandit or do you just not know the probability? I know you weren't asking but I think what you are doing seems appropriate since you're seeing repeated contexts and actions. I just pushed another patch that should make it so you don't have to initialize cinit.
The logging data is from multiple non-contextual bandits that effectively form a CB (one bandit per context slice). Conceptually, the data should be similar but we don't have logged probabilities.
The distributions for reward, action and probability look fairly straightforward, but the ExplorationEvaluator results are erratic.
We ran a very similar experiment on live production data and the CB's reward was within 2-3% of the non-contextual one. OPE thinks that the CBs would perform as much as 70% better.
Here's an example of the whole data, and one truncated to the relevant rejection sampling period:
I was curious about VW's internal OPE estimates and tried to create a random VW baseline with cb.VowpalEpsilonLearner(epsilon=1.0). The ope_loss difference to the best learner is 11% while the reward is 66% higher.
This is what the OffPolicyEvaluator looks like on the same data.
Do you have an idea why the Coba reward estimates are so much higher?
Hmm... That's very concerning... A few thoughts...
- ope_loss: as best I can tell it is simply the sum of all the loss VW has seen so far. That means it is more or less parroting the dataset. You could actually see this if you collect ope_loss on the OffPolicyEvaluator experiments and plotted it for all the learners. If you did this you should see two things: (1) every single learner in the OffPolicyEvaluator experiments should have identical ope_loss and (2) the negative of this loss (i.e., negative loss=reward) should perfectly match the dataset reward.
- Why is ope_loss lower in the ExplorationEvaluator experiments (and why is ope_loss not all equal)? This is another interesting side-effect. If you were to run an experiment with ExplorationEvaluator(ope=False) then the learner rewards should be the negative of ope_loss. The ope_loss is not equal for all learners because we are emulating exploration, so learners end up getting different losses from each other, which doesn't happen in the OffPolicyEvaluator.

So, my take away from all this is that the off-policy reward estimator seems to be doing a horrible job. I have no idea why... If you run ExplorationEvaluator(ope=False) then the ope rewards aren't needed at all. That will allow you to get some estimates.
Why are the reward estimators doing so poorly? I don't know... My first guess is that the DM regressor is doing really poorly (DR is a combination of DM and IPS so if DM does really bad DR does bad). Remember our past conversations where I said the problem with DM is that it is a regressor and it is really hard to know how well it is actually doing, especially on really hard real world problems? I suspect that is what is happening here. You could go look in ope_rewards and see how I make the regressor. I'm just using VW to learn a regression. It's nothing too exotic. If you got the regressor directly you could do more traditional analysis on it. You could also get a sense of how well it is doing without going into ope_rewards with something like this (this is quasi-pseudocode but hopefully it makes it clear):
logs = cb.Environments.from_dataframe(logged_dataframe).ope_rewards('dm')

errors = []
for interaction, (_, df_row) in zip(logs[0].read(), logged_dataframe.iterrows()):
    #iterrows() yields (index, row) pairs; assumes the dataframe has 'action' and 'reward' columns
    df_action, df_reward = df_row['action'], df_row['reward']
    errors.append(abs(interaction['rewards'].eval(df_action) - df_reward))

#Now we can calculate the mean absolute error of DM.
#This number is going to be optimistic because we are testing
#with our training data, so if it is large then that is really bad...
print(sum(errors)/len(errors))
If you do that and the number doesn't look so bad then the problem is with the probabilities you're calculating. What you describe seems very reasonable and I don't think the probabilities are a problem but something is definitely wrong so who knows.
Thanks for looking into this, Mark!
The use-case is fairly tricky with not the greatest features and a good amount of noise. I've been spinning my wheels a bit going back and forth between evaluating the estimators, the learners and the actual application - probably not the most conducive to learning how all the pieces go together, so I extra appreciate your time looking over it.
- You could actually see this if you collect ope_loss on the OffPolicyEvaluator experiments and plotted it for all the learners. If you did this you should see two things: (1) every single learner in the OffPolicyEvaluator experiments should have identical ope_loss and (2) the negative of this loss (i.e., negative loss=reward) should perfectly match the dataset reward.
The average action and probability are identical but the ope_loss varies a bit across learners. I thought the ope_loss was the result of VW's internal MTR estimator.
Here's the ope_loss plotted against the reward:
If you were to run an experiment with ExplorationEvaluator(ope=False) then the learner rewards should be the negative of ope_loss.
Indeed, both are very similar with cb.ExplorationEvaluator(ope=False). The diff shows quite a bit of discrepancy, but this looks like an off-by-one index error.
So, my take away from all this is that the off-policy reward estimator seems to be doing a horrible job. I have no idea why... If you run ExplorationEvaluator(ope=False) then the ope rewards aren't needed at all. That will allow you to get some estimates.
The results with ope=False look much more reasonable, with a performance gain of about 10% on which reward and ope_loss agree.
You could also get a sense of how well it is doing without going into ope_rewards with something like this (this is quasi-pseudocode but hopefully it makes it clear)
import pandas as pd

#Absolute error between the DM reward estimate and the logged reward for each interaction
diffs = [abs(interaction['rewards'].eval(row.action) - row.reward)
         for interaction, (_, row) in zip(logs[0].read(), df_benchmark.iterrows())]

pd.Series(diffs).hist()
With the rewards all between 0 and 1, I assume DM being off by 33% on average is pretty bad.
Running ExploreEval with IPS looks even worse with all learners underperforming random 😅
Is the conclusion from this that the estimators are struggling with my data and that running cb.ExplorationEvaluator(ope=False) is the best way to go?
Yeah, here's another experiment showing the similarity of logged rewards and ope_loss over several vw learners.
I'm not really sure why the VW learners don't line up perfectly. There's some weirdness going on inside of VW. If you remove prediction when running OffPolicyEvaluator then the VW learner's ope_loss perfectly matches the logged dataset loss.
I agree with you. The 0.33 value seems high for DM (especially considering you're basically testing on your training data in that experiment, so it's going to be overly optimistic). I think part of the problem is that your dataset is pretty clearly non-stationary and I currently have things set up for stationarity... Unfortunately, there's not really a single correct way to do DM. If non-stationarity is the main problem one possible easy solution could be to modify line 1267 in environment filters and set power_t to 0. After that you can rerun the DM experiment above and see if it improves. You could also play with different features, maybe --interactions xxxa?
In fact, I think that is also part of your problem with why the learners do so poorly with ExplorationEvaluator(ope=False). Remember the ExplorationEvaluator throws out a bunch of data. So, your learners are basically seeing a very sped up version of time. For example, in this plot the dip in learner performance is probably because they have begun to receive data from the huge dip in the black line.
That also means that non-stationarity happens very quickly. For example, instead of seeing 1,000 examples before reward dynamics shift they might only see 10 and then everything changes. It's hard to learn that fast. The learners don't really have much time to exploit what they've learned before things start to change.
Just curious, have you tried using just cb.VowpalOffPolicyLearner with ope_rewards('ips') and OffPolicyEvaluator()? You don't have to worry about DM, you don't have to worry as much about non-stationarity because you won't be throwing data out, and given the amount of data you do have the IPS estimate will probably be pretty good.
If non-stationarity is the main problem one possible easy solution could be to modify line 1267 in environment filters and set power_t to 0. After that you can rerun the DM experiment above and see if it improves. You could also play with different features, maybe --interactions xxxa?
I am not sure if non-stationarity is the problem or if there's just very little signal above the noise.
The DM error metrics were very similar with learners configured like cb.VowpalSoftmaxLearner(features=[1, 'x', 'a', 'ax', 'axx', 'axxx'], power_t=0).
In fact, I think that is also part of your problem with why the learners do so poorly with ExplorationEvaluator(ope=False). Remember the ExplorationEvaluator throws out a bunch of data. So, your learners are basically seeing a very sped up version of time. For example, in this plot the dip in learner performance is probably because they have begun to receive data from the huge dip in the black line.
That makes sense, and I have seen that the ExploreEval plots show a compressed behavior of the underlying data in many of my experiments.
Just curious, have you tried using just cb.VowpalOffPolicyLearner with ope_rewards('ips') and OffPolicyEvaluator()? You don't have to worry about DM, you don't have to worry as much about non-stationarity because you won't be throwing data out, and given the amount of data you do have the IPS estimate will probably be pretty good.
That setup worked poorly, with all learners underperforming random.
Some of the issues seem to be related to this dataset. OPE on the logging data of the CB from the most recent real-world experiment that just concluded looks much more stable, with modest performance gains that are more realistic.
With ExploreEval, however, about 95% of the data was rejected.
When filtering down to one context slice there isn't much apparent convergence, though.
Running the OffPolicyEvaluator on the data yields similar results:
I am not quite sure how much I can rely on OPE for my problem and what's the best path forward.
Maybe running on a couple of copies of shuffled data with .shuffle(n=10) would help stabilize the experiments. In that case we couldn't model the non-stationarity, but we have temporal features like day of week that might compensate for some of the fluctuations.
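Roughly what I have in mind (sketch only, reusing the earlier placeholders; I haven't checked whether .shuffle(n=10) is best applied before or after ope_rewards):
#Ten shuffled copies of the logged data, each with DR reward estimates,
#so learner/estimator variance can be averaged over the copies.
offline_environments = cb.Environments.from_dataframe(df_benchmark).shuffle(n=10).ope_rewards("DR")

offline_result = Experiment(offline_environments, offline_learners, cb.ExplorationEvaluator())\
    .config(processes=PROCESS_COUNT)\
    .run()

offline_result.plot_learners()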
If the issue is less about stationarity and more about the volume of data, up-sampling the logging data might also help the model to converge. I'm not sure how well that would translate to the real world.
Hi Mark,
We currently use the LoggedInteraction's IPS estimator to compare the accumulated reward of VW models with non-VW baselines, such as the random policy, to analyze if there's something for the model to learn. The variance of the reward estimates is so high, though, that I am worried about the reliability of making a decision based on the metric. For an example with rewards normalized to [0, 1], there are plenty of estimates that are 1-2 orders of magnitude off.
These plots compare the production policy rewards with that of the VowpalOffPolicyLearner learning on it and a random policy across different rolling average window sizes.
Would a more advanced estimator such as Doubly Robust help here, and would it be reasonable to implement it as an alternative to the IPS one in LoggedInteraction, or is there a better way to gauge if the model is learning anything meaningful / performs better than random?