VowpalWabbit / coba

Contextual bandit benchmarking
https://coba-docs.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Off-policy learners #17

Closed sanathsk009 closed 2 years ago

sanathsk009 commented 2 years ago

Hi,

Thanks for this cool project and for making contextual bandit simulations easier!

I had a few questions regarding off-policy learning: (some of this may be already addressed by VowpalOffPolicyLearner)

mrucker commented 2 years ago

Hi!

Yes, the repo is a work in progress, and the off-policy functionality is, unfortunately, particularly anemic. (I created the current off-policy functionality several months ago, when I thought I was going to work on an off-policy project, but that ended up not happening.)

My big project for the summer is to beef up the documentation. I know that doesn't help you much in the short term. If you want to improve the off-policy functionality in the code base, I think that'd be great. The big pieces to look at there would be:

  1. coba.experiments.tasks.OnlineOffPolicyEvalTask -- This contains some evaluation logic for off-policy learning
  2. coba.environments.logged.LoggedEnvironment -- This contains more information about each interaction than SimulatedInteraction does

Putting it all together would look something like:

from coba.learners import VowpalEpsilonLearner
from coba.environments import LoggedEnvironment
from coba.experiments import Experiment, OnlineOffPolicyEvalTask

environments = [
    LoggedEnvironment()  # this is pseudo-code; you'd have to do more to actually create a LoggedEnvironment
]
learners = [
    VowpalEpsilonLearner()
]

# evaluate each learner on the logged data using the off-policy evaluation task
result = Experiment(environments, learners, evaluation_task=OnlineOffPolicyEvalTask()).evaluate()
result.plot_learners()
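
For a rough sense of what off-policy evaluation logic computes, here is a minimal sketch of the standard inverse propensity scoring (IPS) estimator in plain Python. This is not coba's implementation and uses no coba classes; the names ips_value, logged, and always_zero, and the tuple layout of each logged interaction, are assumptions made up for this illustration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical layout for one logged interaction (an assumption for this sketch):
# (context, logged_action, probability the logging policy chose it, observed reward)
LoggedInteraction = Tuple[dict, int, float, float]

def ips_value(
    logged: Sequence[LoggedInteraction],
    target_prob: Callable[[dict, int], float],
) -> float:
    """Estimate a target policy's average reward from logged data using IPS."""
    total = 0.0
    for context, action, prob, reward in logged:
        # Reweight each observed reward by how much more (or less) likely the
        # target policy is to take the logged action than the logging policy was.
        total += target_prob(context, action) / prob * reward
    return total / len(logged)

# Toy log: the logging policy picked between two actions uniformly at random.
logged = [
    ({"x": 1}, 0, 0.5, 1.0),
    ({"x": 2}, 1, 0.5, 0.0),
    ({"x": 3}, 0, 0.5, 1.0),
    ({"x": 4}, 1, 0.5, 1.0),
]

def always_zero(context: dict, action: int) -> float:
    """Target policy that always plays action 0."""
    return 1.0 if action == 0 else 0.0

estimate = ips_value(logged, always_zero)
```

Only the interactions where the logged action matches what the target policy would have played contribute, which is why the logged probabilities have to be recorded alongside each action and reward.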

With regard to the cost-sensitive classification... I'm not sure, but I think Vowpal Wabbit's --cb_adf reduces to that? I may be totally wrong about that. If it does, then any of the Vowpal learners in Coba will use cost-sensitive classification. If it doesn't, then you just need to create your own VW learner that handles formatting and retrieving the data from VW. VowpalArgsLearner is a good base class to look at in that regard. I wouldn't get too distracted by the VowpalOffPolicyLearner. It doesn't do anything special. It was placed there so that, in the future, we could add more unique off-policy functionality to it without breaking the other VW learners.

I hope all that helps. Sorry the off-policy functionality and documentation are currently not in the best state. All my work is on-policy, so that is where most of my time and effort has gone (I add functionality to coba as I need it for my own research).
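
To give an idea of the formatting side of a custom VW learner, here is a minimal sketch that turns one logged interaction into VW's multi-line text format for --cb_adf: one "shared" line for the context, then one line per action, with an action:cost:probability label on the action that was logged. The helper name to_vw_cb_adf and its parameters are hypothetical, and this is not how VowpalArgsLearner does it internally; it only illustrates the text format.

```python
from typing import List, Optional, Sequence

def to_vw_cb_adf(
    shared_features: Sequence[str],
    action_features: Sequence[Sequence[str]],
    chosen_index: Optional[int] = None,
    cost: Optional[float] = None,
    probability: Optional[float] = None,
) -> str:
    """Format one interaction as a VW --cb_adf multi-line example (hypothetical helper)."""
    # The first line carries the features shared by every action (the context).
    lines: List[str] = ["shared | " + " ".join(shared_features)]
    for i, feats in enumerate(action_features):
        label = ""
        if chosen_index is not None and i == chosen_index:
            # Only the logged action gets a label. With action-dependent features
            # the action id in the label is a placeholder, so 0 is used here.
            label = f"0:{cost}:{probability} "
        lines.append(label + "| " + " ".join(feats))
    return "\n".join(lines)

# The logging policy chose the second action with probability 0.5 and saw cost 0.0.
example = to_vw_cb_adf(
    ["user=tom"],
    [["article=sports"], ["article=politics"]],
    chosen_index=1,
    cost=0.0,
    probability=0.5,
)
print(example)
```

Leaving chosen_index unset produces an unlabeled example, which is the form to use when asking VW for a prediction rather than teaching it. VW works with costs rather than rewards, so a reward would typically be negated before being passed in as cost.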

sanathsk009 commented 2 years ago

Thanks for all this information and pointers!