VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Handling contextual bandit problem with continuous action space #2427

Closed: hardianlawi closed this issue 3 years ago

hardianlawi commented 4 years ago

Description

After reading the documentation, I couldn't find any information about continuous action spaces. I am wondering how I should handle this. If I were to discretize the action space, how should I discretize it, how many discrete values should I use, etc.?

Link to Documentation Page

Where is the documentation in question?

jackgerrits commented 4 years ago

Hi @hardianlawi, native continuous action support is being worked on currently. Would you be able to elaborate on what guidance you'd like to see for documentation around native continuous action support? As well as documentation about using discretization?

hardianlawi commented 4 years ago

From what I understand, there are different ways of dealing with continuous action spaces. I'm not sure which one is being worked on currently.

What I think would be helpful is:

  1. How do we deal with a continuous action space? Do we discretize the action space, or parameterize it by assuming some continuous distribution such as a Gaussian?
  2. If we discretize it, how do we deal with the discretization error? Do we use the zooming approach by assuming Lipschitz continuity? Also, what is the recommended exploration method here?
  3. If we use a continuous distribution, how does the model learn and explore? Is it similar to one-step policy gradient?
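On question 2, the standard back-of-envelope argument: if the expected cost is L-Lipschitz in the action, discretizing the range [lo, hi] into K equal bins and snapping every action to its bin center changes the cost by at most L·(hi − lo)/(2K). A minimal sketch (function names are mine for illustration, not part of VW):

```python
# Hypothetical illustration (not VW code): uniform discretization of a
# continuous action range, and the worst-case cost change it can cause
# under an assumed Lipschitz constant.

def discretize(lo: float, hi: float, k: int) -> list[float]:
    """Centers of k equal-width bins covering [lo, hi]."""
    width = (hi - lo) / k
    return [lo + width * (i + 0.5) for i in range(k)]

def max_discretization_error(lo: float, hi: float, k: int, lipschitz: float) -> float:
    """If the cost is `lipschitz`-Lipschitz in the action, snapping any
    action to the nearest bin center moves the cost by at most this."""
    return lipschitz * (hi - lo) / (2 * k)

actions = discretize(0.0, 100.0, 10)  # bin centers 5.0, 15.0, ..., 95.0
bound = max_discretization_error(0.0, 100.0, 10, lipschitz=0.2)
```

Doubling K halves the bound, but also doubles the number of arms to explore, which is exactly the trade-off the zooming-style approaches try to manage adaptively.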

JohnLangford commented 4 years ago

1) We have a paper here: https://arxiv.org/abs/1902.01520 looking at an approach which I believe in. In essence, you create a continuous distribution to enable counterfactual evaluation of different strategies. @rajan-chari has been working on implementing a practical version of this, which should be merged fairly soon.

3) It is not necessarily one-step policy gradient, because you want to achieve relatively high precision in your choice of action efficiently. This benefits from some logarithmic-time prediction approaches.
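The "continuous distribution for counterfactual evaluation" point can be sketched roughly as follows. This is my own illustration of the smoothed inverse-propensity idea, not VW code: the function names, the boxcar kernel, and the deterministic target policy are simplifying assumptions.

```python
# Sketch (not VW internals): off-policy evaluation with continuous
# actions. Each logged record stores the *density* under which the
# action was drawn; the target policy's cost is estimated by kernel-
# smoothing the "did the target policy pick this action?" indicator.

def smoothed_ips(logged, target_policy, bandwidth):
    """logged: list of (context, action, cost, logged_density) tuples.
    target_policy: context -> action (deterministic, for simplicity).
    Returns an estimate of the target policy's expected cost."""
    total = 0.0
    for x, a, cost, density in logged:
        # Boxcar kernel: mass 1/(2h) on [target - h, target + h], else 0.
        if abs(target_policy(x) - a) <= bandwidth:
            total += cost * (1.0 / (2 * bandwidth)) / density
    return total / len(logged)
```

The bandwidth plays the same role as the discretization width above: a smaller bandwidth means less smoothing bias but a higher-variance estimate.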

duburcqa commented 4 years ago

@JohnLangford Do you have any idea what "fairly soon" means? I was about to start implementing it on my own, but if someone is already doing it, maybe I could wait a little longer.

olgavrou commented 3 years ago

Hi @hardianlawi @duburcqa, you might want to look at the CATS reduction for continuous action spaces here, which was added recently.
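For anyone landing here, a rough sketch of what feeding CATS looks like. As far as I can tell from the CATS tutorial, the continuous-action label format is `ca action:cost:pdf_value`, and training uses flags like `--cats`, `--min_value`, `--max_value`, and `--bandwidth`; the helper function and all numeric values below are illustrative placeholders, not recommendations.

```python
# Illustrative only: building VW "ca" (continuous action) input lines
# for the CATS reduction. The label carries the action taken, its
# observed cost, and the density under which it was logged.

def cats_example(action: float, cost: float, pdf_value: float, features: str) -> str:
    """Format one VW continuous-action example line."""
    return f"ca {action}:{cost}:{pdf_value} | {features}"

line = cats_example(185.12, 0.657, 6.2e-05, "temperature:0.5 humidity:0.3")

# Training would then look something like (flag values are placeholders):
#   vw --cats 32 --min_value 0 --max_value 1000 --bandwidth 30 data.txt
```

Here `--cats` sets the number of discretized actions the tree operates over, while `--bandwidth` controls how much credit an observed action shares with its neighbors in the continuous range.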

olgavrou commented 3 years ago

You can find a new reduction, CBZO, in master here: a contextual-bandit-style algorithm meant for a multi-dimensional, continuous action space.

Closing this issue, but feel free to open again if you have more questions.

duburcqa commented 3 years ago

Thank you for following up on this issue. I found another way to solve my problem in the meantime, but I'm happy to see this moving forward.

sumpfork commented 3 years ago

> You can find a new reduction, CBZO, in master here: a contextual-bandit-style algorithm meant for a multi-dimensional, continuous action space.
>
> Closing this issue, but feel free to open again if you have more questions.

Hi @olgavrou, do you know whether there are plans to extend the implementation of CBZO in VW to the multidimensional action case?

olgavrou commented 3 years ago

@ajay0 do you have any plans regarding extending cbzo (see above)?

ajay0 commented 3 years ago

@olgavrou @sumpfork sorry for the delayed response, I somehow missed the @-mention.

We do have plans to extend to the multidimensional case. We are also planning to include a tree policy (in addition to the constant and linear policies currently available), where the action to take is decided by a decision-tree-like model instead of a linear or constant model. Unfortunately, I don't have a timeline I can give yet.