SforAiDl / genrl

A PyTorch reinforcement learning library for generalizable and reproducible algorithm implementations with an aim to improve accessibility in RL
https://genrl.readthedocs.io
MIT License

Class definition of Contextual Bandit #197

Closed Devanshu24 closed 4 years ago

Devanshu24 commented 4 years ago

https://github.com/SforAiDl/genrl/blob/b2f1ae4185045fc30c113144670f38f9d49b9983/genrl/classical/bandit/contextual_bandits.py#L8

Should this be "Base Class for a Contextual Bandit"?

I'm guessing the docstring above was left unchanged when it was copied from this (the classical bandit): https://github.com/SforAiDl/genrl/blob/b2f1ae4185045fc30c113144670f38f9d49b9983/genrl/classical/bandit/bandits.py#L8

Not sure, though. I was a little confused about the difference between the contextual and non-contextual bandits.

sampreet-arthi commented 4 years ago

Yeah, it's just a typo in this case. To put it in the simplest terms, Multi-Armed Bandits (normal ones) don't take the state of the environment into account when making decisions. Contextual Bandits do.

Multi-Armed Bandit Policy: https://github.com/SforAiDl/genrl/blob/b2f1ae4185045fc30c113144670f38f9d49b9983/genrl/classical/bandit/policies/base.py#L76

Contextual Bandit Policy: https://github.com/SforAiDl/genrl/blob/b2f1ae4185045fc30c113144670f38f9d49b9983/genrl/classical/bandit/contextual_policies/base.py#L75
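To make the difference concrete, here's a rough tabular sketch (illustrative only, not the actual genrl classes; the names `mab_quality` and `ctx_quality` are made up) of how eps-greedy action selection differs between the two:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n_arms, n_bandits = 0.1, 5, 3

# Multi-armed bandit: one quality estimate per arm, no state/context involved.
mab_quality = np.zeros(n_arms)

def mab_select_action():
    if rng.random() < eps:
        return int(rng.integers(n_arms))          # explore
    return int(np.argmax(mab_quality))            # exploit the single row of estimates

# Contextual bandit: one row of quality estimates per context,
# so the context (the "state") picks which row you argmax over.
ctx_quality = np.zeros((n_bandits, n_arms))

def contextual_select_action(context):
    if rng.random() < eps:
        return int(rng.integers(n_arms))          # explore
    return int(np.argmax(ctx_quality[context]))   # exploit the row for this context
```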

sampreet-arthi commented 4 years ago

Can we close this? @Devanshu24

Devanshu24 commented 4 years ago

Multi-Armed Bandits (normal ones) don't take the state of the environment into account when making decisions. Contextual Bandits do.

Correct, so to make an analogy, the number of arms is the action_space dimension of the environment. But I couldn't understand why there are multiple bandits (self._nbandits) in the Contextual Bandits base class.

If I look at the eps-greedy policy of both of them:

Meaning the context is acting as the state of the environment in the lookup table. However, how are we utilising the multiple bandits? I did find there is something called curr_bandit, but I didn't understand how it was being used.

sampreet-arthi commented 4 years ago

The Contextual Bandits problem is a little different from the normal MAB setting. Think of it as a player playing a MAB, except there are multiple bandits and you're only playing one bandit at a time. This is useful in settings where, for example, you want to recommend ads to users. Each user is a different MAB (you have only n ads to show) and there is some number of different users. Refer to the MAB chapter in Sutton and Barto if you need clarification.

The curr_bandit basically tells you which n-armed bandit you're currently playing. The actions are the same each time you play, but the action space may have different meanings (for example, a different set of the same number of ads).
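As a rough sketch of that idea (a hypothetical toy environment, not the genrl ContextualBandit class): each "bandit" could be a user with its own reward probabilities, and the active bandit is re-drawn every step.

```python
import numpy as np

class ToyContextualBandit:
    """Toy environment: several n-armed bandits, and on each step you face
    exactly one of them. Which one you face is the 'context'."""

    def __init__(self, n_bandits=3, n_arms=5, seed=0):
        self.rng = np.random.default_rng(seed)
        # One row of Bernoulli reward probabilities per bandit (e.g. per user).
        self.reward_probs = self.rng.random((n_bandits, n_arms))
        self.n_bandits = n_bandits
        self._curr_bandit = int(self.rng.integers(n_bandits))

    def step(self, action):
        # The reward depends on which bandit (user) is currently active.
        reward = float(self.rng.random() < self.reward_probs[self._curr_bandit, action])
        # The active bandit is re-drawn randomly for the next step.
        self._curr_bandit = int(self.rng.integers(self.n_bandits))
        return self._curr_bandit, reward
```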

@threewisemonkeys-as could give better "context" maybe :p

threewisemonkeys-as commented 4 years ago

To add to this, the reason for doing action = np.argmax(self.quality[context]) is that in classical bandit algorithms you just have a lookup table. So in contextual bandit policies you are basically solving multiple independent MABs. It's a simple extension of the single-bandit problem that the regular bandit policies solve. As Sampreet said, _curr_bandit is used by the bandit class to keep track of which bandit is being played. Every time a step is taken, _curr_bandit is updated randomly.
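Put together, a self-contained tabular loop might look like this (again just a sketch with made-up names, not the genrl trainer; it uses a sample-average update, one estimate per (context, action) cell):

```python
import numpy as np

rng = np.random.default_rng(1)
n_bandits, n_arms, eps = 3, 5, 0.1
reward_probs = rng.random((n_bandits, n_arms))   # hidden environment parameters
quality = np.zeros((n_bandits, n_arms))          # the lookup table
counts = np.zeros((n_bandits, n_arms))

context = int(rng.integers(n_bandits))           # which bandit we start on
for _ in range(10_000):
    # eps-greedy over the row belonging to the current bandit/context
    if rng.random() < eps:
        action = int(rng.integers(n_arms))
    else:
        action = int(np.argmax(quality[context]))
    reward = float(rng.random() < reward_probs[context, action])
    counts[context, action] += 1
    # Sample-average update: each (context, action) cell is its own independent MAB estimate.
    quality[context, action] += (reward - quality[context, action]) / counts[context, action]
    context = int(rng.integers(n_bandits))       # the current bandit is re-drawn randomly each step
```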

All of this might seem a little redundant, since we just have a lookup table in the end. However, in real life your context size might be huge, so lookup tables are not feasible and you need a more powerful contextual bandit solving algorithm. A simple one is a feed-forward neural network which takes the context as input and outputs a suitable action. This is very different from the simple single-MAB problem that the bandit_policies are solving.
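Something like this, roughly (a minimal PyTorch sketch, not the genrl deep contextual bandit API; `NeuralBanditPolicy` is a made-up name): the network replaces the lookup table by mapping a context feature vector to a score per arm.

```python
import torch
import torch.nn as nn

class NeuralBanditPolicy(nn.Module):
    """Maps a context feature vector to one predicted value per arm."""

    def __init__(self, context_dim, n_arms, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_arms),
        )

    def forward(self, context):
        return self.net(context)  # shape: (batch, n_arms)

policy = NeuralBanditPolicy(context_dim=32, n_arms=5)
context = torch.randn(1, 32)                      # some rich context features
action = int(policy(context).argmax(dim=-1))      # greedy action from the network
```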

You can check out sections 1 and 2 of this paper, or this much simpler paper, to get a better idea.

Devanshu24 commented 4 years ago

Thanks a lot, Atharv and Sampreet! I'll go through the papers and the chapter to get better insight. I'll close this issue; if I have any other queries about the theory, I'll reach out via Slack. Thanks again!