david-cortes / contextualbandits

Python implementations of contextual bandits algorithms
http://contextual-bandits.readthedocs.io
BSD 2-Clause "Simplified" License

Question about using contextual bandits in specific case #46

Closed · danielstankw closed 2 years ago

danielstankw commented 2 years ago

Hi everyone, I am working on solving a peg-in-hole problem. Initially I started with an RL approach, but it seems it's not the right approach for my problem.

Task description

Setup: a robot placed on a table, a board with a cylindrical hole on the table, and a cylindrical peg.

Task: use the robot to insert the peg into the cylindrical hole in the board, despite a small error in the exact location of the hole.

Description: in order to accomplish the task, the parameters of a controller need to be learned. The goal is to learn one set of parameters per episode that solves the problem for a given radius of the peg and hole, with an ability to generalise to other sizes. Because only one set of controller parameters is chosen per episode, the problem is not really an RL problem but closer to a contextual bandit problem: states are not fed to the policy at each timestep; instead, a context (the position of the hole) is fed to the policy only at the beginning of each episode. Given the context, the policy outputs an action (the controller parameters), which is used throughout the episode. During the episode a reward is calculated at each timestep; at the end of the episode the rewards are summed, and that sum should be used to update the policy.
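For concreteness, here is a minimal sketch of that episode-level loop using this library, under several assumptions not stated above: the controller parameters are discretized into a fixed menu of candidate settings (the arms), the context is the estimated hole position, and the summed episode reward is binarized into success/failure, since the classifier-based online policies in this package expect rewards in {0, 1}. `run_episode`, `param_menu`, and the warm-up scheme are all hypothetical; the warm-up only gives each arm some data before the first fit, and the library's `beta_prior` defaults also help with cold-start arms.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from contextualbandits.online import BootstrappedUCB

rng = np.random.default_rng(0)

# Hypothetical menu of candidate controller settings (e.g. stiffness, damping);
# each row is one "arm" of the bandit.
param_menu = np.array([[kp, kd] for kp in (5.0, 10.0, 20.0)
                                for kd in (0.5, 1.0)])

def run_episode(hole_xy, params):
    # Toy stand-in for a robosuite rollout: returns 1 on successful insertion,
    # 0 otherwise. Replace with the real episode, summing the per-timestep
    # rewards and thresholding the total into success/failure.
    p_success = np.exp(-0.01 * (params[0] - 10.0) ** 2
                       - (params[1] - 1.0) ** 2)
    return int(rng.random() < p_success)

policy = BootstrappedUCB(LogisticRegression(), nchoices=len(param_menu))

X_hist, a_hist, r_hist = [], [], []
for episode in range(200):
    hole_xy = rng.uniform(-0.01, 0.01, size=2)    # context: estimated hole position
    if episode < 20:
        arm = int(rng.integers(len(param_menu)))  # random warm-up episodes
    else:
        arm = int(policy.predict(hole_xy.reshape(1, -1))[0])
    reward = run_episode(hole_xy, param_menu[arm])  # one parameter set per episode
    X_hist.append(hole_xy)
    a_hist.append(arm)
    r_hist.append(reward)
    # Refit on the full history after every episode; a streaming setup could
    # instead use partial_fit with a base classifier that supports it.
    policy.fit(np.array(X_hist), np.array(a_hist), np.array(r_hist))
```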

As I am more familiar with the RL approach, I was wondering if someone more experienced could advise whether using contextual bandits is the right way to go and, if so, which algorithm you would recommend. For simulation I use robosuite, which has a gym-like structure.

Thank you for your help.

david-cortes commented 2 years ago

I think you might have better luck asking in some stackexchange subsite.

That said, I don't think stateless bandits would apply to your problem: from what I gather, you only see a reward at the end of some button sequence, the length of that sequence can vary with the problem size, and you don't have any information about the positioning of the robot/peg/hole, so you don't have the components needed to model it as a bandit or contextual bandit problem.