david-cortes / contextualbandits

Python implementations of contextual bandits algorithms
http://contextual-bandits.readthedocs.io
BSD 2-Clause "Simplified" License

Question about using contextual bandits in specific case #46

Closed · danielstankw closed 2 years ago

danielstankw commented 2 years ago

Hi everyone, I am working on solving a peg-in-hole problem. Initially I started with an RL approach, but it seems it's not the right approach for my problem.

Task description

Setup: a robot placed on a table, a board with a cylindrical hole on the table, and a cylindrical peg.

Task: use the robot to insert the peg into the cylindrical hole in the board, despite a small error in the exact location of the hole.

Description: in order to accomplish the task, the parameters of a controller need to be learned. The goal is to learn one set of parameters per episode that solves the problem for a given radius of the peg and hole, with an ability to generalise to other sizes. Because only one set of controller parameters is chosen per episode, the problem is not really an RL problem but closer to a contextual bandit problem: states are not fed to the policy at each timestep; instead, a context (the position of the hole) is fed to the policy only at the beginning of each episode. Given the context, the policy outputs an action (the controller parameters), which is used throughout the episode. During the episode a reward is calculated at each timestep; at the end of the episode the rewards are summed, and that sum should be used to update the policy.
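For concreteness, here is a minimal sketch of that episode-level loop using this library, under several assumptions not stated above: the controller parameters are discretized into a fixed menu of candidate settings (the arms), the context is the estimated hole position, and the summed episode reward is binarized into success/failure, since the classifier-based online policies in this package expect rewards in {0, 1}. `run_episode`, `param_menu`, and the warm-up scheme are all hypothetical; the warm-up only gives each arm some data before the first fit, and the library's `beta_prior` defaults also help with cold-start arms.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from contextualbandits.online import BootstrappedUCB

rng = np.random.default_rng(0)

# Hypothetical menu of candidate controller settings (e.g. stiffness, damping);
# each row is one "arm" of the bandit.
param_menu = np.array([[kp, kd] for kp in (5.0, 10.0, 20.0)
                                for kd in (0.5, 1.0)])

def run_episode(hole_xy, params):
    # Toy stand-in for a robosuite rollout: returns 1 on successful insertion,
    # 0 otherwise. Replace with the real episode, summing the per-timestep
    # rewards and thresholding the total into success/failure.
    p_success = np.exp(-0.01 * (params[0] - 10.0) ** 2
                       - (params[1] - 1.0) ** 2)
    return int(rng.random() < p_success)

policy = BootstrappedUCB(LogisticRegression(), nchoices=len(param_menu))

X_hist, a_hist, r_hist = [], [], []
for episode in range(200):
    hole_xy = rng.uniform(-0.01, 0.01, size=2)    # context: estimated hole position
    if episode < 20:
        arm = int(rng.integers(len(param_menu)))  # random warm-up episodes
    else:
        arm = int(policy.predict(hole_xy.reshape(1, -1))[0])
    reward = run_episode(hole_xy, param_menu[arm])  # one parameter set per episode
    X_hist.append(hole_xy)
    a_hist.append(arm)
    r_hist.append(reward)
    # Refit on the full history after every episode; a streaming setup could
    # instead use partial_fit with a base classifier that supports it.
    policy.fit(np.array(X_hist), np.array(a_hist), np.array(r_hist))
```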

As I am more familiar with the RL approach, I was wondering if someone more experienced could advise whether using contextual bandits is the right way to go and, if so, which algorithm you would recommend. For simulation I use robosuite, which has a gym-like structure.

Thank you for your help.

david-cortes commented 2 years ago

I think you might have better luck asking in some stackexchange subsite.

That said, I don't think stateless bandits would apply to your problem: from what I gather, you only see a reward at the end of some button sequence, the length of that sequence can vary with the problem size, and you don't have any information about the positioning of the robot/peg/hole, so you don't have the components needed to model it as a bandit or contextual bandit problem.