Hmm, I am not an expert in contextual bandits, but I do not see why it could not be applied. However, there are probably wayyyy better and cleaner solutions for learning in different bandit setups than full-blown DRL algorithms. I will let others give better comments.
Btw, in general, we do not have time to provide consulting/custom tech support for theoretical-ish questions like this. These issues are more for bug reports and enhancement proposals.
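As a rough illustration of what a lighter-weight alternative could look like (my own sketch, not an official recommendation): for a contextual bandit with continuous actions, plain REINFORCE on a linear-Gaussian policy with a running-mean baseline is often enough, with no clipping, GAE, or value bootstrapping. The bandit function `pull` below is a hypothetical stand-in for the real environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def pull(context, action):
    # Hypothetical bandit: reward is higher the closer the action
    # lands to a context-dependent optimum (here, 2 * context).
    return -np.linalg.norm(action - 2.0 * context)

ctx_dim, act_dim = 2, 2
W = np.zeros((act_dim, ctx_dim))  # linear policy mean: mu = W @ context
log_std = np.zeros(act_dim)       # exploration noise, kept fixed for brevity
lr, baseline = 0.05, 0.0

for step in range(5000):
    context = rng.uniform(-1.0, 1.0, size=ctx_dim)
    mu = W @ context
    std = np.exp(log_std)
    action = mu + std * rng.standard_normal(act_dim)
    reward = pull(context, action)
    # REINFORCE with a running-mean baseline to reduce variance.
    adv = reward - baseline
    baseline += 0.01 * (reward - baseline)
    # d log N(action; mu, std^2) / d mu = (action - mu) / std^2
    grad_mu = (action - mu) / std**2
    W += lr * adv * np.outer(grad_mu, context)
```

The fixed log-std keeps the example short; in practice one would also adapt the exploration noise.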
Important Note: We do not do technical support or consulting, and we don't answer personal questions via email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.
Question
Task description
Setup: A robot placed on a table, a board with a cylindrical hole resting on the table, and a cylindrical peg.
Task: Use the robot to insert the peg into the cylindrical hole in the board, despite a small error in the assumed location of the hole.
Description: To accomplish the task, the parameters of the controller need to be learnt. The goal is to learn one set of parameters per episode that can solve the problem for a given radius of the peg and hole, with an ability to generalise to other sizes. Because only one set of controller parameters is learnt, the problem is not an RL problem but rather a contextual bandit problem. The states are not fed to the policy at each timestep; instead, the context (the position of the hole) is fed to the policy only at the beginning of each episode. Given the context, the policy outputs actions (the parameters of the controller), which are used throughout the episode. During the episode the reward is calculated at each timestep, and at the end of the episode the rewards are summed and saved together with the context in the rollout buffer (PPO).

Question: Can I use PPO with a discount factor of 0 and a modified environment in which the actions from the policy are used only once per episode to solve this contextual bandit problem with stable-baselines3, or do I need another Python package made exclusively for contextual bandits?
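For what it's worth, here is a minimal sketch of the setup I have in mind, assuming SB3 >= 2.0 with the Gymnasium API. `PegInHoleBanditEnv` is a hypothetical placeholder: the observation is the context (the hole position), the action is the full vector of controller parameters, the inner control loop runs entirely inside `step()`, and each episode is a single transition whose reward is the sum of the per-timestep rewards. The observation/action bounds and the 4-parameter controller are made-up stand-ins for the real simulator.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO


class PegInHoleBanditEnv(gym.Env):
    """Hypothetical one-step env: each episode is a single bandit round.

    Observation = context (hole position), action = controller
    parameters, reward = sum of per-timestep rewards from running the
    controller for a full low-level episode.
    """

    def __init__(self, horizon=200):
        super().__init__()
        self.horizon = horizon
        # Context: 2-D hole position on the table (assumed bounds).
        self.observation_space = spaces.Box(-0.1, 0.1, shape=(2,), dtype=np.float32)
        # Action: e.g. 4 controller gains (assumed parameterization).
        self.action_space = spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self._context = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._context = self.observation_space.sample()  # new hole position
        return self._context, {}

    def step(self, action):
        # Placeholder for the real inner loop: run the controller with the
        # chosen parameters for `horizon` timesteps and accumulate reward.
        total_reward = 0.0
        for _ in range(self.horizon):
            total_reward += self._inner_reward(action)
        # One-step episode: terminated=True after the single bandit decision.
        return self._context, total_reward, True, False, {}

    def _inner_reward(self, action):
        # Dummy stand-in for the simulator's per-timestep reward.
        return -float(np.linalg.norm(action[:2] - self._context))


env = PegInHoleBanditEnv()
# gamma has no effect with one-step episodes (nothing to bootstrap from),
# but setting it to 0 makes the bandit interpretation explicit.
model = PPO("MlpPolicy", env, gamma=0.0, verbose=1)
model.learn(total_timesteps=10_000)
```

With single-step terminal episodes, PPO's advantage reduces to the episode reward minus the learned value of the context, i.e. a contextual-bandit policy gradient with a baseline.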
Additional context
I asked a somewhat similar question here, but I do not want to use evolution strategies: https://github.com/DLR-RM/stable-baselines3/issues/617#issue-1031150739