Hmm, I am not an expert in contextual bandits, but I do not see why it could not be applied. However, there are probably wayyyy better and cleaner solutions for learning in different bandit setups than full-blown DRL algorithms. I will let others give better comments.
Btw, in general, we do not have time to provide consulting/custom tech support for theoretical-ish questions like this. These issues are more for bug reports and enhancement proposals.
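As a rough illustration of what a lighter-weight alternative could look like (my own sketch, not an official recommendation): for a contextual bandit with continuous actions, plain REINFORCE on a linear-Gaussian policy with a running-mean baseline is often enough, with no clipping, GAE, or value bootstrapping. The bandit function `pull` below is a hypothetical stand-in for the real environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def pull(context, action):
    # Hypothetical bandit: reward is higher the closer the action
    # lands to a context-dependent optimum (here, 2 * context).
    return -np.linalg.norm(action - 2.0 * context)

ctx_dim, act_dim = 2, 2
W = np.zeros((act_dim, ctx_dim))  # linear policy mean: mu = W @ context
log_std = np.zeros(act_dim)       # exploration noise, kept fixed for brevity
lr, baseline = 0.05, 0.0

for step in range(5000):
    context = rng.uniform(-1.0, 1.0, size=ctx_dim)
    mu = W @ context
    std = np.exp(log_std)
    action = mu + std * rng.standard_normal(act_dim)
    reward = pull(context, action)
    # REINFORCE with a running-mean baseline to reduce variance.
    adv = reward - baseline
    baseline += 0.01 * (reward - baseline)
    # d log N(action; mu, std^2) / d mu = (action - mu) / std^2
    grad_mu = (action - mu) / std**2
    W += lr * adv * np.outer(grad_mu, context)
```

The fixed log-std keeps the example short; in practice one would also adapt the exploration noise.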
Important Note: We do not do technical support or consulting, and we don't answer personal questions via email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.
Question
Task description
Setup: A robot placed on a table, a board with a cylindrical hole resting on the table, and a cylindrical peg.
Task: Use the robot to insert the peg into the cylindrical hole in the board, despite a small error in the assumed location of the hole.
Description: To accomplish the task, the parameters of the controller need to be learnt. The goal is to learn one set of parameters per episode that can solve the problem for a given radius of the peg and hole, with an ability to generalise to other sizes. Because only one set of controller parameters is learnt, the problem is not an RL problem but rather a contextual bandit problem. The states are not fed to the policy at each timestep; instead, the context (the position of the hole) is fed to the policy only at the beginning of each episode. Given the context, the policy outputs actions (the parameters of the controller), which are used throughout the episode. During the episode the reward is calculated at each timestep, and at the end of the episode the rewards are summed and saved together with the context in the rollout buffer (PPO).

Question: Can I use PPO with a discount factor of 0 and a modified environment in which the actions from the policy are used only once per episode to solve this contextual bandit problem with stable-baselines3, or do I need another Python package made exclusively for contextual bandits?
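For what it's worth, here is a minimal sketch of the setup I have in mind, assuming SB3 >= 2.0 with the Gymnasium API. `PegInHoleBanditEnv` is a hypothetical placeholder: the observation is the context (the hole position), the action is the full vector of controller parameters, the inner control loop runs entirely inside `step()`, and each episode is a single transition whose reward is the sum of the per-timestep rewards. The observation/action bounds and the 4-parameter controller are made-up stand-ins for the real simulator.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO


class PegInHoleBanditEnv(gym.Env):
    """Hypothetical one-step env: each episode is a single bandit round.

    Observation = context (hole position), action = controller
    parameters, reward = sum of per-timestep rewards from running the
    controller for a full low-level episode.
    """

    def __init__(self, horizon=200):
        super().__init__()
        self.horizon = horizon
        # Context: 2-D hole position on the table (assumed bounds).
        self.observation_space = spaces.Box(-0.1, 0.1, shape=(2,), dtype=np.float32)
        # Action: e.g. 4 controller gains (assumed parameterization).
        self.action_space = spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self._context = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._context = self.observation_space.sample()  # new hole position
        return self._context, {}

    def step(self, action):
        # Placeholder for the real inner loop: run the controller with the
        # chosen parameters for `horizon` timesteps and accumulate reward.
        total_reward = 0.0
        for _ in range(self.horizon):
            total_reward += self._inner_reward(action)
        # One-step episode: terminated=True after the single bandit decision.
        return self._context, total_reward, True, False, {}

    def _inner_reward(self, action):
        # Dummy stand-in for the simulator's per-timestep reward.
        return -float(np.linalg.norm(action[:2] - self._context))


env = PegInHoleBanditEnv()
# gamma has no effect with one-step episodes (nothing to bootstrap from),
# but setting it to 0 makes the bandit interpretation explicit.
model = PPO("MlpPolicy", env, gamma=0.0, verbose=1)
model.learn(total_timesteps=10_000)
```

With single-step terminal episodes, PPO's advantage reduces to the episode reward minus the learned value of the context, i.e. a contextual-bandit policy gradient with a baseline.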
Additional context
I asked a somewhat similar question here, but I do not want to use evolution strategies: https://github.com/DLR-RM/stable-baselines3/issues/617#issue-1031150739