I took a look at the current code. The Webots-driven OpenAI Gym environment looks correct, but I don't follow the rest. What do we expect the agent to learn from constant forward motion? Why are we discretizing everything (and so coarsely)? What are we actually trying to learn here? We want to adapt or correct the user's policy, not learn a separate one that replaces them.
I'd strongly suggest we revisit the approaches we spent time studying. Both https://arxiv.org/pdf/1802.01744 and https://arxiv.org/pdf/2004.05097 are quite clear about what they are doing, and there is sample code for both: https://github.com/rddy/deepassist and https://github.com/cbschaff/rsa.
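To make the "correct, don't replace" point concrete, here is a minimal sketch of the residual-style setup those papers describe: the agent observes the state plus the user's proposed action and outputs a bounded correction that gets added to it, so the combined action still respects the user's intent. This is not code from either repo; the wrapper name, the `user_policy` callable, and `residual_scale` are placeholders, and it assumes a continuous Box action space and the classic 4-tuple Gym step API our Webots env already uses.

```python
import gym
import numpy as np


class ResidualAssistWrapper(gym.Wrapper):
    """Sketch: the RL agent learns a correction to the user's action,
    not a standalone policy (names and scaling here are assumptions)."""

    def __init__(self, env, user_policy, residual_scale=0.5):
        super().__init__(env)
        self.user_policy = user_policy        # callable: obs -> user's proposed action
        self.residual_scale = residual_scale  # caps how much the agent may override
        # Agent sees the raw observation plus the user's proposed action.
        low = np.concatenate([env.observation_space.low, env.action_space.low])
        high = np.concatenate([env.observation_space.high, env.action_space.high])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def _augment(self, obs):
        self._user_action = np.asarray(self.user_policy(obs), dtype=np.float32)
        return np.concatenate([obs, self._user_action]).astype(np.float32)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        return self._augment(obs)

    def step(self, residual):
        # Combine the user's action with the agent's bounded correction.
        action = self._user_action + self.residual_scale * np.asarray(residual)
        action = np.clip(action, self.env.action_space.low, self.env.action_space.high)
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info
```

Training any standard continuous-control algorithm inside this wrapper keeps the human in the loop by construction, which is the behaviour we want, rather than a coarsely discretized policy learned from constant forward motion.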