dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

Changing SARSA/Q-learning to deal with multiple environments #51

Open jackevan1 opened 7 years ago

jackevan1 commented 7 years ago

Hi, in most RL implementations, at the start of each episode the environment is reset to its initial state (in the SARSA code, for instance: state = env.reset()), i.e. the same start position and goal state. In other words, they learn a policy for a single given environment. But what about multiple environments at the same time?

More concretely, is it possible to apply SARSA/Q-learning in a scenario where we have multiple environments? For example, in a 5x5 grid world we could have the following two cases: env1: [0,0] is the start state/agent start position and [3,2] is the goal state; env2: [2,1] is the start state/agent start position and [1,4] is the goal state. At each episode, env1 and env2 would both be inputs to the main SARSA loop. Can the current version of SARSA/Q-learning be changed to learn policies for both environments at the same time? It is more like multi-task learning with RL.
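To make the question concrete, here is a minimal sketch of the kind of loop I have in mind (tabular Q-learning, a hypothetical list of Gym-style grid-world environments that share the same state and action spaces, and the environment resampled at each episode):

```python
import random
from collections import defaultdict

import numpy as np


def q_learning_multi_env(envs, num_episodes, alpha=0.5, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning where each episode runs in an environment drawn from `envs`."""
    n_actions = envs[0].action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))  # one Q-table shared across environments

    for _ in range(num_episodes):
        env = random.choice(envs)   # e.g. env1 or env2, each with its own start/goal
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # standard Q-learning update on the shared table
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q
```

One thing I am not sure about: with a plain grid coordinate as the state, the same cell may need different actions in env1 and env2, so presumably the goal position (or an environment ID) would have to be folded into the state for a single Q-table to learn both tasks.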

Any help would be appreciated.

Thanks, J
PS0: It is not really fair to call this an issue; it is more of an extension to the current implementations.
PS1: Thanks @dennybritz for your wonderful job of sharing the code. It is really helpful.

DanielTakeshi commented 7 years ago

It sounds like you might be interested in learning a policy that can solve a distribution of 5x5 grid-worlds? For instance, we can define the distribution to be any 5x5 grid-world with some random allotment of obstacles, then uniformly at random choose distinct start/goal points; what results is one sample from the MDP distribution. Off the top of my head, that seems like a job for algorithms such as Value Iteration Networks (VINs), which deal specifically with the concept of a "distribution of MDPs".
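To illustrate what one sample from that MDP distribution could look like, something like the following would do (my own sketch, not code from the VIN paper; the helper name and parameters are made up):

```python
import numpy as np


def sample_gridworld(size=5, obstacle_prob=0.2, rng=None):
    """Draw one grid-world MDP: random obstacles plus distinct free start/goal cells."""
    rng = rng or np.random.default_rng()
    grid = (rng.random((size, size)) < obstacle_prob).astype(int)  # 1 = obstacle

    free_cells = np.argwhere(grid == 0)
    # pick two different free cells for the agent start and the goal
    start_idx, goal_idx = rng.choice(len(free_cells), size=2, replace=False)
    start, goal = tuple(free_cells[start_idx]), tuple(free_cells[goal_idx])
    return grid, start, goal


# Each call is one sample from the "distribution of MDPs":
grid, start, goal = sample_gridworld()
```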

jackevan1 commented 7 years ago

Thanks for your answer, @DanielTakeshi :) I see that this paper is recently published, so what were people using before this paper? In addition, how a learned policy can be generalized to a new environment (not particularly with this paper but in general)? e.g in a grid world, from 5x5 to 7x7? If you can point me to any papers, that would be awesome. I am new to RL world, sorry if asking very simple questions, thanks again for your help.

DanielTakeshi commented 7 years ago

I think if you want something simpler than VINs, check out the first set of experiments in that paper on GridWorld. Specifically, they compare against two competing methods that perform well (though not as well as the VIN), and those baselines are a lot easier to understand and implement than VINs.
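For reference, a baseline "reactive policy" of that general flavor can be as small as a convolutional network that maps the grid rendered as an image (e.g. obstacle map, goal map, agent position channels) directly to action probabilities. The sketch below is my own stand-in, not the architecture from the paper, and the layer sizes are arbitrary:

```python
import tensorflow as tf


def build_reactive_policy(grid_size=5, n_channels=3, n_actions=4):
    """Small CNN mapping a grid 'image' (obstacles, goal, agent) to action probabilities."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               input_shape=(grid_size, grid_size, n_channels)),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="softmax"),
    ])
    # e.g. trained by imitation: cross-entropy against expert (shortest-path) actions
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
```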

jackevan1 commented 7 years ago

Thanks again for your help.