dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

DQN and Dyna-Q #7

Closed IbrahimSobh closed 7 years ago

IbrahimSobh commented 7 years ago

Hi Denny

Again, I do appreciate your work!

I was thinking of implementing DQN with the Dyna-Q algorithm, where Q(s, a) is updated not only by real experience but also by simulated experience generated from a model M.

Dyna-Q: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/dyna.pdf (slide 27)
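For reference, a minimal sketch of tabular Dyna-Q along the lines of that slide (just a sketch: the hyperparameters, the deterministic-model assumption, and the old-style gym interface with discrete, hashable states are my own simplifications):

```python
import random
from collections import defaultdict

def dyna_q(env, num_episodes, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q sketch: learn from each real transition, then perform
    n_planning extra updates from transitions replayed out of a learned
    (deterministic) model of the environment."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: defaultdict(float))   # Q[state][action]
    model = {}                                    # model[(s, a)] = (r, s_next, done)

    def greedy(s):
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            a = random.randrange(n_actions) if random.random() < epsilon else greedy(s)
            s_next, r, done, _ = env.step(a)

            # (a) direct RL: Q-learning update from real experience
            target = r if done else r + gamma * Q[s_next][greedy(s_next)]
            Q[s][a] += alpha * (target - Q[s][a])

            # (b) model learning (assumes a deterministic environment)
            model[(s, a)] = (r, s_next, done)

            # (c) planning: n_planning updates from simulated experience
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * Q[ps_next][greedy(ps_next)]
                Q[ps][pa] += alpha * (ptarget - Q[ps][pa])

            s = s_next
    return Q
```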

However, I think it will be hard to train a model M for the environment; this may require a trainable function approximator for the model (a nonlinear neural network, for example).

My idea: instead of having a model M that generates simulated experiences, we can simply use the real experience of another parallel agent!

Then, two agents will help each other by providing experience.

This is inspired by A3C, which uses multiple agents on different threads to explore the state space and make de-correlated updates to the actor and the critic.

Do you think this is a good/new/simple idea that may speed up training and use "real simulations" without even having a model of the environment?

Your opinion is very important

Thank you

dennybritz commented 7 years ago

Like you said, this sounds pretty similar to A3C. What exactly is the difference in your approach?

One benefit of learning a model is that you can generate experience much faster than an agent that interacts with the real world.

IbrahimSobh commented 7 years ago

Thank you, Denny

A3C:

1. Multiple parallel agents cooperate to update the parameters of one model.
2. The model is trained, I think, in a way very similar to this: https://www.tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html#training-a-model-using-multiple-gpu-cards, however on CPU.
3. At the end, we have one good trained model.

(correct?)
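If it helps, here is a toy sketch of the parameter-sharing pattern I mean (this is not A3C itself; the random "gradient" is only a placeholder for a real actor-critic gradient computed from experience):

```python
import threading
import numpy as np

# One shared parameter vector that all workers update asynchronously.
shared_params = np.zeros(4)
lock = threading.Lock()

def worker(worker_id, steps=100, lr=0.01):
    rng = np.random.RandomState(worker_id)
    for _ in range(steps):
        local_params = shared_params.copy()              # sync a local snapshot
        grad = rng.standard_normal(local_params.shape)   # placeholder "gradient"
        with lock:
            # apply the update in place to the shared parameters
            shared_params[:] = shared_params - lr * grad

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_params)
```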

My approach: I liked the idea of building a model, then generating simulated experience from that model, and using these simulations (hallucinations) to enhance the learning process. Inspired by the Dyna-Q algorithm, I was thinking: instead of learning a model of the environment (which could be a Neural Network FA, correct?) and generating simulations, why not take a shortcut and just use other agents' experience as if it were the simulations? In this case:

1. We have multiple agents (like A3C).
2. Each agent has its own model (unlike A3C); there is no shared model.
3. The only thing shared between agents is the experience, provided as simulated experience.

In the two-agent (A, B) case: agent A's real experience is provided as agent B's simulated experience, and agent B's real experience is provided as agent A's simulated experience.
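A minimal sketch of what I mean, using simple tabular Q-learning agents just to keep it self-contained (the agent class, environment copies, and hyperparameters are only illustrative; in practice these would be DQN agents):

```python
import random
from collections import defaultdict

class TabularAgent:
    """Minimal Q-learning agent with its own value table (illustrative only)."""
    def __init__(self, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.Q = defaultdict(lambda: defaultdict(float))
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, s):
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.Q[s][a])

    def update(self, s, a, r, s_next, done):
        best_next = max(self.Q[s_next][b] for b in range(self.n_actions))
        target = r if done else r + self.gamma * best_next
        self.Q[s][a] += self.alpha * (target - self.Q[s][a])

def train_pair(env_a, env_b, agent_a, agent_b, num_steps=10000):
    """Each agent learns from its own real transition and, in place of the
    Dyna-style planning step, from the other agent's real transition."""
    s_a, s_b = env_a.reset(), env_b.reset()
    for _ in range(num_steps):
        # each agent takes one real step in its own copy of the environment
        a_a, a_b = agent_a.act(s_a), agent_b.act(s_b)
        ns_a, r_a, done_a, _ = env_a.step(a_a)
        ns_b, r_b, done_b, _ = env_b.step(a_b)

        # direct updates from each agent's own experience
        agent_a.update(s_a, a_a, r_a, ns_a, done_a)
        agent_b.update(s_b, a_b, r_b, ns_b, done_b)

        # cross updates: the partner's real experience plays the role of
        # "simulated" experience generated by a learned model
        agent_a.update(s_b, a_b, r_b, ns_b, done_b)
        agent_b.update(s_a, a_a, r_a, ns_a, done_a)

        s_a = env_a.reset() if done_a else ns_a
        s_b = env_b.reset() if done_b else ns_b
    return agent_a, agent_b
```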

The key advantages of my approach:

1. It could be much simpler than A3C.
2. At the end we will have multiple good agents, not just one (which we could then ensemble if we like).
3. These agents could have different deep model architectures and different reinforcement learning settings.

The key disadvantage (this looks insane!): Generally speaking, I do not see any benefit at all of using planning and simulated experience in the case of software games! Simulations are cheap to produce, with multi-threading for example. However, in the case of physical robots, it could be costly not to simulate. (correct?)

On the other hand, the real benefit of planning and simulation may come from TDTS (TD tree search) or Dyna-2, where the agent performs a "short-term" tree search (based on the learned model). The search starts from the agent's current situation and looks for the best thing to do in the real environment.
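Roughly, I imagine something like this short-horizon lookahead from the current state (assuming a one-step model and a value estimate have already been learned somehow; this is not Dyna-2 or TDTS itself, just the general shape):

```python
def lookahead_action(state, actions, model, value_fn, depth=3, gamma=0.99):
    """Pick the action with the best simulated return under a learned model.

    model(s, a) -> (reward, next_state)   # assumed one-step learned model
    value_fn(s) -> float                  # assumed learned value estimate
    """
    def rollout(s, a, d):
        r, s_next = model(s, a)
        if d == 1:
            return r + gamma * value_fn(s_next)
        # greedy recursion: follow the best action under the model
        return r + gamma * max(rollout(s_next, a2, d - 1) for a2 in actions)

    return max(actions, key=lambda a: rollout(state, a, depth))
```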

Finally (sorry, I know this is a very long comment!):

  1. Are the advantages of my approach mentioned above real, or are they useless?
  2. If they are useless, is Dyna-2 then the best way to make DQN use planning? (a new algorithm we could call PDQN, for Planned DQN?)

Thank you so much in advance, I do appreciate your answers and comments.

dennybritz commented 7 years ago

I think there may be some confusion about the term "model" here. In RL, a "model" means an explicit representation of the world (i.e. an MDP), but in the rest of Machine Learning a "model" is equivalent to what's called a Function Approximator in RL.

When you say that in A3C the agents update a shared model, that's not correct. There is no "model" in the RL sense in A3C at all.

Instead of learning a model of the environment (which could be a Neural Network FA, correct?)

Yes.

Why not take a shortcut and just use other agents' experience as if it were the simulations?

I'm not sure I understand the benefit of this. In A3C you are already using all of the other agents' experience. It's all real experience; there is no benefit you gain by treating it as "simulated" experience. Real experience is strictly better than simulated experience.

Generally speaking, I do not see any benefit at all of using planning and simulated experience in the case of software games! Simulations are cheap to produce, with multi-threading for example. However, in the case of physical robots, it could be costly not to simulate. (correct?)

Yes, that's one of the benefits of building a model of the environment. But even in software it may be helpful to have a model, e.g. when the environment "rules" are very simple to learn, but the value function is hard to learn.

I'm sure there are interesting new ways to combine planning and DQN though, so it's good to think about this.

IbrahimSobh commented 7 years ago

Thank you

And sorry that I mixed up the terminology.

I totally understand that the model in RL means the MDP model of the environment.

In A3C, what I wanted to say is that all agents update the same FA parameters, as in this link: https://www.tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html#training-a-model-using-multiple-gpu-cards (correct?)

In A3C you are already using all of the other agents' experience. It's all real experience; there is no benefit you gain by treating it as "simulated" experience.

I agree! I do not know how to beat A3C! Is it too good to be defeated?!

I'm sure there are interesting new ways to combine planning and DQN though, so it's good to think about this.

I am trying to find some contribution in this direction; could you please give me a clue or two?

Thank you again for your comments

fferreres commented 7 years ago

A3C is just a way to speed up finding a value function by using multi-threading. All the agents "report back" to the mothership, in a sense. Since each one also follows a different path, they address, in a sense, the need to make one state different from the next (decorrelation), which helps avoid overfitting(?) or otherwise suboptimal results on the test set.

"Beating" A3C is really undefined; it can't be known without specifying what you want to do. Also, in a game you may not need to plan: after all, you have the game and know the rules. Some games, however, may not provide the model (watch the presentation by David Silver at Google that Denny links to in the readme.md). For example, in Atari games the game doesn't tell you where things will appear, or what will happen if you take Action_t (what is the next screenshot, exactly?).

But regardless of this, if it's an adversarial game about exploration and you play against other people, your AI can be your advisor and can plan based on everything you have learned. So even in simulated games, it may make sense to build a model when the environment doesn't provide one. And even if you are in a simulation and it is not adversarial, it may still be important to build a model if the actual experience requires expensive computation; the model will reuse whatever it already has, so you can think of the model as a "type of cache". But all of this has nothing to do with A3C.