Open Joshuaalbert opened 6 years ago
Yep, I thought along similar lines, but it turned out that the code seems to work correctly.
I had another thought: isn't the target network unnecessary for this notebook in the first place? You set the target network equal to the mainDQN right before training, so the bootstrap targets are never actually held fixed, which defeats the purpose of having a second network.
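For what it's worth, here's a minimal sketch of the pattern I'd expect instead, where the target copy is synced only periodically (names like `mainDQN`, `train_on_batch`, `total_steps`, and the 10,000-step period are hypothetical placeholders, not the notebook's code):

```python
import copy

# Hypothetical training loop: the target network only stabilizes the
# bootstrap target y = r + gamma * max_a' Q_target(s', a') if its weights
# are held fixed for many steps, not refreshed right before each update.
targetDQN = copy.deepcopy(mainDQN)          # one-time initial copy
for step in range(total_steps):
    train_on_batch(mainDQN, targetDQN)      # gradients flow into mainDQN only
    if step % 10_000 == 0:                  # sync periodically, not every step
        targetDQN.load_state_dict(mainDQN.state_dict())
```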
In the notebook I don't see where your recurrent Q-value model gets its trace dimension: you're just reshaping the output of a convnet and feeding it directly into the LSTM. Furthermore, shouldn't you also provide the non-zero initial state determined at play time? That is, the LSTM's internal state should be stored in the experience buffer and reused during training. Correct me if I'm wrong, please.
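To illustrate what I mean, here's a minimal sketch (in PyTorch for brevity; the notebook is TensorFlow, and all names and sizes here are hypothetical) of restoring an explicit trace dimension after the convnet and seeding the LSTM with the state saved at play time:

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Hypothetical sketch: conv features are reshaped to
    [batch, trace_len, features] before the LSTM, and the recurrent
    state stored in the replay buffer is passed back in at train time."""
    def __init__(self, n_actions, feat_dim=3136, hidden=512):
        super().__init__()
        # Nature-DQN conv stack on single 84x84 frames -> 64*7*7 = 3136 features
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs, state=None):
        # obs: [batch, trace_len, 1, 84, 84]; state: (h0, c0) from the buffer
        b, t = obs.shape[:2]
        feats = self.conv(obs.view(b * t, *obs.shape[2:]))  # conv each frame
        feats = feats.view(b, t, -1)                        # restore trace dim
        out, new_state = self.lstm(feats, state)            # seed with stored state
        return self.head(out), new_state                    # per-step Q-values
```

At play time you'd carry `new_state` forward step by step and save it alongside each transition; at train time you'd pass the saved state in as `state` instead of letting it default to zeros.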