JasAva opened this issue 5 years ago

First of all, thanks for the amazing implementation; it's been really helpful for understanding DQN. I'm curious about the Space Invaders results: the reported average is 2772, while the original PER paper shows 9063. The other three games match or outperform the original results (really good job! :)) :+1: Can you share any insight into why Space Invaders doesn't match, or should it be compared against a different result in the original paper? Thanks! :)

I also tried to train agents (train.py) for BeamRider and Enduro, but after around 5e5 timesteps the agents don't seem to have learned anything. Based on my experience with vanilla DQN, the average reward should have climbed to a reasonable level by that point.
BeamRider is stuck at an average of 240+, while Enduro is stuck at 0.0, which is basically random play. I didn't alter any parameters and used TensorFlow 1.8.0.
Since the pre-trained agents are really powerful, is there any chance the wrong version of the code was released? (I tried both master and tag 2.0.1.)
I did observe these warnings when building the models:

WARNING:tensorflow:Output "dense_1" missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to "dense_1".
Building model.
WARNING:tensorflow:Output "dense_3" missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to "dense_3".
Copying weights to target.
I'd really appreciate it if you could look into this; I'm eager to try the PER DQN. Thanks!
Hi @JasAva, and thanks for the feedback. Nice to know that someone is getting some use out of the code. It sure took a long time to get it up and running!
Unfortunately, I don't have a good reason for the Space Invaders performance, which is similar to the original DQN paper (1976 +- 893). My implementation doesn't exactly match the PER paper: some of the hyper-parameters differ, and the loss function is also different (as mentioned in the README.md file). I assume that those differences account for the difference in score, but it's entirely possible that it's something else. If I get a chance, I'll investigate the difference further, but I haven't had much time to work on this project this year. My implementation performed pretty well on the four games I tested it on, and I kind of left it at that =)
Regarding training, 500K frames simply isn't enough time. In the original DQN implementation, Playing Atari with Deep Reinforcement Learning, a much smaller network was used (16x32x256). Further, the agents were trained over 50 million frames. In DeepMind's Nature publication, Human-level control through deep reinforcement learning, the network size is 32x64x64x512--much larger--and the training length is quadrupled to 200 million frames. My implementation uses the larger network, and so the training takes a long time.
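For a concrete picture of that larger network, here's a minimal Keras sketch of the Nature-style architecture (32, 64, and 64 convolutional filters followed by a 512-unit dense layer). The layer sizes come from the paper; preprocessing, initializers, and optimizer settings in this repo may differ.

```python
# Sketch of the Nature DQN architecture, assuming 84x84x4 stacked grayscale
# frames as input. Illustrative only; the repo's actual model code may differ
# in details such as initialization and normalization.
from keras.models import Model
from keras.layers import Input, Conv2D, Flatten, Dense

def build_nature_dqn(num_actions):
    frames = Input(shape=(84, 84, 4), name="frames")
    x = Conv2D(32, 8, strides=4, activation="relu")(frames)  # 32 filters, 8x8, stride 4
    x = Conv2D(64, 4, strides=2, activation="relu")(x)        # 64 filters, 4x4, stride 2
    x = Conv2D(64, 3, strides=1, activation="relu")(x)        # 64 filters, 3x3, stride 1
    x = Flatten()(x)
    x = Dense(512, activation="relu")(x)                      # 512 fully-connected units
    q_values = Dense(num_actions, activation="linear")(x)     # one Q-value per action
    return Model(inputs=frames, outputs=q_values)
```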
For some specific numbers, an epsilon-greedy approach is used for exploration. Epsilon decays from 1 to .1 over the first million frames, then from .1 to .01 over the next 24 million frames. This strategy was employed by OpenAI (https://openai.com/blog/openai-baselines-dqn/) and resulted in an improvement. But given that annealing strategy, you're not going to see decent scores until 25 million frames or so (my implementation starts testing the model after 20 million frames). Take a look at the hyper-parameters defined here: https://github.com/benbotto/bsy-dqn-atari/blob/master/agent/trainer_agent.py (lines 32-84).
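As a rough sketch of that two-stage schedule (the exact constants live in trainer_agent.py; these are just the numbers quoted above):

```python
def epsilon_at(frame):
    """Piecewise-linear decay: 1.0 -> 0.1 over the first 1M frames,
    then 0.1 -> 0.01 over the next 24M frames, then constant."""
    if frame < 1000000:
        return 1.0 + (0.1 - 1.0) * (frame / 1000000.0)
    if frame < 25000000:
        return 0.1 + (0.01 - 0.1) * ((frame - 1000000) / 24000000.0)
    return 0.01
```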
As far as versioning goes, 2.0.1 is the correct tag. That's what I used to generate all the reported numbers, and it's the most recent stable version. I try to follow semantic versioning (semver) in all of my projects.
And lastly, regarding the warning, it's expected. The warning comes up because of the way the importance sampling weights are passed through the network (https://github.com/benbotto/bsy-dqn-atari/blob/master/agent/trainer_agent.py). There are further details on StackOverflow about the approach I used, here: https://stackoverflow.com/questions/50124158/keras-loss-function-with-additional-dynamic-parameter
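The general shape of that approach looks something like the sketch below: the IS weights come into the trainable model as an extra input, and the custom loss closes over that tensor. This is a simplified illustration of the Stack Overflow pattern, not the repo's exact code; the small dense network and MSE loss here are placeholders.

```python
from keras.models import Model
from keras.layers import Input, Dense
import keras.backend as K

def make_weighted_loss(is_weights):
    # The loss closes over the IS-weight input tensor, so each sample's
    # error is scaled by its importance-sampling weight.
    def loss(y_true, y_pred):
        per_sample = K.mean(K.square(y_true - y_pred), axis=-1)
        return K.mean(K.flatten(is_weights) * per_sample)
    return loss

states = Input(shape=(4,), name="states")           # placeholder state size
is_weights = Input(shape=(1,), name="is_weights")   # one IS weight per sample
hidden = Dense(32, activation="relu")(states)
q_values = Dense(2, name="q_values")(hidden)        # placeholder action count

train_model = Model(inputs=[states, is_weights], outputs=q_values)
train_model.compile(optimizer="adam", loss=make_weighted_loss(is_weights))
# Training then passes the weights alongside the states:
# train_model.fit([state_batch, weight_batch], target_batch)
```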
@benbotto Really appreciate the quick and thorough responses.
As you suggested, I let the agents train overnight. Now, at around 3.5e6 timesteps, the three agents (Breakout, Enduro, BeamRider) all show increasing average scores. I'll keep training.
The code you wrote is really amazing. I have tried multiple PER implementations I found online and none of them worked well, so I'll keep reading your code. Just curious, how long did it take you to train a complete agent? I've noticed that training becomes really slow after about 10 hours (it was fast at the beginning). From my understanding, learning already starts after 50,000 timesteps; is the slowdown because of the sum-tree structure?
I tried to implement PER by combining this PyTorch implementation (https://github.com/TianhongDai/reinforcement-learning-algorithms) with the OpenAI Baselines PER (https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py). I modified the OpenAI code, but sadly it didn't work. However, that training process was much faster, and its speed stayed constant.
Sorry if I post too many questions. :)
I also have a question regarding the TD error. I see you clip the gradients via the Huber loss, which means the priority can be larger than 1. In the original paper, however, the TD error itself seems to be clipped (I'm not sure), which would mean the priority is always at most 1. Do you think this numerical difference affects the priority updates? I see that in OpenAI Baselines they don't actually clip the error either; it's just noted as potentially clipped. (https://github.com/openai/baselines/blob/fa37beb52e867dd7bd9ae6cdeb3a5e64f22f8546/baselines/deepq/build_graph.py#L413)
Training for 200e6 frames takes about 30 days on my rig, which is a Core i7 with an Nvidia GTX 1080 (the 11 GB version). Training gets slower and slower up until 25e6 frames, at which point the speed becomes relatively constant. That's because 25e6 is when epsilon finishes decaying, and hence the most predictions have to be made. Also note that the code starts testing the model at 20e6 frames, and testing takes a while.
On the topic of speed, if I were to rewrite this thing from scratch, I probably would not use Keras. Instead, I would use raw Tensorflow and move the bulk of the training algorithm to the GPU. I think that would speed things up quite a bit, although it would add complexity. Basically, this loop could be offloaded to the GPU: https://github.com/benbotto/bsy-dqn-atari/blob/master/agent/trainer_agent.py#L173
The learning does start at 50e3 timesteps, yes. The delay isn't really because of the sum-tree; it just gives the agent time to seed the replay memory, storage structure aside. The number is fairly arbitrary and doesn't make much difference in the grand scheme of things.
I, too, had little luck with other PER projects. I reported bugs against a number of them, including OpenAI (https://github.com/takoika/PrioritizedExperienceReplay/issues/2 and https://github.com/openai/baselines/issues/176, to name a couple). My main complaint about OpenAI is that they publish results but don't tag their releases so that others can reproduce them. Then, later down the road, they break something in their code and don't take the time to fix it. Plus, their code is horribly difficult to follow (code written by professors =)).
I'm not sure I'm following the question about TD errors correctly, so I apologize if I'm off base, but there seems to be some confusion between priority and loss.
Priority is based on error: if the network predicts that the score for a sample will be 2 and the actual score is 7, then the error is 5. The priority is then 5^.6 (alpha is fixed to .6). That priority is then used when pulling random samples, and the probability of selecting each sample is relative to the sum of all priorities. So, depending on the game, the priorities may range from 0.01 to 1, or they may range from 100 to 50e6. It doesn't really matter since the sampling is relative.
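In code, that sampling step boils down to something like this stripped-down sketch (using a plain normalized array for clarity; the actual replay memory uses a sum-tree so that sampling and priority updates stay O(log n)):

```python
import numpy as np

def sample_batch(errors, batch_size, alpha=0.6):
    """Pick replay indices with probability proportional to |error| ** alpha."""
    priorities = np.abs(errors) ** alpha        # e.g. an error of 5 -> priority 5 ** 0.6
    probs = priorities / priorities.sum()       # only relative size matters
    indices = np.random.choice(len(errors), size=batch_size, p=probs)
    return indices, probs
```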
Then there's the importance sampling weights (IS weights). For each sample, an IS weight is calculated based on the probability of picking that sample. Low-priority samples thus have IS weights near 1, whereas high-priority samples have IS weights near 0. These IS weights are passed through the network with the samples, handed off to the loss function, and then multiplied by the error. This effectively makes the error smaller in such a way that samples with high priority adjust the network weights very slowly.
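And the matching IS-weight computation, in the same stripped-down spirit (beta here is the PER paper's starting value and is typically annealed toward 1 over training; the repo's exact constants may differ):

```python
import numpy as np

def importance_weights(probs, indices, beta=0.4):
    """w_i = (N * P(i)) ** -beta, normalized by the max so the largest weight is 1."""
    n = len(probs)
    weights = (n * probs[indices]) ** -beta   # low-priority (rarely sampled) -> larger weight
    return weights / weights.max()            # high-priority samples end up with weights near 0
```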
All of that happens prior to clipping: there's an error (actual - predicted) multiplied by an importance weight. If the error is in the range -.5 to .5, then squared error is used to adjust the network weights; outside of that range, the adjustment is always 1 or -1. That's the derivative of the Huber loss function (quadratic for small errors, linear for large ones).
In other words, the gradient will always be in the range [-1, 1]. In the beginning of training, the priorities vary wildly because the network makes poor predictions. But later, toward convergence, the priorities tend to be more uniform.
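To make the clipping concrete, the per-sample update signal behaves roughly like the sketch below: the Huber gradient keeps the error in [-1, 1], and the IS weight then scales it. This is a schematic of the behavior described above, not the repo's exact loss code.

```python
import numpy as np

def update_signal(predicted, actual, is_weight):
    """Schematic per-sample adjustment: the Huber gradient caps the error
    at +/- 1, and the IS weight scales it down for high-priority samples."""
    error = actual - predicted
    return is_weight * np.clip(error, -1.0, 1.0)
```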
I have an answer on StackExchange that talks about PER and IS weights: https://datascience.stackexchange.com/questions/32873/prioritized-replay-what-does-importance-sampling-really-do/33431#33431 That might also be helpful.
@benbotto Thank you very much for the thorough explanations, I'll keep reading the code and try the training. I'll let you know if I have further questions. Thanks again :)