Open nerdoid opened 8 years ago
I've been seeing the same and I haven't quite figured out what's wrong yet (debugging turnaround time is quite long). I recently implemented A3C and also saw results peak earlier than they should.
The solutions share some of the code so right now my best guess is that it could be something related to the image preprocessing code or the way consecutive states are stacked before being fed to the CNN. Perhaps I'm doing something wrong there and throwing away information.
If you have time to look into that of course it'd be super appreciated.
I saw the same on a3c after 1.9 million steps:
And attributed it to the fact that I only used 4 learners (2 are hyper threads, only 2 real cores). In the paper, the training is orders of magnitude slower due to less stochasticity from fewer agents. The variance remained huge too, and it plateaus maybe too quickly in about 38 of score.
I started reviewing DQN first (though I got a high level view of how A3c is implemented too and will come back to it later), and made some dumb questions so far, as I was seeing the same level of performance.
One thing that would help diagnose performance is building the graph so we can inspect it in TensorBoard. Other than that, it could be many things. The most impactful is exactly how the backdrop is behaving, and (for me) if the Gather operation and the way the optimization is requested result in exactly what is wanted. Probably, it is, but that's where I am looking now.
Thanks for helping out.
The most impactful is exactly how the backdrop is behaving, and (for me) if the Gather operation and the way the optimization is requested result in exactly what is wanted. Probably, it is, but that's where I am looking now.
If the backprop calculation was wrong I think the model wouldn't learn at all. It seems unlikely that it starts learning but bottoms out with an incorrect gradient. Unlikely, but not impossible. It's definitely worth looking into, but it'd be harder to debug than looking at the preprocessing. A simple thing to try would be to multiply with a mask matrix (a one hot vector) instead of using tf.gather
.
Adding the graph to Tensorboard should be simple. It can be passed to the saver that is created in the code, e.g. tf.train.SummaryWriter(..., sess.graph)
.
Thanks for the responses.
So I manually performed some environment steps and displayed their preprocessed forms. An example is below. It looks OK to me. The paper also mentions taking the max of the current and previous frames for each pixel location in order to handle flickering in some games. It seems to be unnecessary here since I didn't observe the ball disappearing between one step and the next.
I'm happy to check out the stacking code next. Looking at TensorFlow's Convolution documentation, is it possible that information is being lost by passing the state histories to conv2d
as channels? Depending on how conv2d
combines channel filters, maybe history is getting reduced to one state in the first convolutional layer. Maybe depthwise_conv2d
is relevant or, alternatively, flattening the stack prior to feeding in to the network. I'm more comfortable with RNNs than with CNNs, so I could be off base here though :)
Depending on how conv2d combines channel filters, maybe history is getting reduced to one state in the first convolutional layer. Maybe depthwise_conv2d is relevant or, alternatively, flattening the stack prior to feeding in to the network. I'm more comfortable with RNNs than with CNNs, so I could be off base here though :)
I think you are right in that the history is being reduced to one representation after the first convolutional layer, but I think that's how it's supposed to be - You want to combine all the historic frames to create a new representation. If you used depthwise_conv2d
you would never combine information across frames.
I do agree that it seems likely that something is wrong in the CNN though... It could be a missing bias term (there is no bias initializer also in the fully connected layer), wrong initializer, etc..
I just looked at the paper again:
The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.
In my code the final hidden layer does not have a ReLU activation (or bias term). Maybe that's it?
Nice catch. That does seem like it would limit the network's expressiveness. Unless you already have something running, I'll go ahead and fire up an instance and give it a shot with ReLU on the hidden fully connected layer. Can also add biases to both fully connected layers (probably initialized to zero?).
I'm not running it right now so it'd be great if you could try. In my experience the biases never helped much but adding them won't hurt. Zero-initialized should be fine.
1) I don't know how to add the graph the right way to review with TensorBoard, since at tf.train.SummaryWriter(..., sess.graph)
time the session isn't created yet (but much later). If I try to change things I'll refactor the wrong way.
2) Some other things that can affect performance and that are discussed in several forums as things that have been done by the DeepMind team:
3) I really suspect the backprop is causing most of the trouble, and there are two potential problems with it.
The first (a) is regarding the Gather operation instead of the one_hot (that you mentioned, Danny), the second (b) is in asking the optimizer to think of vars in relation to multiple different examples (instead of vars we want fixed in general across an entire mini batch, which is usually how it will "think", that the same rules apply to the entire mini-batch).
a) The Gather operation operates very differently than a one hot. The Gather tells the optimizer to disregard completely the impact on the actions not taken as if they didn't affect the total cost. And it results in updates that disregard that we have placed a lot of effort on fine tuning the other variables because on the next state they may be the right action, and we want Q to provide a good estimate of those state, action values. But with Gather (instead of one hot) the optimizer reduces the cost in one action, without balancing the need to not change the value estimates for the other actions. So, regardless of how Gather works, a one hot is needed so that when the optimizer starts looking at the derivatives, it takes into account the other actions cost, and thus tries to find the direction which improves the current action cost (lowering it) while affecting the other action values (ie. current output value) as little as possible.
b) The other is regarding the mini-batch. I am not sure how TensorFlow mini-batch works, but it may be possible that it's not designed to have Variables corresponding to different examples within a single Batch. Another way to say it is that I think mini-batch assumes you don't change the variables to be fixed between examples within a batch, and one way to look at it is to notice how the optimizer always works across Trainable Vars, and the List of Variables that minimize takes as argument, most likely expect to be vars used across the mini match for each example. eg. you may be optimizing only x1 and not x2. But may not work if you want to optimize x1 in the first example of the batch and then x2 in the second example of the batch. This I only suspect must be so, as in many other context mini-batch is derived from a cost function that is static (not fixing vars across the batch).
I hope I am not saying something too stupid. If so apologies. Bottom line is that reward and gradient clipping (I just noticed tensorflow makes that easy to implement), ending episode when life is lost, using a one hot instead of gather, and ReLU at the final fully connected layer, may combined boost learning significantly, here, and in what may apply from here to a3c (then I care the most about).
Depending on how conv2d combines channel filters, maybe history is getting reduced to one state in the first convolutional layer. Maybe depthwise_conv2d is relevant or, alternatively, flattening the stack prior to feeding in to the network. I'm more comfortable with RNNs than with CNNs, so I could be off base here though :)
I just edited my response which was wrong. I really don't know how conv usually combines the channels. Simplest case is averaging, in cs231 hey say sum as in many other resources. The history isn't necessarily lost, and will be spread along the filters.
Unfortunately, it seems to have plateaued again. Currently at episode 3866. Here's the code I used:
fc1 = tf.contrib.layers.fully_connected(flattened, 512, activation_fn=tf.nn.relu, biases_initializer=tf.constant_initializer(0.0))
self.predictions = tf.contrib.layers.fully_connected(fc1, len(VALID_ACTIONS), biases_initializer=tf.constant_initializer(0.0))
Is something like this possible? If the action is not taken, set the target value be equal to the current Q prediction for that action. For the (one) Action taken, set the target y value be the one of our target value q(s',a'). Or does it need a one hot? If possible, then
self.y_pl is filled with the TD target we calculated (using s',a') when the index corresponds to the action taken, and all other y targets (actions not chosen) are filled with just the predicted value from self.predictions (same value in Q vs target y). Then just:
# Calcualte the loss
self.losses = tf.squared_difference(self.y_pl, self.predictions)
self.loss = tf.reduce_mean(self.losses)
Thanks for the comments. Here are a few thoughts:
Clipping of rewards to range (1,-1) - I think breakout not affected
Good point. Likely to make a difference for some of the games.
Clipping of the gradients too, to limit any one update the hard way if necessary (not sure how they do this or how real, they'd need to mess with compute gradients output.
This won't make a difference in learning ability, it'll only help you avoid exploding gradients. So as long as you don't get NaN values for your gradients this is not necessary. It's easy to add though.
They consider terminal state after LOSS OF LIFE not after episode ends. This is reported to have a large impact in the learning rate and in the score you can achieve (likely, because without it, it's hard to notice for the agent that losing a life is really very bad and not "free")
Really good point and may be the key here. This could make a huge difference. I looked at the paper again and I actually can't find details on this. Can you point me to it?
But with Gather (instead of one hot) the optimizer reduces the cost in one action, without balancing the need to not change the value estimates for the other actions. So, regardless of how Gather works, a one hot is needed so that when the optimizer starts looking at the derivatives, it takes into account the other actions cost, and thus tries to find the direction which improves the current action cost (lowering it) while affecting the other action values (ie. current output value) as little as possible.
I don't think this is right. It is correct to only take into the account the current action. Remember that the goal of the value function approximator is the predict the total value of a state action pair, q(s,a)
. The value of an s,a
pair is completely independent of other actions in the same state. In fact, the only reason our estimator even outputs values for all actions is because it's faster. In theory you could have a separate estimator for each actions (as it is done in regular Q-Learning). You may be thinking of policy-based methods like policy gradients that predict probabilities for all actions. There we need to take into account the actions that are not taken, but not here. I'm pretty confident the current code is correct.
The other is regarding the mini-batch. I am not sure how TensorFlow mini-batch works, but it may be possible that it's not designed to have Variables corresponding to different examples within a single Batch
I'm not sure if I understand this point. Can you explain more? However, I'm pretty confident that the optimization is right.
So I think it may be worth looking at the reward structure of the environment and what they do in the original paper...
I'll try to refactor the code on the weekend to look at the gradients in Tensorboard. That should give some insight into what's going on.
Denny,
Really good point and may be the key here. This could make a huge difference. I looked at the paper again and I actually can't find details on this. Can you point me to it?
In the paper, under Methods, Training, they mention this:
For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.
I've seen in some forums that they take this for granted, and they tried it in at least Breakout and it made a large difference.
Regarding the optimization, thanks for taking the time to explain. I highly appreciate it, and will sit on it and try to become more familiar with the concepts. My comment was inspired by comment by Karpathy that implemented REINFORCE.js and that in that page literally mentions under Bell & Whistles, Modeling Q(s,a):
Step 3. Forward fθ(st) and apply a simple L2 regression loss on the dimension a(t) of the output, to be y. This gradient will have a very simple form where the predicted value is simply subtracted from y. The other output dimensions have zero gradient.
But I see how I miss-read it. It confuses me a bit that having 10 outputs can be the same as having 10 networks dedicated to each action (not that both don't work, but how the 10 outputs can be as expressive).
I am also probably mixing things...such as when doing classification with softmax, we set labels for all classes, the wrong ones we send closer to 0 and the one that was right towards 1. So it trains the network to not only output something closer to one for the right label, but also trains the other labels to be closer to 0. In this case (DQN) I though similarly, that our "0" for the other actions was whatever value the function estimates (which is not relevant in the current state, but we don't want to change much because those action where not taken), thus I thought their equivalent to "0" was to set them to the value the function estimates, and the "1" was the cost we actually calculate for the action taken. I know I am still pulling my hair, ... :-)
For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.
Hm, but that doesn't mean the end of an episode happens when a life is lost. It could also mean that they use the life counter to end the episode once the count reaches 0. It's a bit ambiguous. It's definitely worth trying out.
I'm having a hard time wrapping my head around the impact of ending an episode at the loss of a life considering that training uses random samples of replay memory rather than just the current episode's experience. Starting a new episode is not intrinsically punishing, is it? A negative reward for losing a life is obviously desirable though. I'd love to be wrong because this would be an easy win if it does make a difference.
On another note, I manually performed a few steps (fire, left, left) while printing out the processed state stack one depth (time step) at a time, and it checks out. The stacking code looks good to me.
@nerdoid I'm not 100% sure either, but here are some thoughts:
I know it's a bit far fetched of an explanation and I don't have a clear notion of how it affects things mathematically, but it's definitely worth trying.
@fferreres I think your main confusion is that you are mixing up policy based methods (like REINFORCE and actor critic) with value-based methods like Q-Learning. In policy-based methods you do need to take all actions into account because the predictions are a probability distribution. However, in the DQN code we are not even using a softmax ;)
Just to clarify the above: By not ending the episode when a life is lost you are basically teaching the agent that upon losing the visual state is reset and it can accumulate more reward in the same episode starting from scratch. In a way you are even encouraging it to get into states that lead to a loss of life because it learns that these states lead to a visual reset, but not end of episode, which is good for getting more reward. So I do believe this can make a huge difference. Does that make sense?
Ahhh, yeah that makes a lot of sense. Thanks. I wonder then about the relative impact of doing a full episode restart vs. just reinitializing the state stack and anything else like that. Either way it's worth trying.
Although I do still have this nagging feeling that the environment could just give a large negative reward for losing a life. Is that incorrect?
Thanks. I wonder then about the relative impact of doing a full episode restart vs. just reinitializing the state stack and anything else like that.
I think the key is that the targets are set to 0 when an life is lost, i.e. this line here:
targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * discount_factor * np.amax(q_values_next, axis=1)
So when a life is lost done
should be true and the env should have a full reset.
Although I do still have this nagging feeling that the environment could just give a large negative reward for losing a life. Is that incorrect?
It could. I agree this should also improve things but the learning problem would still be confusing to the estimator. The Q value (no matter if negative or positive) depends on how many lives you have left, but that information is nowhere in the feature representation.
I added a wrapper: https://github.com/dennybritz/reinforcement-learning/commit/18528d0dd743949e1555002be45af2053fc2b001#diff-14c9132835285c0345296e54d07e7eb2
I'll try to run this for A3C, would be great if you could try it for Q-Learning. Crossing my fingers that this fixes it ;)
Awesome! Yes, I will give this a go for the Q-Learning in just a bit here.
@dennybritz, yeah, I have things mixed up. Too many things to try to understand too quickly (new to Python, Numpy, TensorFlow, vim, Machine Learning, RL, refresh statistics/prob 101 :-)
Not sure if either of you have encountered this yet, but when a gym monitor is active, attempting to reset the environment when the episode is technically still running results in an error with message: "you cannot call reset() unless the episode is over." Here is the offending line.
A few ideas come to mind:
env.reset()
call in a try/except
. If we end up in the except
, it just means we lost a life rather than finishing. Skip resetting the state stack, and instead just continue like nothing happened. Important then to move the if done: break
after the state = next_state
at the end of the episode loop. We still benefit from the correct done
flag being added to replay memory in the previous step.record_video_every
episodes. So we do the bookkeeping manually, and start the monitor only when we want to record a new video. There is probably overhead with starting and stopping the monitor repeatedly. IDK though.The first option seems less hacky to me, but I might be overlooking something.
Here are a few threads discussing low scores vs paper. It can be a source of troubleshooting ideas:
https://github.com/tambetm/simple_dqn/issues/32
https://groups.google.com/forum/m/#!topic/deep-q-learning/JV384mQcylo
https://docs.google.com/spreadsheets/d/1nJIvkLOFk_zEfejS-Z-pHn5OGg-wDYJ8GI5Xn2wsKIk/htmlview
I re-read all the DQN.py (not the notebook) and assuming all np slices and advanced indexing and functions operating over axis are ok, I still can't find what could be causing the score to flatten at ~35.
I think troubleshooting may be time consuming but it is also a good learning experience
Insome of the posts some mention doing a max among consecutive frames, some that the stacking is for consecutive frames I've seen a karpathy post explaining policy gradients and he did difference of frames.
This was reported regarding loss of life:
I regarded the loss of life as the terminal state but didn't reset the game, which is different from DeepMind. Without the modification (original simple_dqn), it was 155 test score in 22 epoch in kung_fu_master. With this modification, I got 9,500 test score in 22 epoch and 19,893 test score in 29 epoch.
Thanks for researching this. I made a bunch of changes to the DQN code on a separate branch. It now reset the env when losing a life, it writes gradient summaries and the graph to Tensorboard, and has gradient clipping: https://github.com/dennybritz/reinforcement-learning/compare/dqn?expand=1#diff-67242013ab205792d1514be55451053c
I'm re-running it now. If this doesn't "fix" it the in my opinion most likely source of error is the frame processing. Looking at the threads:
DeterministicBreakout-v0
environment to get a fixed frame skip.I can't imagine that each of these makes a huge difference, but who knows. Here's the full paragraph:
Working directly with raw Atari 2600 frames, which are 2103 160
pixel images with a 128-colour palette, can be demanding in terms of computation
and memory requirements.We apply a basic preprocessing step aimed at reducing
the input dimensionality and dealing with some artefacts of the Atari 2600 emulator.
First, to encode a singlef rame we take the maximum valuefor each pixel colour
value over the frame being encoded and the previous frame. This was necessary to
remove flickering that is present in games where some objects appear only in even
frames while other objects appear only in odd frames, an artifact caused by the
limited number of sprites Atari 2600 can display at once. Second, we then extract
the Y channel, also known as luminance, from the RGB frame and rescale it to
84 x 84. The function w from algorithm 1 described below applies this preprocessing
to the m most recent frames and stacks them to produce the input to the
Q-function, in which m=4, although the algorithm is robust to different values of
m (for example, 3 or 5).
Something entirely plausible is that we didn't try "hard enough". Look for example at breakout in this page:
https://github.com/Jabberwockyll/deep_rl_ale/wiki
It remains close to 0 even at 4 million steps, then starts climbing but beed millions and millions to make a difference. Also, in many of the kearning charts there are long periods of diminishing scores, but they recover later. It takes a million sometimes to do that.
Our tries are about 2-3 millions. I may have to check the reward curves from the paper, but maybe there's nothing wrong, if the plateu is actually just the policy taking a "nap". Unfortunately, I am stuck with an Apple Air (slow). Maybe with $10 I could try a3c on a 32 cpu processor...not very familiar with aws though but a spot instance for a day would do it
I'd stick with trying DQN instead of A3C for now - there are fewer potential sources of error.
If the implementation you linked to works correctly (it seems like it does) it should be easy to figure out what the difference is:
Looking at it again there another difference from my code. They take 4 steps in the environment before making an update. I update after each step. That's not shown in the algorithm section in the paper, but it's mentioned in the appendix. So that's another thing to add.
Good point about the training time. If it takes 10M steps that's 2.5M updates (assuming you update only after 4 steps) and that'd be like 2-3 days looking at the current training speed I have on my GPU.
There's also the original DeepMind code here, although the Lua code can be less than intuitive at times :)
DeterministicBreakout-v0
as that makes it easier to reproduce their specs.Breakout-v0
steps a random amount of frames.This was reported regarding loss of life:
I regarded the loss of life as the terminal state but didn't reset the game, which is different from >>DeepMind.
Meaning I don't reset the environment. I did this because a) the gym crashes if I reset the environment when the episode isn't really over (@dennybritz I'm confused about why I'm getting that crash but you don't seem to be), and b) I'm not 100% sure it's better to actually reset the state stack with a loss of life, rather than just continuing and exposing the agent to late game states. Not sure though. Anyway, I'm running that with episode limit of 50,000, and it's averaging maybe ~50 steps per "episode", which will give a max of 2.5M steps. Which I guess is not nearly enough from the jabberwocky graphs that @fferreres linked to. Crap.
@nerdoid just set max episodes to 200k. It won't really affect anything but give more time to see how learning evolves. Also, another thing for om DQN paper is that iirc they store the best policies. This is critiziced, and not so "pure" but that's what was reporter and getting the higher scores. i.e.. for some games, after a while, they may start to "un-learn" too.
I have a limit in testing in my laptop which is 4GB (not expandable). I think I need a new laptop or rig, can't test much.
Dears
I would like to thank you all for the fruitful discussion. I have tried the A3C code and for around 1 million frames, and the performance looked fine (enhancing over time).
Regarding your discussion, I understand that the agent actually learns, but at some point it does not get any better.
My first guess would be playing with hyper parameters, mainly the learning rate should we use smaller and decaying learning rate?
Second, should the agent get punished for losing a life? My answer would be yes (it is bad to lose life) without resetting the episode. If the agent is not punished, my guess, the learning will be slower (progressing but slower because of delayed negative reward).
Entropy, In A3C paper, it was mentioned that:
We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies .... The hyperparameter β controls the strength of the entropy regularization term.
I think β was 0.01 in the code as in the paper. However, Is it possible, for some reason, that the agent converged to suboptimal policy?
Finally, I understand that Atari 2600 and OpenAi Gym are not compatible, and results should not be compared. However the differences should be marginal not that big as we see here.
What do you think?
Thank you
the differences should be marginal not that big as we see here.
We don't know until someone trains to to 10M to 50M steps. Denny did many improvements in the dqn version and awaiting how it impacts learning.
I'm still looking into it (and training)
I think it would be good to not talk about A3C in this thread. This is about DQN, not A3C. There may be other issues with A3C that are completely unrelated to this and talking about both algorithms here is more confusing than helpful. We should make sure that DQN works first.
Hi everybody, I have been training a DQN based on Denny's code and I would like to join the debug process. I took a look at the original code and I have one comment that may be useful in the current context:
I am willing to make some experiments and check the performance of the model. I would like to know, however, how long does it usually take for you guys to train the DQN. I have been training one model for over a week already and I estimate it will need a couple of days more... If this is the case, debugging will be quite complicated XD
By the way, what do you guys mean with DeterministicBreakout-v0? I have not been able to find anything about that online...
Check out the dqn/ branch in this repo. I made some changes there, but it's not done yet. In particular, the eval is still missing. I also realized what you mentioned above about the eval, but somehow gym got stuck when evaling over multiple episodes and I couldn't get it to work. That was a while ago though, they may have fixed it by now.
If you look at the gym code you can see that they have deterministic environments, but I don't think it really matters for DQN here. Doubt that it'll make a difference.
Hi guys, I have spent a lot of time implementing my DQN code and I have solved many of the same challenges that you seem to be facing. Right now I am crazy busy and I don't have the time to help you debug your code but I can give you some pointers of the things that are important and the things that aren't. Also, I haven't read all of discussion or the code so I might be pointing out things that you already know. Sorry about that.
Important stuff:
Not really sure how important:
Little difference (at least in breakout):
I am probably missing a lot of other important titbits. If I remember anything else I will come back. But probably you will see big improvements if you implement all of this correctly.
If you have any doubts you can check out my code
Thanks a lot @cgel, I'll try to spend some time next weekend to implement these changes. The optimizer seems like quite a bit of work though so I may skip that part and hope it still works. Intuitively I can't imagine that the optimizer would explain the difference between working and not working.
I am not 100% sure how big the effect was, but I think that it was more than considerable.
Just copy the function from my repo it is in DRL/agents/commonOps.py
and it is called graves_rmsprop_optimizer
. It is interchangeable with the call to the tf optimizer.
Otherwise you will probably need to run a very expensive optimizer-hyper-parameter sweep.
Hey @cgel, thanks a lot for the comments. I have been checking your code and I have two basic questions:
You said that we should pass terminal flag when life is lost, but I do not see that happening in your code. Are you also using the gym environment? Because both when training and testing, you have the following code, and you won't get a terminal flag when a life is lost, only when the game is over:
while not done:
action = agent.step(x, r)
x, r, done, info = env.step(action)
score += r
I noticed that you use a thread for training the network which starts running when _steps_beforetraining have been performed (nice idea by the way). This means that you do not really train the network after a fixed amount of steps, but you count on the computer to collect several experiences before the thread has finished executing the training function. Is that right?
steps_before_training
. Then it collects transitions and trains. My usage of multi thread is a bit messy. The objective is to have the update computation running parallel to the step in the environment. A few commits ago I simply called update
once step_count > steps_before_training
.Hi all,
if you're wondering about the optimizer that the NATURE paper used, the deep_q_rl library has a file, updates.py which uses what presumably was the correct gradient update rule, and indeed it wasn't RMSProp (@cgel is correct).
Edit: and I should add that I've run the deep_q_rl library many times and have gotten Breakout scores in the 300s-400s so something's going right. Actually, there was an issue over there in deep_q_rl that's similar to this one where the author wasn't getting the correct scores, but it looks like he/they resolved it. Part of it may have been due to gradient clipping.
-Daniel
Denny, and progress or ideas regarding these? It's a bummer emotionally - after DQN failed, I lost lot of faith in my own skills too. I can't fathom what's wrong at all.
Sorry, I've been busy with conference deadlines and haven't had time to look into this yet. Will do soon.
Awesome. I was checking some twitter posts, surely sore so much new and exciting it's almost impossible to cover everything.
@cgel Could you please elaborate a bit more on this suggestion?
don't tf.reduce_mean the losses in the batch. Use tf.reduce_max
Yes. The final scalar loss you are trying to optimise should be the sum of the individual losses of the batch (as opposed to their mean)
Hi Denny,
Thanks for this wonderful resource. It's been hugely helpful. Can you say what your results are when training the DQN solution? I've been unable to reproduce the results of the DeepMind paper. I'm using your DQN solution below, although I did try mine first :)
While training, the TensorBoard "episode_reward" graph peaks at an average reward of ~35, and then tapers off. When I run the final checkpoint with a fully greedy policy (no epsilon), I get similar rewards.
The DeepMind paper cites rewards around 400.
I have tried 1) implementing the error clipping from the paper, 2) changing the RMSProp arguments to (I think) reflect those from the paper, 3) changing the "replay_memory_size" and "epsilon_decay_steps" to the paper's 1,000,000, 4) enabling all six agent actions from the Open AI Gym. Still average reward peaks at 35.
Any pointers would be greatly appreciated.
Screenshot is of my still-running training session at episode 4878 with all above modifications to dqn.py: