IbrahimSobh opened 7 years ago
After reshaping the reward:
game.set_living_reward(1) # Each step is good for you!
game.set_death_penalty(500) # And death is not!
...
r = self.env.make_action(self.actions[a]) * 1.0
agent_health = self.env.get_game_variable(GameVariable.HEALTH)
if agent_health == 100:
    r = r + 50
Results: (disaster!!)
1. Your reward value & gradient are very large. You can compare these to the original TensorBoard of the Doom tutorial to see what I mean. I assume you adjusted the gradient clipping from the tutorial's 40 to a different value? I would try, like in the tutorial, normalizing to the ideal score so your ideal value is ~1.0. Then eyeball the grad norms to ensure your clip value is meaningful.
2. Relating to the reward value (and the next point), I would ensure the primary goal's reward is clear and its contribution prioritized, as you would in any loss function. The death penalty is very high, as is the aggregate of staying alive. I would play around with these values and maybe encourage picking up health a bit more.
3. These health-level episodes are pretty long, since you are trying to maximize episode length, and that works against you by burying the details of which actions led to success. Perhaps increasing the experience buffer limit from 30 to a larger number could help better capture some of the information in each episode. From the author's comments: "the network can get a more clear understanding of the environment with 100 steps of experience for example as opposed to 30." Source: https://medium.com/@awjuliani/hi-og-46c143a494af#.kawpktsfg
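For point 3, a minimal sketch of the cut-off I mean, mirroring the tutorial's condition with the buffer limit raised from 30 to 100 (the helper name is mine):

def should_train(episode_buffer, done, step_count, max_episode_length, rollout_length=100):
    # Flush the buffer and run a training step once `rollout_length` steps of
    # experience have accumulated, unless the episode has just ended anyway.
    return (len(episode_buffer) == rollout_length
            and not done
            and step_count != max_episode_length - 1)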
Thank you so much @DMTSource
Point Number 1
Could you please send me a link to the TensorBoard of the Doom tutorial? (I am very sorry, I could not find it.)
As I understand it, we use gradient clipping to make sure that gradients do not get too high. However, in the A3C Doom code for the basic Doom scenario in this repo, it is:
grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)
I did not change it in the health gathering scenario. Why is it 40.0 there, and why should it be 1.0? In other words, how exactly do we select the clipping value?
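Following your suggestion to eyeball the grad norms, this is roughly what I would plot (a sketch; the function and summary names are mine):

import tensorflow as tf

def clipped_gradients(loss, variables, clip_norm=40.0):
    # The unclipped global norm is the quantity to watch in TensorBoard when
    # choosing clip_norm; clipping rescales all gradients together if it is exceeded.
    grads = tf.gradients(loss, variables)
    grad_norm = tf.global_norm(grads)
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm)
    tf.summary.scalar('grad_norm_unclipped', grad_norm)
    return clipped, grad_norm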
Point Number 2
I tried to encourage the agent to pick up health by adding a big reward when its health is 100:
if agent_health == 100:
    r = r + 50
I also made the death penalty very high to teach the agent NOT to die.
What is wrong with this?
Point Number 3
I will try increasing the buffer to 100; it makes a lot of sense, thank you.
For reference:
Here is the result for Doom Basic scenario:
I can see the following:
Policy loss and value loss both decrease gradually (I believe this is good, because we want to minimize the loss)
Over time, episode length gets shorter (this is good, because we want to shoot the enemy as fast as possible)
Over time, reward and value get higher
Entropy: starts at 1.1 (I thought it should be 1.0!) (high uncertainty) and ends at values between 0.75 and 1.0 for each agent/worker
What about Gradient Norm, Entropy, and Var Norm? How should we interpret them?
The main thing to take away from the Doom demo for now, I am guessing, is the reward magnitude. The original demo A3C-Doom.ipynb uses "r = self.env.make_action(self.actions[a]) / 100.0"
You need to adjust this 100 to around the max possible episode total reward. This will help a lot with preventing your gradient norms from exploding. Then you can begin to explore a useful clip value (maybe start high, at around 100, and then reduce it once you see some plots of the Grad Norm and how it behaves once some training has occurred).
I would also suggest having the max possible episode total punishment not be too far from -1 after you perform this normalization. You want the network to learn, but too hard a punishment might yield a bumpy gradient once again. I mentioned this before when suggesting you reduce the death penalty, but only if it is very large compared to the max possible episode reward.
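Putting rough numbers on that (illustrative values, not ones from your runs): if the living reward is 1 per step and an episode can last about 2100 steps, something like this keeps the ideal total near 1.0 and the worst case near -1.0:

# Illustrative constants only; estimate them for your own scenario.
MAX_EPISODE_REWARD = 2100.0   # best possible episode total (living reward of 1 for ~2100 steps)
DEATH_PENALTY = 2100.0        # scales to about -1.0 after normalization

def scale_reward(raw_reward):
    # Apply to every per-step reward (and to the death penalty) before training.
    return raw_reward / MAX_EPISODE_REWARD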
Dear @DMTSource Dear @awjuliani
After many trials, I finally got reasonable performance. (I do need your comments and suggestions, please!)
I encouraged the agent to pick up the medical kits:
The code:
prev_health = self.env.get_game_variable(GameVariable.HEALTH)
r = self.env.make_action(self.actions[a]) / 100.0
next_health = self.env.get_game_variable(GameVariable.HEALTH)
is_dead = self.env.is_episode_finished()
if not is_dead:
    if next_health > prev_health:  # not dead and picked up a medkit, so give a higher reward
        r = 30.0 / 100.0
Results
Given all the above:
What do you think?!
Would it be possible to do a much longer run? May I also ask what your learning rate is? Can you try a value one order of magnitude larger (say 1e-3 instead of 1e-4)?
Your entropy curve looks like it was just about to 'fall off a cliff' and start heading to zero (or wherever it converges). Same for your loss and your reward. It looks like things were only just getting awesome when the training was stopped.
Sometimes the agent can learn something nifty, but lame. That can lead to a suboptimal strategy, which can give you the high variance. Imagine a child learning a bad golf swing: they then have to unlearn it and relearn a new swing. Try giving it more time and wait for a convergence across the graphs to show "hey, this is where I ended up". If that "overfits" to a bad solution, THEN I'd say inspect the graphs for a good stopping point and/or go back to the reward system and see what can be improved. You have to imagine what sort of strategies the rewards can lead to, because the agent is going to game that system and get away with anything it can.
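For the learning-rate test, assuming you kept the Adam optimizer the notebook uses (only the constant matters if you swapped it), the change would just be:

import tensorflow as tf

trainer = tf.train.AdamOptimizer(learning_rate=1e-3)  # one order of magnitude above the tutorial's 1e-4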
Thank you @DMTSource, I do appreciate your comments.
Learning rate: the same as in the original code. Here are the results after a short time... it seems the agent is collapsing!! Why?
I almost cried :(
To my previous point, can an imbalance arise due to the per-frame reward? I worry that this is not instructive. Sure, stay alive, but what about wandering around? The key is to run from reward to reward, I'm guessing? So perhaps minimize action by putting a cost (-1) on each movement and (-2) if it does nothing? Something like that might help encourage it to figure out a method to better identify and move to targets as quickly as possible. "Might" being the key word there.
Yes, the agent seems to run away above. I have been seeing this a lot in my own agents and their custom environment. I assume it's something like an exploding gradient or some issue, as the entropy actually goes to zero. The network will probably NaN out after a while. Perhaps @awjuliani can help us with that.
One suggestion is the total reward plot: once again it is in the ~20 range. You want to normalize that value back down to 1, as I mentioned in previous posts, since large values will lead to larger and wonky gradients. Just increase your normalizing factor in the worker function (I think); you don't have to mess with the reward system or anything fancy.
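Something like this is what I mean by an action cost (purely hypothetical numbers and names; you would need to know which action index is the do-nothing action):

def action_cost(action_index, noop_index, move_cost=1.0, noop_cost=2.0, normalizer=100.0):
    # Small penalty for every movement, a slightly larger one for doing nothing,
    # scaled by the same normalizer as the rest of the reward.
    cost = noop_cost if action_index == noop_index else move_cost
    return -cost / normalizer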
Thank you so much @DMTSource
I think the problem of exploding gradient is solved by gradient clipping
grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)
Or do we need to take care of the gradients even after clipping?!
@DMTSource @awjuliani
I used skip count = 4, as I mentioned here.
But again, the agent collapsed after such promising performance!!
Here is a long run with skip count = 4:
The agent achieves high scores, then goes down, then again achieves high scores... and finally, it becomes an idiot agent!
The yellow marks indicate that the agent was smart enough to get through a 2100-step episode without death.
What is going on? :(
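For the skip count, I rely on make_action taking a tic count as its second argument (a sketch; the helper name is mine):

def step_with_skip(game, action, skip_count=4, normalizer=100.0):
    # ViZDoom repeats the action for `skip_count` tics and returns the summed reward.
    return game.make_action(action, skip_count) / normalizer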
@DMTSource
After making some small changes (based on our discussion above), I got much better, reasonable performance, but the agent collapsed at the end!
Here are the other details...
Looks great before the collapse! Maybe you can stop the training around 1k, when things look good, and restart/load the model with a lower learning rate or other changes in the hyperparams to explore ways to avoid the collapse.
Aside from the length, all the plots are showing triangles at the collapse (NaN values). I ran into this issue with #27 and have been trying to explore ways to prevent the collapse.
Thank you @DMTSource
Thanks for your patience ...
Once and for all, I need to review again how to set:
The clipping value (40.0)? How will this affect the training?
My Understanding:
In deep learning, we sometimes suffer from exploding gradients, and the solution is simple: gradient clipping. In this code, tf.clip_by_global_norm rescales all the gradients together whenever their global norm is larger than 40.0, so that the global norm becomes 40.0 (it does not clip each element independently). We can get the gradients of the trainable variables using:
self.gradients = tf.gradients(self.loss,local_vars)
Moreover, we monitor the norm of our trainable variables using self.var_norms:
self.var_norms = tf.global_norm(local_vars)
By the way, in almost all cases, Var Norm is around 35 to 40 (I think it depends on the network initialization), correct?
Generally speaking, trainable variables are updated as follows:
trainable_variable = trainable_variable - (learning_rate * gradient)
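To make that concrete, here is how I understand the update path in the notebook, reconstructed from the snippets above (a sketch; the function name and signature are mine):

import tensorflow as tf

def build_update_op(loss, local_vars, global_vars, trainer, clip_norm=40.0):
    # Gradients of the worker's loss w.r.t. its local variables.
    grads = tf.gradients(loss, local_vars)
    # Norm of the variables themselves (the Var Norm plot).
    var_norm = tf.global_norm(local_vars)
    # Rescale all gradients together if their global norm exceeds clip_norm;
    # the returned grad_norm is the norm *before* clipping (the Grad Norm plot).
    clipped_grads, grad_norm = tf.clip_by_global_norm(grads, clip_norm)
    # The optimizer then takes the descent step (variables move against the gradient).
    apply_grads = trainer.apply_gradients(zip(clipped_grads, global_vars))
    return apply_grads, grad_norm, var_norm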
Conclusion (I am not sure):
We should update the trainable variables in a reasonable way.
However, the value of the reward that the agent receives has a very big effect on the whole training process. Large rewards: instead of (r / 100.0) I used (r / 10.0); there was no convergence at all, and the gradient norm was around 3000~4000 (very large gradients).
Small rewards: instead of (r / 100.0) I used (r / 200.0); there was delayed convergence. During training, the gradient norm was around 15.
In both cases, the Var Norm was around 35~40.
The question is: How to set the values for:
How to make wise decisions based on these plots:
@DMTSource @awjuliani
Your comments are very appreciated
After playing with the numbers, here is another promising result:
Notes:
But after some time... the same NaN problem happens...
@DMTSource @awjuliani
By changing this value from 30 to 3, everything exploded! Why?
if len(episode_buffer) == 30 and d != True and episode_step_count != max_episode_length - 1:
to
if len(episode_buffer) == 3 and d != True and episode_step_count != max_episode_length - 1:
I think this is the sequence length for the LSTM layer (correct?). Making the sequence too small (3 instead of 30) caused a problem... why?!
My reasoning:
Having a small LSTM sequence of 3 means that we have only 3 recurrent steps in the time domain, and the gradients are too large because the sequence is short (gradients usually vanish when we have a long sequence)... correct?
Again, how do we adjust all of these parameters?! Rewards, gradients, LSTM length, ...
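For reference, my current understanding of what that buffer length feeds into, based on the tutorial's train() function (discount() is copied from the notebook; n_step_targets is my own wrapper and the names are mine):

import numpy as np
import scipy.signal

def discount(x, gamma):
    # Discounted cumulative sum along the rollout (same helper as in the notebook).
    return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]

def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    # The buffer length (the 30 in len(episode_buffer) == 30) is how many real
    # rewards enter each target before the bootstrapped value takes over; it is
    # also the sequence length fed to the LSTM during the update. With only 3,
    # nearly everything rests on the (initially poor) value estimate.
    rewards_plus = np.asarray(list(rewards) + [bootstrap_value])
    return discount(rewards_plus, gamma)[:-1]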
A possible solution for the NaN problem; what do you think?
http://stackoverflow.com/questions/33712178/tensorflow-nan-bug
After trial and error, here is my best result so far:
- LSTM length = 9 (instead of 30)
- Clip = 55.0 (instead of 40.0)
- Reward = 0.0, except when the agent picks up a medical kit: reward = 0.0643 (45/700)
- Skip count = 4
Is there a better way to choose these numbers?!
[image: lstm9_r45by700_clip55] https://cloud.githubusercontent.com/assets/19908396/25412712/0ea60770-2a25-11e7-8505-07757d11a74f.PNG
[image: lstm9_r45by700_clip55_episodelength] https://cloud.githubusercontent.com/assets/19908396/25412720/15a2e782-2a25-11e7-9fa1-c784b5afeb14.PNG
What sort of initialization are you using on your weights? You might get a better answer faster using Xavier initialization, a very popular technique in TensorFlow. Try setting the initializer/weights_initializer to tf.contrib.layers.xavier_initializer().
As for a "better way", you could perform a Bayesian optimization using something like Spearmint to automatically optimize the hyperparameters, with multiple concurrent experiments (on a single machine or several, if you have the resources) to speed up the convergence of the Gaussian process model. It would still take a lot of time, but you would not have to do any manual sampling of parameter combinations.
@DMTSource thanks for the prompt response
I am using the same initialization as in the original A3C Doom code.
For the conv layers, I think the default initializer is used:
self.conv1 = slim.conv2d(activation_fn=tf.nn.elu,
                         inputs=self.imageIn,
                         num_outputs=16,
                         kernel_size=[8,8],
                         stride=[4,4],
                         padding='VALID')
For Policy and Value, a special function is used for initialization:
self.policy = slim.fully_connected(rnn_out, a_size,
                                   activation_fn=tf.nn.softmax,
                                   weights_initializer=normalized_columns_initializer(0.01),
                                   biases_initializer=None)
self.value = slim.fully_connected(rnn_out, 1,
                                  activation_fn=None,
                                  weights_initializer=normalized_columns_initializer(1.0),
                                  biases_initializer=None)
This function is used by @awjuliani (I wonder why he did not use one of the built-in initializers):
#Used to initialize weights for policy and value output layers
def normalized_columns_initializer(std=1.0):
def _initializer(shape, dtype=None, partition_info=None):
out = np.random.randn(*shape).astype(np.float32)
out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
return tf.constant(out)
return _initializer
As far as I understand, Xavier initialization is not good for ReLU; is it better for the conv layers (ELU) and for the policy and value layers?
Did you try it and find it better? (Experiments take too long.)
Best Regards
I agree, and I have also read that you definitely don't want to use Xavier with RNNs either. I think the FC layers would benefit the most. I will give it a try in the FC layers only, with a learning rate of 1e-3, to see if training can be sped up any.
One user had a great deal of luck on previous tutorials by initializing the FC layers: https://github.com/awjuliani/DeepRL-Agents/pull/32
Thank you @DMTSource
For faster results I tried the Doom Basic scenario. I used Xavier only for the policy and value FC layers, as follows:
self.policy = slim.fully_connected(rnn_out, a_size,
                                   activation_fn=tf.nn.softmax,
                                   # weights_initializer=normalized_columns_initializer(0.01),
                                   weights_initializer=tf.contrib.layers.xavier_initializer(),
                                   biases_initializer=None)
self.value = slim.fully_connected(rnn_out, 1,
                                  activation_fn=None,
                                  # weights_initializer=normalized_columns_initializer(1.0),
                                  weights_initializer=tf.contrib.layers.xavier_initializer(),
                                  biases_initializer=None)
Here is the comparison:
Using normalized_columns_initializer, as in the original code:
Using tf.contrib.layers.xavier_initializer:
Conclusion: in this Doom Basic scenario there is no big difference; however, with Xavier the episode length looks smoother when converging.
Learning rate: again, in the Doom Basic scenario, I tried different learning rates:
Thanks and waiting for your comments
Best Regards
@DMTSource
Here is the performance on the health gathering scenario (exactly as above, but using Xavier initialization).
Compare this with the original code's initialization below...
Xavier looked promising and smoother, but collapsed!
Here is the best/stable/fast result so far:
I used the labels buffer with label[label > 0] = 255
I concatenated the agent location and health onto the flat fully connected layer after the CNN (see the sketch after this list)
Skip count = 4 (I want to try 3)
Reward for picking up a medkit = 0.1 (if new_health > old_health after executing the action)
Reward for being alive = 0.001 (small)
LSTM length = 6 (if len(episode_buffer) == 6 and ...)
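A rough sketch of how the first two items and the reward shaping fit together (the helper names are mine; it assumes the labels buffer is enabled in the config and uses GameVariable.POSITION_X/POSITION_Y for the location):

import numpy as np
from vizdoom import GameVariable

def preprocess_labels(state):
    # Binarize the labels buffer so every labeled object (e.g. a medkit) becomes 255.
    label = np.copy(state.labels_buffer)
    label[label > 0] = 255
    return label

def extra_features(game):
    # Agent location and health, to be concatenated onto the flat layer after the CNN.
    return np.array([game.get_game_variable(GameVariable.POSITION_X),
                     game.get_game_variable(GameVariable.POSITION_Y),
                     game.get_game_variable(GameVariable.HEALTH)])

def shaped_reward(old_health, new_health, alive_reward=0.001, medkit_reward=0.1):
    # Small reward for staying alive, a larger one when a medkit raised the health.
    return medkit_reward if new_health > old_health else alive_reward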
But after some time, the agent collapsed as usual (without NaNs).
And after more time...
Why?! :(
@IbrahimSobh, how is your result now?
Have you tried combinations of actions in the example code of ViZdoom?
In [1]: import itertools as it
In [2]: actions = [list(perm) for perm in it.product([False, True], repeat=3)]
In [3]: actions
Out[3]:
[[False, False, False],
[False, False, True],
[False, True, False],
[False, True, True],
[True, False, False],
[True, False, True],
[True, True, False],
[True, True, True]]
In each step, the agent can take a combination of buttons.
Hi @GoingMyWay
Mainly, I made the learning rate smaller (1e-4 / 4) and got slower but more stable convergence... hyper-parameter optimization is a nightmare.
No, I did not try this; I think three actions are simpler...
@IbrahimSobh Are you running on a GPU or CPU? I read of someone having issues with multithreading, GPU & TensorFlow... he solved the problem by running it on the CPU.
@wolleraudude, I ran the code on a GPU (with tf.device('/gpu:1')), spawning multiple agents. After 10 hours of training, the performance collapsed. My scenario is deadly_corridor, not health pack gathering.
@GoingMyWay I am running "Breakout" from OpenAI Gym and mine also collapses. I was running on a GPU. I might give it a try on the CPU...
@wolleraudude @GoingMyWay
I am running on the CPU (as the A3C paper suggests and as Arthur's code is meant to run).
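For reference, pinning graph construction to the CPU the way the notebook does looks like this (the variable shown is the notebook's global episode counter):

import tensorflow as tf

with tf.device("/cpu:0"):
    # Build the master network, the workers, and shared counters on the CPU.
    global_episodes = tf.Variable(0, dtype=tf.int32, name='global_episodes', trainable=False)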
Hi Arthur
Instead of the basic scenario, I used the health_gathering.cfg scenario.
Where:
and
r = self.env.make_action(self.actions[a]) * 1.0
r = self.env.get_game_variable(GameVariable.HEALTH)
r = r + self.env.get_game_variable(GameVariable.HEALTH)
What do you think?