A3C Doom: Health Gathering

IbrahimSobh commented 7 years ago

Hi Arthur

Instead of basic scenario, I used health_gathering.cfg scenario

doom_health

Where:

        game.set_doom_scenario_path("health_gathering.wad")
        game.set_screen_resolution(ScreenResolution.RES_160X120) 
        game.set_screen_format(ScreenFormat.GRAY8)
        game.set_render_hud(False)
        game.set_render_crosshair(False)
        game.set_render_weapon(False)
        game.set_render_decals(False)
        game.set_render_particles(False)

        game.add_available_button(Button.TURN_LEFT)
        game.add_available_button(Button.TURN_RIGHT)
        game.add_available_button(Button.MOVE_FORWARD)

        game.add_available_game_variable(GameVariable.HEALTH)

        game.set_episode_timeout(2100)
        game.set_episode_start_time(10)

        game.set_window_visible(False)
        game.set_sound_enabled(False)

        game.set_living_reward(1) # Each step is good for you!
        game.set_death_penalty(100) # And death is not!

        game.set_mode(Mode.PLAYER)
        game.init()
        self.actions = [[True,False,False],[False,True,False],[False,False,True]]
        #End Doom set-up
        self.env = game

and

r = self.env.make_action(self.actions[a]) * 1.0

It seems that the agent is not learning!

d_h

I am thinking to use HEALTH to help the agent:

r = self.env.get_game_variable(GameVariable.HEALTH)

OR, to reshape the reward:

r = r + self.env.get_game_variable(GameVariable.HEALTH)

Or, should I wait for more time?!

What do you think?

IbrahimSobh commented 7 years ago

After reshaping the reward:

game.set_living_reward(1) # Each step is good for you!
game.set_death_penalty(500) # And death is not!
.
.
.
.
r = self.env.make_action(self.actions[a]) * 1.0
agent_health = self.env.get_game_variable(GameVariable.HEALTH)

if agent_health == 100:
        r = r + 50

Results: (disaster!!)

health_reshape

DMTSource commented 7 years ago

1.Your reward value & gradient are very large. You can compare these to the original Tensorboard of the Doom tutorial to see what I mean. I assume you adjusted the gradient clipping from the tutorials 40 to a different value? I would try, like in the tutorial, normalizing to the ideal score so your ideal value is ~1.0. Then eyeball the grad norms to ensure your clip value is meaningful.

2.Relating to the reward value(and the next point) I would ensure the primary goal's reward is clear and their contributions prioritized as you would in any loss function. The death penalty is very high, as is the aggregate of staying alive. I would play around with these values and maybe encourage picking up health a bit more.

3.These health level episodes are pretty long since you are trying to maximize episode length, which is working against you to bury the details of what actions led up in success. Perhaps increasing the experience buffer limit from 30 to a larger number could help better capture some of the information in each episode. From the authors comments: "the network can get a more clear understanding of the environment with 100 steps of experience for example as opposed to 30." Source: https://medium.com/@awjuliani/hi-og-46c143a494af#.kawpktsfg

IbrahimSobh commented 7 years ago

Thank you so much @DMTSource

Point Number 1

Could you please send us me a link for tensorborad of Doom tutorial (I am very sorry I could not find it)

As I understand, we use gradient clipping to make sure that gradients do not get too high, however in A3C Doom code, basic Doom scenario, in this repo it is:

grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)

I did not change it in health gathering scenario. why it is 40.0 there and why it should be 1.0? In other words, how exactly we select the clipping value?

Point Number 2

I tried to encourage the agent to pick health by adding some big reward if its health is 100.

if agent_health == 100:
        r = r + 50

I also make the death penalty very high to teach the agent NOT to die

What is wrong with this?

Point Number 3

I will try increasing the buffer to 100, it makes a lot of sense , thank you

For reference:

Here is the result for Doom Basic scenario:

doom_basic_all

I can see the following:

Policy loss and Value loss both go gradually (I believe this is good because we want to minimize the loss)
By time, Episode length get shorter (this is good because we want to shoot the enemy as fast as possible)
By time, Reward and value get higher
Entropy: Start by 1.1 (it should be 1.0!) (high uncertainty) and ends by values (0.75 to 1.0 for each agent/worker)
What about gradient_Norm, entropy and Var_Norm? how should we interpret them

DMTSource commented 7 years ago

The main thing to take away from the doom demo for now, I am guessing, is the Reward magnitude. The original demo A3C-Doom.ipynb uses "r = self.env.make_action(self.actions[a]) / 100.0"

You need to adjust this 100 to around the max possible episode total reward. This will help allot with preventing your gradient normal's from exploding. Then you can begin to explore a useful clip value(maybe start high at like 100 and then reduce it once you see some plots of the Grad Norm and how it acts once some training has occurred).

I would also suggest having the max possible episode total punishment not be too far from -1 after you perform this normalization. You want the network to learn, but too hard a punishment might yield a bumpy gradient once again. I mentioned this before when reducing the death penalty, but only if it was very large compared to the max possible episode reward.

IbrahimSobh commented 7 years ago

Dear @DMTSource Dear @awjuliani

After many trials, finally I got reasonable performance: (I do need your comments and suggestions plzzzz)

I used label_buffer instead of screen_buffer (I think it will be much easier for learning, correct?)

I encouraged the agent to pick the medical kits:

Agent get the normal reward (1.0/100.0) for being alive, and gets much more reward (30.0/100.0) if it gets a medical kit (if its health increased)

The code:

prev_health = self.env.get_game_variable(GameVariable.HEALTH)
r = self.env.make_action(self.actions[a]) / 100.0
next_health = self.env.get_game_variable(GameVariable.HEALTH)
is_dead = self.env.is_episode_finished()
                    if is_dead == False:  
                        if next_health > prev_health: # if not dead and have medical,then reward is high
                            r = 30.0/100.0

Results

health01

In this scenario there are much more possible states compared to the doom basic scenario
Policy loss looks strange compared to the doom basic scenario
Value loss is converging to 0.7 not zero!
The maximum episode length is 2100 (as in the original cfg of vizdoom)
At the end of training, agents can survive for 1600 steps or more (dummy agent can only survive for 500 steps)

Given all the above:

What is the best value for gradient clipping? (I used 55, but I guess 70 or 60 could be better?!)
In the figure, Gradient Norm exceeds 100 however in the code I mage the clipping value 55, why?!
The agent can get high rewards (and episode length is large), however the performance variance is high (In other words, the agent, many times, get high rewards, but not in most of the time, some times performance is poor, why? and how to fix? )

what do you think?!

DMTSource commented 7 years ago

Would it be possible to do a much longer run? May I also ask what your learning rate is? Can you try a value one order of magnitude larger(say 1e-3 instead of 1e-4)?

Your Entropy curve looks like it was just about to 'fall off a cliff' and start heading to zero(or wherever it converges). Same for your loss and your reward. It looks like things were only just getting awesome and the training was stopped.

Sometimes the agent can learn something nifty, but lame. That can lead to a sub optimal strategy which can give you the high variance. Imagine a child learning a bad golf swing: they then having to unlearn and then relearn a new swing. Try giving it more time and wait for a convergence across the graphs to show "hey this is where I ended up". If that "over fits" to a bad solution THEN I'd say inspect the graphs for a good stopping point and or go back to the reward system and see what can be improved. You have to imagine what sort of strategies the rewards can lead to because the agent is going to game that system and get away with anything it can.

IbrahimSobh commented 7 years ago

Thank you @DMTSource I do appreciate your comments ..

learning rate: the same as the original code Here are the results after some small time ... it seems the agent is collapsing!! why?!!!!!!

health02

I almost cried :(

DMTSource commented 7 years ago

To my previous point, can an imbalance arise due to the per frame reward? I worry that this is not instructive. Sure stay alive, but what about wandering around? The key is to run from reward to reward Im guessing? So perhaps minimize action by putting a cost(-1) on each movement and (-2) if it does nothing? Something like that might help encourage is to figure out a method to better identify and move to targets as quickly as possible." Might" being the key word there.

Yes the agents seems to run away above. I have been seeing this allot in my own agents and their custom environment. I assume its something like an exploding gradient or some issue as the entropy actually goes to zero. The network will probably nan out after a while. Perhaps @awjuliani can help us with that.

One suggestion is the total reward plot: once again is in the ~20 range. You want to normalize that value back down to 1 like I mentioned in previous posts as it will lead to larger and wonky gradients with large values. Just increase your normalizing factor in the worker function(i think), you dont have to mess with the reward system or anything fancy.

IbrahimSobh commented 7 years ago

Thank you so much @DMTSource

I think the problem of exploding gradient is solved by gradient clipping

grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)

or, do we need to take care of gradients even after the clipping?!

IbrahimSobh commented 7 years ago

@DMTSource @awjuliani

I used skip count = 4, I mentioned here

I got speed up (3X) (300 episode in 30 mins instead of 90 mins)
In the figure below, the maximum length of episode is 2100, in the figure it is 2100/4 skip count = 525
The agent manged to reach the maximum (sometimes at least) after only 150 episode
after 300 episode, the agent average episode length is around 1700 (appears in the figure as 1700/4 = 425)

But again. the agent collapsed after such promising performance!!

doom_health_skip_4

IbrahimSobh commented 7 years ago

here is a long run with skip count = 4

The agent achieves high scores then goes down then again achieves high scores .... and finally, it become an idiot agent!

The yellow marks to indicate that the agent was smart enough to make 2100 episode without dearth.

doom_health_skip_4_longrun

what is going on :(

IbrahimSobh commented 7 years ago

@DMTSource

After making some small changes (based on our discssion above), I got much better reasonable performance, but the agent collapsed at the end!

doom_health_best001

here are the other details ...

doom_health_best001_2

DMTSource commented 7 years ago

Looks great before the collapse! Maybe you can stop the training around 1k when things look good and restart/load the model with a lower learning rate or other changes in the hyperparams to explore ways to avoid the collapse.

Aside from the length, all the plots are showing triangles at the collapse(nan values). I ran into this issue with #27 and have been trying to explore ways to prevent the collapse.

IbrahimSobh commented 7 years ago

Thank you @DMTSource

Thanks for your patience ...

Once and for all, I need to review again how to set:

The clipping value? (40.0) How this will affect the training?

My Understanding:

In Deep Learning, some times we suffer from gradient exploding, the solution is simple: Gradient Clipping. In other words if the gradient is larger then 40.0 make it 40.0 and if smaller than -40.0 make it -40.0. We can know the gradient of trainable variables using:

self.gradients = tf.gradients(self.loss,local_vars)

Moreover, we monitor the Norm of our trainable variables using self.var_norms

self.var_norms = tf.global_norm(local_vars)

By the way, in almost all cases, Var Norm is around 35 to 40 (I think it depends on the network initialization) correct?

Generally speaking; trainable variables are updated as follows:

trainable_variable = trainable variables + (learning rate * gradient)

Conclusion: (I am not sure)

We should update the training variables in a reasonable way:

We should not make very large changes: large gradient and/or larger learning rate
We should not make very small changes: small gradient and/or small learning rate

However, the value of the reward that the agent receives, has a very big effect on the whole training process. Large rewards: Instead of (r / 100.0) I used (r/ 10.0), there was no convergence at all! and gradient Norm was around 3000 ~4000 (very large gradient)

Small rewards: Instead of (r / 100.0) I used (r/ 200.0), there was a delayed convergence. During training, the Gradient Norm was around 15

In both cases, the Var Norm was around 35 ~ 40

The question is: How to set the values for:

Rewards (1.0 or 0.0001, ... )
Clipping (40.0, 10.0, 100.0, ...)

How to make wise decisions based on these plots:

Value loss
Policy Loss
Value Norm
Gradient Norm
Entropy

IbrahimSobh commented 7 years ago

@DMTSource @awjuliani

Your comments are very appreciated

After playing with numbers here is another promising result:

Notes:

The maximum episode length is 525 = 2100/4 (in the figure below)
4 is the skip count
The agent after 700 episode managed to find a policy that keeps the agent alive for around 1900 out of 2100 episode length.

doom_health_best003

but after some time .... the same NaN problem happens ....

doom_health_best003_long_nan

IbrahimSobh commented 7 years ago

@DMTSource @awjuliani

By changing this value from 30 to 3, everything exploded! why?

if len(episode_buffer) == 30 and d != True and episode_step_count != max_episode_length - 1:

to

if len(episode_buffer) == 3 and d != True and episode_step_count != max_episode_length - 1:

I think this is the sequence length for the LSTM layer (correct?), making the sequence too small (3 instead of 30) caused a problem .... why?!

doom_health_lstm_3

My reason:

Having small 3 LSTM sequence means that we have only 3 recurrent layers in time domain, gradients are too large as the sequence is small (gradients usually vanish when we have long sequence) ... correct?

Again, how to adjust all of these parameters?!! rewards, gradients, LSTM length, ....

IbrahimSobh commented 7 years ago

A possible solution for NaN problem, what do you think?

http://stackoverflow.com/questions/33712178/tensorflow-nan-bug

IbrahimSobh commented 7 years ago

After try and error, here is my best result till now

LSTM length = 9 (instead of 30)
Clip = 55.0 (instead of 40.0)
reward = 0.0 except when agent picks a medical kit: reward = 0.0643 (45/700)
skip count = 4

I there a better way to choose these numbers?!!

lstm9_r45by700_clip55

lstm9_r45by700_clip55_episodelength

DMTSource commented 7 years ago

What sort of initialization are you using on your weights? You might get a better answer faster using Xavier initialization, a very popular technique in Tensorflow.. Try setting the initializer/weights_initializer to tf.contrib.layers.xavier_initializer()

As for a "better way", you could perform a Baysean optimization using something like Spearmint to automatically optimize the hyper parameters with multiple concurrent experiments(single or multiple machine if you have the resources) to speed up the convergence of the Gaussian Process model. It would still take allot of time but you would have to do no manual sampling of parameter combinations.

On Apr 25, 2017 8:08 PM, "Ibrahim Sobh" notifications@github.com wrote:

After try and error, here is my best result till now

LSTM length = 9 (instead of 30)

Clip = 55.0 (instead of 40.0)

reward = 0.0 except when agent picks a medical kit: reward = 0.0643 (50/700)

skip count = 4

I there a better way to choose these numbers?!!

[image: lstm9_r45by700_clip55] https://cloud.githubusercontent.com/assets/19908396/25412712/0ea60770-2a25-11e7-8505-07757d11a74f.PNG

[image: lstm9_r45by700_clip55_episodelength] https://cloud.githubusercontent.com/assets/19908396/25412720/15a2e782-2a25-11e7-9fa1-c784b5afeb14.PNG

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/awjuliani/DeepRL-Agents/issues/18#issuecomment-297200495, or mute the thread https://github.com/notifications/unsubscribe-auth/AIHXdkK9WetyrqD2WRwpaguStEE9SmtDks5rzosJgaJpZM4MfykN .

IbrahimSobh commented 7 years ago

@DMTSource thanks for the prompt response

I am using the same initialization in the original A3C Doom code

For conv layers (I think the default initialization is used)

self.conv1 = slim.conv2d(activation_fn=tf.nn.elu,
                inputs=self.imageIn,
                num_outputs=16,
                kernel_size=[8,8],
                stride=[4,4],
                padding='VALID')

For Policy and Value, a special function is used for initialization:

            self.policy = slim.fully_connected(rnn_out,a_size,
                activation_fn=tf.nn.softmax,
                weights_initializer=normalized_columns_initializer(0.01),
                biases_initializer=None)

            self.value = slim.fully_connected(rnn_out,1,
                activation_fn=None,
                weights_initializer=normalized_columns_initializer(1.0),
                biases_initializer=None)

Where this function is used by @awjuliani (I wounder why he did not use one of the supported initializers)

#Used to initialize weights for policy and value output layers
def normalized_columns_initializer(std=1.0):
    def _initializer(shape, dtype=None, partition_info=None):
        out = np.random.randn(*shape).astype(np.float32)
        out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
        return tf.constant(out)
    return _initializer

As far as I understand Xavier initialization is not good for RELU, is it better for conv layers (ELU), policy and value?

Did you try it and found it better? (experiments take too long)

Best Regards

DMTSource commented 7 years ago

I agree and have also read you def don't want to use Xavier with rnn's as well. I think the FC layers would benefit the most. I will give it a try in the FC layers only and a learning rate of 1e-3 to see if training can be sped up any.

One user had a great deal of luck on previous tutorials by initializing the FC layers: https://github.com/awjuliani/DeepRL-Agents/pull/32

IbrahimSobh commented 7 years ago

Thank you @DMTSource

For faster results I tried the Doom Basic Scenario. I used Xavier only for policy and value FC layers, as follows:

               self.policy = slim.fully_connected(rnn_out,a_size,
                activation_fn=tf.nn.softmax,
                #weights_initializer=normalized_columns_initializer(0.01),
                weights_initializer=tf.contrib.layers.xavier_initializer(),
                biases_initializer=None)

               self.value = slim.fully_connected(rnn_out,1,
                activation_fn=None,
                #weights_initializer=normalized_columns_initializer(1.0),
                weights_initializer=tf.contrib.layers.xavier_initializer(),
                biases_initializer=None)

Here is the comparison:

Using normalized_columns_initializer as in original code:

basic_org

Using tf.contrib.layers.xavier_initializer:

basic_xavier

Conclusion: In this Doom Basic scenario, there is not big difference, however Xavier episode length looks smoother when converging

Learning rate: Again, in Doom Basic scenario, I tried different learning rates:

1e-4 (this is the original value, converged slowly, after around 300 episodes)
1e-5 (converged slowly, after around 400 episodes)
1e-3 (did not converge even after 300 episode)

Thanks and waiting for your comments

Best Regards

IbrahimSobh commented 7 years ago

@DMTSource

Here is the performance of health gathering scenario (exactly like above, but using Xavier initialization )

health_xavier

health_xavier_episodelength

Compare this with the original code initialization below ...

lstm9_r45by700_clip55_episodelength

Xavier looked promising and smoother, but collapsed!

IbrahimSobh commented 7 years ago

Here is the best/stable/fast result so far: I used labels buffer and label[label> 0] = 255 I concatenated the agent location and health to the flat fully connected layer (after CNN) Skip count = 4 (I want to try 3) Reward for picking medikit = 0.1 (if new_health > old_health after executing action) Reward for being alive = 0.001 (small) LSTM length = 6 (if len(episode_buffer) == 6 and ...)

lstm9_r80by700_clip55_2img_255only_episodelength

but after some time, the agent collapsed as usual (without NaNs)

lstm9_r80by700_clip55_2img_255only_episodelength_all

and after more time ...

lstm9_r80by700_clip55_2img_255only_episodelength_2

why?! :(

GoingMyWay commented 7 years ago

@IbrahimSobh , How is your result now?

Have you tried combinations of actions in the example code of ViZdoom?

In [1]: import itertools as it

In [2]: actions = [list(perm) for perm in it.product([False, True], repeat=3)]

In [3]: actions
Out[3]: 
[[False, False, False],
 [False, False, True],
 [False, True, False],
 [False, True, True],
 [True, False, False],
 [True, False, True],
 [True, True, False],
 [True, True, True]]

In each step, the agent can take a combination of buttons.

IbrahimSobh commented 7 years ago

Hi @GoingMyWay

Mainly, I make the learning rate smaller (1e-4 / 4), and got slower convergence but more stable .... hyper-parameter optimization is a nightmare

no, I did not try this, I think three actions is simpler ...

wolleraudude commented 7 years ago

@IbrahimSobh Are you running on a GPU or CPU? I read of someone having issues with multithreading, GPU & tensorflow ... he solved the problem by running it on CPU.

GoingMyWay commented 7 years ago

@wolleraudude, I ran the code with GPU

with tf.device('/gpu:1'):
    spawning multiple agents

After 10 hours training, the performance collapsed, my scenario is deadly_corridor not healthpack gathering

wolleraudude commented 7 years ago

@GoingMyWay I am running "breakout" from ai gym & mine also collapses. I was running on GPU. I might give it a try on CPU...

IbrahimSobh commented 7 years ago

@wolleraudude @GoingMyWay

I am running on CPU (as A3C paper suggests ans as Arthur code code should run)

awjuliani / DeepRL-Agents

A3C Doom: Health Gathering #18