GoingMyWay / ViZDoomAgents

😈 Train ViZDoom agents by Reinforcement Learning 👻

deadly_corridor move reward #4

Open yhcao6 opened 6 years ago

yhcao6 commented 6 years ago

Can I ask what this move reward is used for? It seems the move reward will dominate the total reward.

GoingMyWay commented 6 years ago

@yhcao6 Could you please add code to provide more details on this issue?

yhcao6 commented 6 years ago

This is the code from "deadly_corridor/agent.py":

move_reward = self.env.make_action(self.actions[a_index], 4)  # reward returned by the engine for 4 tics (mostly movement credit in deadly_corridor)

ammo2_delta = self.env.get_game_variable(GameVariable.AMMO2) - last_total_ammo2
last_total_ammo2 = self.env.get_game_variable(GameVariable.AMMO2)
health_delta = self.env.get_game_variable(GameVariable.HEALTH) - last_total_health
last_total_health = self.env.get_game_variable(GameVariable.HEALTH)
health_reward = self.health_reward_function(health_delta)
ammo2_reward = self.ammo2_reward_function(ammo2_delta)
kill_reward, last_total_kills = self.kills_reward_function(last_total_kills)
reward = move_reward + health_reward + ammo2_reward + kill_reward  # shaped reward combining engine and hand-crafted terms
episode_reward += reward

I cannot understand why this moving reward is set, and I observed that it will dominate the total reward; say the reward is 350, maybe 340 of it is the moving reward.

By the way, can I ask about the performance of D3 Battle?

Thanks

GoingMyWay commented 6 years ago

@yhcao6 Well, in the deadly corridor scenario the agent must go across the corridor to get the armour, so the ViZDoom engine gives a large credit for moving. Yes, it is true that this reward will dominate the total reward; you can raise the reward for killing an enemy to mitigate this issue. I trained A3C on this scenario but failed, because reaching the destination is the task of the agent rather than killing enemies, yet the agent must kill enemies to make the corridor safe. It is a dilemma!
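
A rough sketch of what I mean by rebalancing (the scaling factors here are hypothetical, not values from this repository), reusing the variable names from your snippet:

    MOVE_REWARD_SCALE = 0.1   # hypothetical: shrink the engine's movement credit
    KILL_REWARD_SCALE = 10.0  # hypothetical: amplify the kill signal

    reward = (MOVE_REWARD_SCALE * move_reward
              + health_reward
              + ammo2_reward
              + KILL_REWARD_SCALE * kill_reward)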

On the performance of D3 Battle, it outperforms the average human score, ranging from 20 to 30 kills per episode. Obviously, it is not the best result.

yhcao6 commented 6 years ago

I think I will try D3 Battle instead.

Could I ask how long it takes to reach that level of performance?

I also have a question. I see you wrap the environment actions as follows:

If the actions are left [1, 0, 0], right [0, 1, 0], and shoot [0, 0, 1], then you set the actions as: [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0]. That is 3 x 2 combinations, since the agent can't move left and right at the same time.

My question is: can the simulator execute left and shoot at the same time?

Thanks!

GoingMyWay commented 6 years ago

@yhcao6 Yes, ViZDoom can press more than one button simultaneously. If you train it on 2 TITANs with 32 workers, it may take 10-14 days to get a decent result.
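
A quick sketch (not code from this repository) showing how ViZDoom accepts several pressed buttons in one action; the config path and button order here are just examples:

    from vizdoom import Button, DoomGame

    game = DoomGame()
    game.load_config("scenarios/basic.cfg")  # any scenario config works for this check
    game.clear_available_buttons()
    game.add_available_button(Button.MOVE_LEFT)
    game.add_available_button(Button.MOVE_RIGHT)
    game.add_available_button(Button.ATTACK)
    game.init()

    # Press MOVE_LEFT and ATTACK in the same tic: [left, right, shoot]
    reward = game.make_action([True, False, True])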

Are you a graduate student at CUHK, in Xiaoou Tang's lab?

yhcao6 commented 6 years ago

Do you mean that killing 20 to 30 enemies per episode needs 10-14 days of training? That is really a long time.

Yeah, I am a student at CUHK and I will graduate this July. I am going to be a research assistant at MML for one year. I am curious how you knew that?

GoingMyWay commented 6 years ago

Yes, killing 20-30 enemies. I read your profile on GitHub. Is MML a college? Maybe I will work as a research assistant at Nanyang Technological University. Currently, my research interest is RL.

yhcao6 commented 6 years ago

MML: Multimedia Laboratory.

I am also interested in RL.

GoingMyWay commented 6 years ago

@yhcao6 Cool, it's Xiaoou Tang's lab; it is really competitive to apply for an RA position there.

yhcao6 commented 6 years ago

In fact I applied to another professor, Dahua Lin. Luckily I did summer research at MML last summer; I think this is the main reason I could get the position.

GoingMyWay commented 6 years ago

@yhcao6 Great experience, you're so lucky. Let's keep in touch😀

yhcao6 commented 6 years ago

My pleasure!

yhcao6 commented 6 years ago

I just implemented an A3C-LSTM model for defend_the_center in PyTorch. My improvements are:

  1. Simplify the action set (only execute one action at a time, remove the static action)
  2. Add an LSTM layer
  3. Add frame normalization (see the sketch after this list)
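
For the frame normalization, this is roughly what I mean (a sketch, not my exact code):

    import numpy as np

    def normalize_frame(frame):
        # Scale raw pixel values from [0, 255] into [0, 1] before feeding the network.
        return frame.astype(np.float32) / 255.0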

I found the performance surprising; here is a screenshot of the training curve:

[Screenshot: defend_the_center training curve]

I just use 8 workers and 2 GTX TITANs; it takes only 15 minutes to reach a nice score.

Now I am working on D3 Battle; your code gives me a lot of inspiration.

By the way, could I ask about D3 Battle? Here is the code:

    game.set_labels_buffer_enabled(True)
    game.add_available_button(Button.MOVE_FORWARD)
    game.add_available_button(Button.MOVE_RIGHT)
    game.add_available_button(Button.MOVE_LEFT)
    game.add_available_button(Button.TURN_LEFT)
    game.add_available_button(Button.TURN_RIGHT)
    game.add_available_button(Button.ATTACK)
    game.add_available_button(Button.SPEED)
    game.add_available_game_variable(GameVariable.AMMO2)
    game.add_available_game_variable(GameVariable.HEALTH)
    game.add_available_game_variable(GameVariable.USER2)

    game.set_episode_start_time(5)
    game.set_window_visible(self.play)
    game.set_sound_enabled(False)
    game.set_living_reward(0)
    game.set_mode(Mode.PLAYER)
    if self.play:
        game.add_game_args("+viz_render_all 1")
        game.set_render_hud(False)
        game.set_ticrate(35)

Could I ask:

  1. Can I remove the Button.SPEED action?
  2. What is the effect of game.set_ticrate(35)?
yhcao6 commented 6 years ago

For the end condition: you set the maximum episode length to 2100, but in your step function:

def step(self, state, sess):
    if not isinstance(sess, tf.Session):
        raise TypeError('TypeError')

    s, game_vars = state
    a_dist, value = sess.run([self.local_AC_network.policy, self.local_AC_network.value], feed_dict={
        self.local_AC_network.inputs: [s],
        self.local_AC_network.game_variables: [game_vars]
    })
    a_index = self.choose_action_index(a_dist[0], deterministic=False)
    if self.play:
        self.env.make_action(self.actions[a_index])
    else:
        self.env.make_action(self.actions[a_index], cfg.SKIP_FRAME_NUM)

    reward = self.reward_function()
    end = self.env.is_episode_finished()

    return reward, value, end, a_index

What do you think about setting end to True under these three conditions (see the sketch after the list):

  1. Health <= 0
  2. Episode length equals the maximum episode length
  3. Ammo2 is 0 (in D3 Battle, can the agent gain ammo?)
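
Something like this (a sketch of what I have in mind, reusing the GameVariable names from your config):

    health = self.env.get_game_variable(GameVariable.HEALTH)
    ammo2 = self.env.get_game_variable(GameVariable.AMMO2)
    # is_episode_finished() already covers death and the 2100-tic limit;
    # the extra checks would end the episode early on zero health or zero ammo.
    end = self.env.is_episode_finished() or health <= 0 or ammo2 <= 0
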
GoingMyWay commented 6 years ago

@yhcao6 The defend_the_center and health pack gathering scenarios can be trained within 15 minutes since they are simple games. I will reply to you later today since I have to read the code related to your question and I am pretty busy now. Sorry about that.

yhcao6 commented 6 years ago

I looked at your action combination function for D3 Battle; here is the code:

def button_combinations():
    actions = []
    m_forward = [[True], [False]]  # move forward
    m_right_left = [[True, False], [False, True], [False, False]]  # move right and move left
    t_right_left = [[True, False], [False, True], [False, False]]  # turn right and turn left
    attack = [[True], [False]]
    speed = [[True], [False]]

    for i in m_forward:
        for j in m_right_left:
            for k in t_right_left:
                for m in attack:
                    for n in speed:
                        actions.append(i+j+k+m+n)
    return actions

You assume the first button is moving forward and the second is moving right, but in fact the second acts as moving backward. I don't know if this is a bug in ViZDoom; could you have a check?

I just had a try at training D3 Battle. Here is the curve (I use 8 workers):

[Screenshot: D3 Battle training curve]

There is a jump; I think the agent learns to kill monsters to gain reward, but later the increase is very slow. I found the agent just turns left and right, that is, it prefers to stay in one room instead of exploring new rooms. If no monster appears, the agent seems not to know what to do.

Is there some good method to mitigate this problem?

Is this a limitation of the A3C algorithm? When the reward is sparse and the observation space is large, it seems difficult for it to converge rapidly.

GoingMyWay commented 6 years ago

@yhcao6

  1. You can remove the Button.SPEED action
  2. It sets the ticrate for ASYNC modes - the number of tics executed per second, as you can see here
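
For illustration (not code from this repository): set_ticrate only has an effect in the ASYNC modes, where the engine runs in real time; 35 tics per second is the normal Doom speed.

    from vizdoom import Mode

    game.set_mode(Mode.ASYNC_PLAYER)  # ticrate only matters in ASYNC modes
    game.set_ticrate(35)              # 35 tics per second = normal game speed
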
GoingMyWay commented 6 years ago

What do you think if I set end to be True in these three conditions: Health <= 0; Episode length equals the maximum episode length; Ammo2 is 0 (in D3 Battle, can the agent gain ammo?)

These are engineering settings for training; if they are useful for training, you can set them. There are many kinds of ammo and the agent can pick ammo up; you can play the game to check this.

GoingMyWay commented 6 years ago

In the code

        game.add_available_button(Button.MOVE_FORWARD)
        game.add_available_button(Button.MOVE_RIGHT)
        game.add_available_button(Button.MOVE_LEFT)
        game.add_available_button(Button.TURN_LEFT)
        game.add_available_button(Button.TURN_RIGHT)
        game.add_available_button(Button.ATTACK)
        game.add_available_button(Button.SPEED)

So, I set the buttons in the order of the code above.

yhcao6 commented 6 years ago

I know you add these buttons, but the second button still acts as moving backward; please have a check.

GoingMyWay commented 6 years ago

There is a jump; I think the agent learns to kill monsters to gain reward, but later the increase is very slow. I found the agent just turns left and right, that is, it prefers to stay in one room instead of exploring new rooms. If no monster appears, the agent seems not to know what to do.

It is very common, as you posted: the agent has no idea of the whole environment, it doesn't explore the scenario, and it seems the agent has no memory of the route.

To improve the performance of this agent, you can read the papers of the competition winners on the homepage; maybe you should search for these papers on Google Scholar.

GoingMyWay commented 6 years ago

@yhcao6

Could you please provide more details on the button issue? Do you mean action[1] is actually the moving backward button?

yhcao6 commented 6 years ago

Yes, I don't know why that is.

GoingMyWay commented 6 years ago

@yhcao6 Hi, I am sorry, there is something wrong with ViZDoom on my Mac, so I can't test your issue right now. I have an idea for testing whether the second button is a moving backward button: you can change the code of the basic example

so that every time, the agent executes action[1]:

r = game.make_action(action[1])

And you can watch the game render to test it.
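
Something along these lines, adapted from the basic example (the config path is a placeholder; only the first three buttons matter for this check):

    from vizdoom import Button, DoomGame

    game = DoomGame()
    game.load_config("scenarios/deathmatch.cfg")  # placeholder path, use your D3 config
    game.set_window_visible(True)
    game.clear_available_buttons()
    game.add_available_button(Button.MOVE_FORWARD)
    game.add_available_button(Button.MOVE_RIGHT)
    game.add_available_button(Button.MOVE_LEFT)
    game.init()

    actions = [[True, False, False],   # action[0]: MOVE_FORWARD
               [False, True, False],   # action[1]: expected to be MOVE_RIGHT
               [False, False, True]]   # action[2]: MOVE_LEFT

    game.new_episode()
    while not game.is_episode_finished():
        # Watch the window: does the agent strafe right or move backward?
        game.make_action(actions[1])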

yhcao6 commented 6 years ago

In fact I have recombined the actions and tuned the reward a little. I want to say the performance is surprising; I only use 7 training processes due to the CSE department limitation, but see the curve:

[Screenshot: D3 Battle training curve after ~3 hours]

I only trained for around 3 hours, and when testing I always choose the action with maximum probability; the result is above 20:

[Screenshot: test episode scores]
GoingMyWay commented 6 years ago

@yhcao6 very good result.

GoingMyWay commented 6 years ago

@yhcao6 Hi, how is your training now?

yhcao6 commented 6 years ago

I introduced an extra distance reward to encourage the agent to move forward instead of turning around in the same place; now it moves all the time. If I set the episode length to be infinite, it seems the agent will never die... I don't think the agent knows that it should try new rooms to seek new enemies; rather, if nothing appears on the screen, it just keeps moving to gain the distance reward.
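
Roughly, the distance reward I added looks like this (a sketch; the scale factor is arbitrary and self.last_x / self.last_y are attributes I keep between steps):

    from vizdoom import GameVariable

    x = self.env.get_game_variable(GameVariable.POSITION_X)
    y = self.env.get_game_variable(GameVariable.POSITION_Y)
    distance_moved = ((x - self.last_x) ** 2 + (y - self.last_y) ** 2) ** 0.5
    self.last_x, self.last_y = x, y
    distance_reward = 0.01 * distance_moved  # small bonus for covering ground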

By the way, I found it is really difficult for the agent to learn that when its ammo is 0 it should pick up ammo, and when its health is low it should pick up a medkit, since the only input is the sensor image. It seems the agent cannot learn that it should pick something up; it only picks things up by accident.