Farama-Foundation / ViZDoom

Reinforcement Learning environments based on the 1993 game Doom :godmode:
https://vizdoom.farama.org/

Reward implementation for DeadlyCorridor #441

Closed (juice1000 closed this 2 years ago)

juice1000 commented 4 years ago

Hi guys!

I'm using ViZDoom for my bachelor thesis experiments and trained agents in DeadlyCorridor. I logged the probability of the agent choosing the action "Attack", and it is ridiculously low.

I wanted to ask for any updates regarding the reward implementation. Is the reward really just the delta of the distance between the agent and the goal, plus a penalty for the agent being killed, or does the agent also get:

  1. A negative reward for shooting and not hitting a monster? and/or
  2. A positive reward for shooting down a monster?

A look inside the .wad file didn't give me clear insight into the reward implementation, so what exactly is currently set as the reward?

Miffyli commented 4 years ago

The README on this particular scenario seems to be a bit misleading. Here's what the reward is, based on the .wad and .cfg files:

  • Reward per step is the player's current x coordinate. The player starts at x-coordinate 0, and the end of the corridor is at ~1300.
  • If the player reaches the goal, it gets an additional +1000 reward (on top of the "x-coordinate" reward).
  • If the player dies, it gets a -100 reward (possibly on top of the x-coordinate reward).

Since there is no reward for killing enemies, standard RL algorithms will likely have trouble learning to shoot them (the agent has to learn the chain enemies -> getting killed -> end of episode -> no more delicious reward). This is probably why you do not see much attacking going on. The reward also has to be scaled to something smaller, otherwise learning will be slow or unstable (neural networks do not like big numbers).

If you are allowed to simplify the env, you could use the KILLCOUNT GameVariable to track kills and reward them. I also suggest you divide all rewards from the environment by 1000.
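If you go that route, the idea is something along these lines (a rough, untested sketch with a random agent standing in for yours; the kill bonus and the scale factor are arbitrary values to tune, and the config path depends on your setup):

```python
import random

import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("scenarios/deadly_corridor.cfg")  # adjust the path to your setup
game.set_window_visible(False)
game.init()

n_buttons = game.get_available_buttons_size()
actions = [[int(i == j) for j in range(n_buttons)] for i in range(n_buttons)]  # one-hot actions

KILL_BONUS = 100.0     # arbitrary bonus per kill, tune as needed
REWARD_SCALE = 1000.0  # keep reward magnitudes roughly within [-1, 1]

for _ in range(5):
    game.new_episode()
    last_kills = game.get_game_variable(vzd.GameVariable.KILLCOUNT)
    while not game.is_episode_finished():
        env_reward = game.make_action(random.choice(actions))  # your agent's action goes here

        kills = game.get_game_variable(vzd.GameVariable.KILLCOUNT)
        shaped = (env_reward + KILL_BONUS * (kills - last_kills)) / REWARD_SCALE
        last_kills = kills
        # 'shaped' is what you would feed to the learner instead of env_reward

game.close()
```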

Edit/Sidenote: If everything else fails, and your agent is not able to improve in e.g. the basic or health_gathering scenarios, the issue might be in the algorithm implementation. In that case I recommend testing the environments with known-to-work implementations, e.g. stable-baselines (disclaimer: I am part of the development team of that repo).
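For example, a minimal sanity check with stable-baselines3 and the gymnasium wrapper could look roughly like this (assuming both are installed; the registered env id and observation layout may differ between vizdoom versions):

```python
import gymnasium
from vizdoom import gymnasium_wrapper  # noqa: F401 -- importing registers the Vizdoom* envs
from stable_baselines3 import PPO

# Known-to-work PPO on a simple scenario: if this does not learn either,
# the problem is likely in the setup rather than in your own algorithm.
env = gymnasium.make("VizdoomBasic-v0")
model = PPO("MultiInputPolicy", env, verbose=1)  # observations are a Dict (screen + game variables)
model.learn(total_timesteps=100_000)
```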

juice1000 commented 4 years ago

Hi Miffyli! These insights were really helpful! I'm also very honored to be able to talk to somebody with a higher level of expertise in RL.

I indeed forked some code and created single-CPU A2C, A3C and multi-CPU A2C implementations. Currently my algorithms don't score higher than 500; on average they converge at around 200, which means there is still a lot to improve! I will consider your suggestions for improving the reward setup :)

Acejoy commented 9 months ago

> The README on this particular scenario seems to be a bit misleading. Here's what the reward is, based on the .wad and .cfg files:
>
>   • Reward per step is the player's current x coordinate. The player starts at x-coordinate 0, and the end of the corridor is at ~1300.
>   • If the player reaches the goal, it gets an additional +1000 reward (on top of the "x-coordinate" reward).
>   • If the player dies, it gets a -100 reward (possibly on top of the x-coordinate reward).
>
> Since there is no reward for killing enemies, standard RL algorithms will likely have trouble learning to shoot them (the agent has to learn the chain enemies -> getting killed -> end of episode -> no more delicious reward). This is probably why you do not see much attacking going on. The reward also has to be scaled to something smaller, otherwise learning will be slow or unstable (neural networks do not like big numbers).
>
> If you are allowed to simplify the env, you could use the KILLCOUNT GameVariable to track kills and reward them. I also suggest you divide all rewards from the environment by 1000.
>
> Edit/Sidenote: If everything else fails, and your agent is not able to improve in e.g. the basic or health_gathering scenarios, the issue might be in the algorithm implementation. In that case I recommend testing the environments with known-to-work implementations, e.g. stable-baselines (disclaimer: I am part of the development team of that repo).

Hello, I was trying to train an agent in the deadly_corridor scenario. I tried the following things:

  1. Reward shaping (added rewards for killing adversaries and penalties for taking damage and for wasting ammo); see the sketch after this list.
  2. Curriculum learning (gradually increased the difficulty via the doom_skill variable in the cfg file, from 1 (easiest) to 5 (hardest)), training each level for 480K timesteps.
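The shaping is roughly the following wrapper (a simplified sketch of what I did; the bonus and penalty weights are values I picked by hand, and it assumes the underlying DoomGame is reachable as env.unwrapped.game):

```python
import gymnasium
import vizdoom as vzd

class ShapedReward(gymnasium.Wrapper):
    """Add a bonus for kills and penalties for damage taken and wasted ammo.

    Assumes the wrapped env exposes the underlying DoomGame as env.unwrapped.game."""

    def __init__(self, env, kill_bonus=0.1, damage_penalty=0.01, ammo_penalty=0.01):
        super().__init__(env)
        self.kill_bonus = kill_bonus
        self.damage_penalty = damage_penalty
        self.ammo_penalty = ammo_penalty
        self.last = (0.0, 0.0, 0.0)

    def _counters(self):
        game = self.env.unwrapped.game
        return (
            game.get_game_variable(vzd.GameVariable.KILLCOUNT),
            game.get_game_variable(vzd.GameVariable.DAMAGE_TAKEN),
            game.get_game_variable(vzd.GameVariable.SELECTED_WEAPON_AMMO),
        )

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.last = self._counters()
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        kills, damage, ammo = self._counters()
        last_kills, last_damage, last_ammo = self.last
        reward += self.kill_bonus * (kills - last_kills)
        reward -= self.damage_penalty * (damage - last_damage)
        reward -= self.ammo_penalty * max(0.0, last_ammo - ammo)  # penalise shots fired
        self.last = (kills, damage, ammo)
        return obs, reward, terminated, truncated, info
```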

I also used stable_baselines3 PPO as the policy; a simplified version of the curriculum loop is below.
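The curriculum loop is roughly this (simplified; it reuses the ShapedReward wrapper above, assumes the gymnasium wrapper registers deadly_corridor as VizdoomCorridor-v0, and assumes a doom_skill set after init is applied from the next episode):

```python
import gymnasium
from vizdoom import gymnasium_wrapper  # noqa: F401 -- importing registers the Vizdoom* envs
from stable_baselines3 import PPO

def make_corridor_env(skill: int):
    env = gymnasium.make("VizdoomCorridor-v0")  # assumed env id for deadly_corridor
    env.unwrapped.game.set_doom_skill(skill)    # assumption: takes effect from the next episode
    return ShapedReward(env)                    # shaping wrapper from the sketch above

model = None
for skill in range(1, 6):                       # doom_skill 1 (easiest) .. 5 (hardest)
    env = make_corridor_env(skill)
    if model is None:
        model = PPO("MultiInputPolicy", env, verbose=1)
    else:
        model.set_env(env)                      # keep the learned weights, swap the env
    model.learn(total_timesteps=480_000, reset_num_timesteps=False)
    env.close()
```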

Still, the agent is not learning anything (it does not reach the end of the corridor).

I would appreciate any suggestions you might have.

Thanks.