Closed MDReptile closed 6 years ago
I'm not totally sure, but I think this might have been because too many agents were getting 0 overall reward for entire runs, which created a divide-by-zero. That's just a guess, but I've since changed the rewards around to make sure agents get either positive or negative rewards in more cases.
The advantage is actually Q(s, a) - V(s), so unless your value function always outputs zero, it won't be zero even if the rewards are all zero. The problem is that what you're dividing by is its std: if all the rewards are zero and they're evaluated on the same state, the std can be zero. However, that would mean your agent doesn't move at all...
I'd suggest printing out the advantage values to see what happened. Also, I personally didn't normalize the advantage and my agent still learned very well.
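For reference, a minimal sketch of both suggestions (assuming NumPy advantages, as in the ml-agents trainer; `normalize_advantages` is a hypothetical helper, not part of ml-agents): print the advantages, and guard the normalization with a small epsilon so a zero std doesn't produce NaNs:

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Normalize advantages, guarding against a zero std.

    If every advantage is identical (std == 0), return the
    mean-centered values instead of dividing by zero.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    print("advantages:", advantages)  # inspect what the trainer sees
    std = advantages.std()
    if std < eps:
        return advantages - advantages.mean()
    return (advantages - advantages.mean()) / std
```

With this guard, an all-zero advantage array comes back as zeros instead of NaNs, so the trainer can keep running while you investigate why the std hit zero in the first place.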
How can I make sure the std doesn't result in 0? Just give a minor reward every so often for being alive?
Yes, but I'm more worried about the fact that it is zero at all, because in theory it shouldn't be if your agent is running correctly.
I can only think of two ways it could be zero:

1. The advantage is evaluated on the same state, and the discounted reward is the same, which means your agent doesn't move.
2. There is only one value in the buffer, which means your agent is done and resets after one frame.

I think you need to check whether it's either of these cases first.
I have a "dynamic agent setup" where I'm testing 10 agents in the scene: the game spawns 10 random characters and assigns each one to an agent.
The characters are on two teams, and they randomly kill some of the other agents' characters, so those agents become "inactive", waiting for the round to end so the game can spawn more characters.
Perhaps this period, where I don't do anything with input and give all-zero floats as the state, causes an issue? You can see this in BrainzAgent.cs near the top of FixedUpdate(): if the agent doesn't have a character assigned, it gets all zero floats. I'm also not rewarding the agent during this time (they are only rewarded or punished while they are alive).
I run the rounds at 60 seconds, if that matters at all.
Not a good idea?
Well, if the agent becomes inactive, just don't train on that agent?
How do I temporarily stop the training on them?
EDIT: This would perhaps solve both my issues: https://github.com/Unity-Technologies/ml-agents/issues/228
Yeah I still hit this warning (which stops my agents from doing anything) every so often.
@kwea123 you said "The advantage is evaluated on the same state, and the discounted reward is the same, which means your agent doesn't move."
Do you mean the agent hasn't moved the character? In testing, the characters do move with the actions as long as they're alive, so I'm not sure what to check. I'm trying small rewards for movement and negative rewards for staying still, so they rarely (if ever) stay at 0, and it still seems to cause this problem sometimes. The only time they don't get rewards (or have actions applied to anything) is when the character associated with the agent has died.
And you said "There is only one value in the buffer, which means your agent is done and resets after one frame."
I'm not sure what I could do to check this either; forgive me, I'm new to ML in general. Do you mean something is resetting the agent when it shouldn't be?
If it helps at all, here is what the agents are doing, with a debug display showing each agent's team/cumulative reward: red and gray are the human team, green is the zombie team, and cyan are dead agents.
I'd even package up the project and send that to you, if you think you could get to the bottom of it!
Yeah, if the character is dead, then because of this:

```csharp
if (cControls == null || ActiveCharacter == null)
{
    // blank info, till respawned
    for (int i = 0; i < 24; i++)
        state.Add(0);
}
```

the states are all zeros until it respawns, so you need to stop training on this agent! (I don't know how, though...) Otherwise it's the situation I mentioned:

> The advantage is evaluated on the same state, and the discounted reward is the same, which means your agent doesn't move.
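One possible Python-side workaround (just a sketch: `active_mask` is hypothetical bookkeeping you would have to record per step yourself; ml-agents does not provide it) is to exclude the inactive-agent steps from the advantage statistics entirely:

```python
import numpy as np

def normalize_active(advantages, active_mask, eps=1e-8):
    """Normalize advantages using only steps where the agent was alive.

    advantages  -- per-step advantage estimates
    active_mask -- True where the agent actually acted (hypothetical
                   bookkeeping; not something ml-agents records for you)
    Inactive steps are left at zero so they contribute no gradient signal.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    mask = np.asarray(active_mask, dtype=bool)
    out = np.zeros_like(advantages)
    active = advantages[mask]
    std = active.std() if active.size else 0.0
    if active.size > 1 and std > eps:
        out[mask] = (active - active.mean()) / std
    return out
```

This way the all-zero "dead" frames can't drag the std to zero, though properly pausing the agent on the Unity side would still be the cleaner fix.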
Also, I doubt the agents will learn correctly, since this is a dynamic environment: agents try to do their best in response to their enemies (and their teammates), so the strategies interact with each other, and PPO doesn't guarantee any convergence in that setting. You need algorithms designed for training multiple agents (multi-agent reinforcement learning, MARL). You can find some information here: https://github.com/LantaoYu/MARL-Papers
And as far as I know, MADDPG (https://arxiv.org/pdf/1706.02275.pdf) is currently the only one with clearly written pseudocode.
Ahh yes, it must have been because I was returning all those zeros. At least I've been able to run it for a couple hundred thousand steps without hitting that warning since returning -1 instead, and I'm happy to see I haven't broken something else.
I'll have to do some more research about ML in general and the stuff you shared to see if I can come up with an alternative way to train them, but I might be getting in a little over my head writing an alternative to the PPO that comes with the examples. Thanks for all the help with everything @kwea123 !
Thanks for reaching out to us. Hopefully you were able to resolve your issue. We are closing this due to inactivity.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Forgive me ahead of time for this huge mess of prototype code that won't format properly here (probably lots of problems)! I've trimmed the code down to what I think is relevant: the agent script, and the controller script that rewards and punishes the agent.
Sometimes, as I'm training, I get this error message in the Jupyter notebook PPO, and then the agents stop working (no longer giving input):
```
c:\UnityProjects\DeepBrainz\python\ppo\trainer.py:163: RuntimeWarning: invalid value encountered in true_divide
  self.training_buffer['advantages'] = (advantages - advantages.mean()) / advantages.std()
```
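For what it's worth, that warning fires when `advantages.std()` is zero, so the division fills the buffer with NaNs; a minimal reproduction of the trainer's normalization line:

```python
import numpy as np

# If every advantage is identical (e.g. all rewards were zero on the
# same repeated state), the std is 0 and the division yields NaN.
advantages = np.zeros(5)
with np.errstate(invalid="ignore"):  # suppress the RuntimeWarning here
    normalized = (advantages - advantages.mean()) / advantages.std()
print(np.isnan(normalized).all())  # True: every entry is NaN
```

Once the advantages are NaN, every subsequent policy update is NaN too, which would explain the agents going dead after the warning appears.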
Here is how I have this agent set up:
```csharp
public class BrainzAgent : Agent
{
    private CharacterControls cControls;
}
```
and the rewards are given through this script for controlling the simple characters:
```csharp
[RequireComponent(typeof(Rigidbody))]
public class CharacterControls : MonoBehaviour
{
    public GameObject WeaponPrefab, BulletPrefab;
    public Transform FirePos;
    public bool PlayerControlled = false;
    public bool HasWeapon = false;
    public bool IsHuman = true;
    private BrainzAgent agent;
}
```
Any ideas what is wrong? I haven't really modified the PPO; I just added a name and set the training steps.