Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

trainer.py:163: RuntimeWarning: invalid value encountered in true_divide #230

Closed MDReptile closed 6 years ago

MDReptile commented 6 years ago

Forgive me ahead of time for this huge mess of prototype code that won't format properly here (probably lots of problems)! I've chopped the code down to what I think is relevant here, which is the agent script and the controller script where it rewards and punishes the agent.

Sometimes I get this warning in the PPO Jupyter notebook while training, and then the agents stop working (no longer receive input):

c:\UnityProjects\DeepBrainz\python\ppo\trainer.py:163: RuntimeWarning: invalid value encountered in true_divide
  self.training_buffer['advantages'] = (advantages - advantages.mean()) / advantages.std()

Here is how I have this agent set up:


public class BrainzAgent : Agent
{
    private CharacterControls cControls;

public bool actionLeft = false, actionRight = false, 
    actionForward = false, actionBack = false, actionAttack = false;

public GameObject ActiveCharacter;

public override void InitializeAgent()
{
    if(ActiveCharacter != null)
        cControls = ActiveCharacter.GetComponent<CharacterControls>();
}

List<float> state = new List<float>();

// one-hots
public bool RayL1Hit, ... RayR5Hit;
public float RayL1Dist, ... RayR5Dist;

public override List<float> CollectState()
{
    // 13 + 11 = 24 total state vars
    state.Clear();

    if (cControls == null || ActiveCharacter == null)
    {
        // blank info, till respawned
        for(int i = 0; i < 24; i++)
            state.Add(0);

        if (ActiveCharacter != null)
        {
            cControls = ActiveCharacter.GetComponent<CharacterControls>();
            done = true; // cause reset
        }
    }
    else
    {
        if (cControls.IsHuman) { state.Add(1); }
        else { state.Add(0); }
        if (cControls.HasWeapon) { state.Add(1); }
        else { state.Add(0); }
        if (RayL1Hit) { state.Add(1); }
        else { state.Add(0); }
       // ..... cut for space
        if (RayR5Hit) { state.Add(1); }
        else { state.Add(0); }

        state.Add(RayL1Dist);
        // ..... cut for space
        state.Add(RayR5Dist);
    }

    return state;
}

public override void AgentStep(float[] action)
{
    switch ((int)action[0])
    {
        case 0: // do nothing
            actionRight = false;
            actionLeft = false;
            actionForward = false;
            actionBack = false;
            actionAttack = false;
            break;
         //    .. unrelated variants of above and below
        case 5: // attack
            actionRight = false;
            actionLeft = false;
            actionForward = false;
            actionBack = false;
            actionAttack = true;
            break;
        default:
            return;
    }
}

public void Fail(bool zombieDeath)
{
    //Debug.Log("Fail");
    reward -= 1f; // failure
    if(zombieDeath) // character is completely dead
        ActiveCharacter = null;
    done = true; // reset will be called
}

public void GiveReward(int points)
{
    //Debug.Log("success");
    reward += 0.1f * points; // success
}

public override void AgentReset()
{
    // called after done becomes true
    //Debug.Log("Reset Agent");
}

}


and the rewards are given through this script for controlling the simple characters:


[RequireComponent(typeof(Rigidbody))]
public class CharacterControls : MonoBehaviour
{
    public GameObject WeaponPrefab, BulletPrefab;
    public Transform FirePos;
    public bool PlayerControlled = false;
    public bool HasWeapon = false;
    public bool IsHuman = true;
    private BrainzAgent agent;

void FixedUpdate()
{
    if(agent != null)// AI controlled
    {
        // RAYCASTING ----------------------

        // INPUT ----------------------------
        // use input from agent
        if (agent.actionForward)
            rb.AddForce(transform.forward * movePower);
        else if (agent.actionBack)
            rb.AddForce(-transform.forward * movePower);

        if (agent.actionLeft)
            transform.Rotate(new Vector3(0, -turnRate, 0));
        else if (agent.actionRight)
            transform.Rotate(new Vector3(0, turnRate, 0));

        if (IsHuman && HasWeapon)
        {
            if (agent.actionAttack)
            {
                AttemptFireWeapon();
            }
        }
        else if (!IsHuman) // AI controlled zombie
        {
            if (agent.actionAttack)
            {
                AttemptZombieMelee();
            }
        }
    }

    if (keepUpright)
        KeepUpright();
}

public void HitEnemy()
{
    // hit an enemy with bullet
    if(!PlayerControlled)
        agent.GiveReward(10);
}

public void SetAgent(BrainzAgent a) // suspended agent is assigned to newly created character
{
    agent = a;
}

float lastFireTime, fireDelay = 0.5f;

void AttemptFireWeapon()
{
    if (!HasWeapon)
    {
        Debug.LogError("AttemptFireWeapon called with HasWeapon false!");
        return; // don't fire without a weapon
    }

    if (Time.time - lastFireTime > fireDelay) // if enough time has passed since last shot
    {
        GetOrCreatePoolBullet(transform.forward);
        lastFireTime = Time.time;
    }
}

void AttemptZombieMelee()
{
    if (!IsHuman && Time.time - lastFireTime > fireDelay) // if enough time has passed since last attack
    {
        MeleeAttack();
        lastFireTime = Time.time;
    }
}

void MeleeAttack()
{
    Ray shotRay = new Ray(FirePos.position, transform.forward);
    RaycastHit hitInfo = new RaycastHit();
    if (Physics.Raycast(shotRay, out hitInfo, 1.5f))
    {
        if (hitInfo.collider.CompareTag("Human"))
        {
            // hit a human
            CharacterControls humanHit = hitInfo.collider.gameObject.GetComponent<CharacterControls>();
            // damage or kill the human
            humanHit.TakeHit(Random.Range(1, 3)); // take 1 or 2 damage (3 exclusive)
            agent.GiveReward(10);
        }
    }
}

public void TakeHit(int damage)
{
    lifeLeft -= damage;
    if (lifeLeft <= 0)
    {
        // dead character
        Die();
    }
}

void Die()
{
    // if human become zombie, if zombie, die completely
    if (IsHuman)
        GameMaster.Instance.CharactersH.Remove(this.gameObject);
    else
        GameMaster.Instance.CharactersZ.Remove(this.gameObject);

    if (IsHuman)
    {
        RemovePoolBullets();
        if (PlayerControlled)
        {
            BecomeZombie();
        }
        else
        {
            agent.Fail(false);
            BecomeZombie();
        }
        GameMaster.Instance.CharactersZ.Add(this.gameObject);
    }
    else // already zombie
    {
        if (!PlayerControlled)
            agent.Fail(true);
        keepUpright = false;
        Splode(); // kill the character and suspend the agent temporarily
    }
}
// ... non-related code

}


Any ideas what is wrong? I haven't modified the PPO really, just added a name and set the training steps.

MDReptile commented 6 years ago

I'm not totally sure, but I think this might have been because too many agents were getting 0 overall reward for entire runs, and that created a divide by zero. That's just a guess, though; I've since changed the rewards around to make sure agents get either positive or negative rewards in more cases.

kwea123 commented 6 years ago

The advantage is actually Q(s, a) - V(s), so unless your value function always outputs zero, it won't be zero even if the rewards are all zeros. The problem is that you're dividing by its std, so if all the rewards are zero and they're evaluated on the same state, the std could be zero. But that would mean your agent doesn't move at all...
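Concretely, the failure mode is easy to reproduce with numpy (the all-zero buffer below is a hypothetical stand-in for an idle agent's data, not the trainer's actual buffer):

```python
import numpy as np

# Stand-in for the trainer's advantage buffer: an agent that never moves
# and never gets a reward can end up with identical advantage values.
advantages = np.zeros(8, dtype=np.float32)

# The same normalization as ppo/trainer.py line 163: std() is 0, so the
# division emits "RuntimeWarning: invalid value encountered in true_divide"
# and fills the buffer with NaNs, which then poison the policy update.
normalized = (advantages - advantages.mean()) / advantages.std()

print(np.isnan(normalized).all())  # True: every entry is NaN
```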

I'd suggest printing out the advantage values to see what happened. Also, personally I didn't normalize the advantages and the agent still learned very well.
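A sketch of what that could look like (illustrative only; this helper is not part of the toolkit, and the epsilon is an assumed value):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Normalize advantages, guarding against a zero (or near-zero) std.

    Hypothetical workaround, not the toolkit's code: when the buffer is
    constant, skip the divide instead of producing NaNs. Dropping the
    normalization entirely, as suggested above, also works.
    """
    std = advantages.std()
    if std < eps:
        return advantages - advantages.mean()  # centered, but no divide
    return (advantages - advantages.mean()) / std

print(normalize_advantages(np.zeros(4)))           # [0. 0. 0. 0.], no warning
print(normalize_advantages(np.array([1.0, 3.0])))  # [-1.  1.]
```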

MDReptile commented 6 years ago

How can I make sure the std doesn't result in 0? Just give a minor reward every so often for being alive?

kwea123 commented 6 years ago

Yes, but I'm more worried about the fact that it is zero, because theoretically it shouldn't be if your agent runs correctly.

I can only think of two ways that it's zero:

  1. The advantage is evaluated on the same state, and the discounted reward is the same, which means your agent doesn't move.
  2. There is only one value in the buffer, which means your agent is done and resets after one frame.

I think you need to check whether it's either of these cases first.
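To tell those two cases apart, you could dump each agent's buffer and run a quick check. A sketch, assuming you can get the stacked states and advantages as arrays (`diagnose` and its messages are made up for illustration, not the toolkit's API):

```python
import numpy as np

def diagnose(states, advantages):
    # Case 2: the agent is done and resets after one frame, so the
    # buffer holds at most a single entry.
    if len(advantages) <= 1:
        return "case 2: buffer has <= 1 entry (agent done/reset every frame?)"
    # Case 1: every observation in the buffer is identical, e.g. an
    # inactive agent that keeps returning 24 zeros.
    if np.allclose(np.std(states, axis=0), 0):
        return "case 1: identical states (agent idle or fed all zeros)"
    # Either way, a constant advantage buffer makes the normalization
    # divide by zero.
    if advantages.std() == 0:
        return "constant advantages: normalization will divide by zero"
    return "buffer looks fine"

# An idle agent that returns 24 zeros per step trips case 1:
print(diagnose(np.zeros((10, 24)), np.zeros(10)))
```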

MDReptile commented 6 years ago

I have a "dynamic agent setup" where I'm testing 10 agents in the scene, and the game spawns 10 random characters and assigns each to an agent.

The characters are on two teams, and they randomly kill some of the other agents' characters, so those agents become "inactive", waiting for the round to end so the game can spawn more characters.

Perhaps this time where I don't apply any input, and give all 0 floats as state, causes an issue? You can see near the top of CollectState() in BrainzAgent.cs that if the agent doesn't have a character assigned, it gets all 0 floats. I'm also not rewarding the agent during this time (they are rewarded or punished only while they are alive).

I run the rounds at 60 seconds, if that matters at all.

Not a good idea?

kwea123 commented 6 years ago

Well, if the agent becomes inactive, just don't train on that agent?

MDReptile commented 6 years ago

How do I stop the training temporarily on them?

EDIT: Would basically solve both my issues perhaps: https://github.com/Unity-Technologies/ml-agents/issues/228

MDReptile commented 6 years ago

Yeah I still hit this warning (which stops my agents from doing anything) every so often.

@kwea123 you said "The advantage is evaluated on the same state, and the discounted reward is the same, which means your agent doesn't move."

By this do you mean the agent hasn't moved the character? In testing, the characters move in response to the actions as long as they're alive, so I'm not sure what to check. I'm trying small rewards for movement and negative rewards for staying still, so they rarely stay at 0, and it still seems to cause this problem sometimes. The only time they don't get rewards (or have actions applied to anything) is when the character associated with the agent has died.

And you said "There is only one value in the buffer, which means your agent is done and resets after one frame."

I'm not sure how to check this either; forgive me, I'm new to ML in general. Do you mean something is resetting the agent when it shouldn't be?

If it helps at all, here is what the agents are doing, with a debug display showing each agent's team/cumulative reward: red and gray are the human team, green is the zombie team, and cyan are dead agents.

I'd even package up the project and send that to you, if you think you could get to the bottom of it!

kwea123 commented 6 years ago

Yeah if the character is dead, then

if (cControls == null || ActiveCharacter == null)
{
    // blank info, till respawned
    for (int i = 0; i < 24; i++)
        state.Add(0);

The states are all zeros until it respawns; you need to stop training on this agent! (I don't know how, though.) Otherwise it's the situation I mentioned:

The advantage is evaluated on the same state, and the discounted reward is the same, which means your agent doesn't move.

I also doubt the agents will learn correctly, since this is a dynamic environment: each agent tries to do its best in response to its enemies (and teammates), so the strategies interact with each other, and PPO doesn't guarantee any convergence in that setting. You need algorithms that allow training on multiple agents (multi-agent reinforcement learning, MARL). You can find some information here: https://github.com/LantaoYu/MARL-Papers

And as far as I know, MADDPG https://arxiv.org/pdf/1706.02275.pdf is currently the only one with clearly written pseudocode.

MDReptile commented 6 years ago

Ah yes, it looks like it was because I was returning all those zeroes. After returning -1 instead, I've been able to run it for a couple hundred thousand steps without hitting that warning. I'm happy to see I haven't broken something else.

I'll have to do some more research on ML in general and the material you shared to see if I can come up with an alternative way to train them, but I might be getting in a little over my head writing an alternative to the PPO that comes with the examples. Thanks for all the help, @kwea123!

vladimiroster commented 6 years ago

Thanks for reaching out to us. Hopefully you were able to resolve your issue. We are closing this due to inactivity.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.