Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

[Possible Bug] Reward doesn't kick in if "done" && "reward" is set in OnCollisionEnter #156

Closed · batu closed this 6 years ago

batu commented 6 years ago

I was using something similar to the following code:

    private void OnCollisionEnter(Collision collision) {
        if (collision.collider.gameObject.tag == "Target") {
            print("Collided with the target");
            reward += 1;
            done = true;
        }
        if (collision.collider.gameObject.tag == "Walls") {
            print("Collided with the wall");
            reward += -1;
            done = true;
        }
    }

I realized that the rewards weren't behaving properly, which makes sense, as this is decoupled from the agent step.

Writing it here since others might stumble into a similar implementation, or it could be mentioned somewhere in the docs.
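One pattern that avoids the decoupling (a sketch, using the same tags as above) is to latch the collision in OnCollisionEnter() and convert it into reward/done inside AgentStep(), so both always change in lockstep with the step cycle:

    // Latch what physics reported; hand out the reward on the next agent step.
    private bool hitTarget = false;
    private bool hitWall = false;

    private void OnCollisionEnter(Collision collision) {
        if (collision.collider.CompareTag("Target")) hitTarget = true;
        if (collision.collider.CompareTag("Walls")) hitWall = true;
    }

    public override void AgentStep(float[] act) {
        if (hitTarget) { reward += 1; done = true; hitTarget = false; }
        if (hitWall) { reward += -1; done = true; hitWall = false; }
        // Movement logic
    }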

cheers, batu

awjuliani commented 6 years ago

Hi @batu,

This is strange behavior. We use OnCollisionEnter for multiple environments to set the reward and done flags, and they behave appropriately (see the Tennis environment, for example). Is there some specific behavior you are finding with rewards, such as never registering, or partially registering?

batu commented 6 years ago

I stopped using this specific implementation, but I will go back and try to replicate the issue. I might have been mistaken; if that is the case, I will just delete the ticket.

crazymuse commented 6 years ago

I also noticed a problem with this implementation. It seems the -1 or +1 reward arrives in the first frame after reset, not with the observation from the step where the collision happened. Is this the expected behavior?

vincentpierre commented 6 years ago

I am trying to reproduce this bug but have not succeeded. I set up an environment in which the agent is a rigidbody thrown at a block along the z axis (it starts at z = 100 and decreases). I have this method in the agent class:

    public void OnCollisionEnter(Collision collision)
    {
        done = true;
        reward = 1;
    }

The environment behaves as expected in this scenario. I observe the following from the Python side:

2 steps before reset:
state:  [0.  0.  4.64177656]
done:   False
reward: -1.0

step right before the environment resets:
state:  [0.  0.  4.365374]
done:   False
reward: -1.0
<-- the object is close enough to trigger collision

step right after reset:
state:  [0.  0.  100.]
done:   True
reward: 1.0

2 steps after reset:
state:  [0.  0.  99.9996]
done:   False
reward: -1.0

This is the expected behavior. In the case of an academy reset, the environment needs user input (a call to env.reset) in order to reset. When an agent resets, we cannot stop execution for the other agents, so we ONLY reset that agent. In the step right after reset, the state is the post-reset state, but the reward and done flag correspond to the consequences of the action taken right before the reset.
I hope this makes sense.
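Here is a self-contained toy loop (hypothetical names, NOT the actual Academy/Brain internals) that reproduces the log pattern above: after an episode ends, the next report pairs the post-reset state with the reward and done flag produced by the action taken right before the reset.

    using System;

    // Toy agent: moves along z toward a block at the origin and "collides" near z = 4.
    class ToyAgent
    {
        public float z = 100f;
        public float reward = 0f;
        public bool done = false;

        public void Act()
        {
            reward = -1f;      // per-step penalty
            z -= 32f;          // move toward the block
            if (z <= 4f)       // collision with the block
            {
                reward = 1f;
                done = true;
            }
        }
    }

    class Program
    {
        static void Main()
        {
            var agent = new ToyAgent();
            for (int step = 0; step < 6; step++)
            {
                // 1. If the last action ended the episode, only this agent is reset;
                //    its reward and done flag are deliberately left untouched.
                if (agent.done) { agent.z = 100f; }

                // 2. Report what Python sees: the new state, but the stale reward/done.
                Console.WriteLine($"state: [0, 0, {agent.z}]  done: {agent.done}  reward: {agent.reward}");

                // 3. Only now are reward and done cleared, before the next action.
                agent.reward = 0f;
                agent.done = false;
                agent.Act();
            }
        }
    }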

batu commented 6 years ago

@vincentpierre I looked further into this, and I think my confusion arises from not fully understanding what the expected behavior is when things are resetting, etc. It is not necessarily related to OnCollisionEnter().

I made this toy case (GIF: cumulative reward): a reward of +1 is added when the sphere touches the green wall, and done is set to true. We see the following in the GIF:

Initially, the cumulative reward is zero, as checked in AgentStep().

The main questions I have are these: (screenshots)

Parameters and code notes:

Academy: (inspector screenshot)

Player: (inspector screenshot)

AgentOnDone is empty.

Here are the code bits I use to increment the rewards and Debug:

    private void OnCollisionEnter(Collision collision) {
        if (collision.collider.gameObject.tag == "Target") {
            reward += 1;
            done = true;
        }
    }

    public override void AgentStep(float[] act) {
        Debug.Log(string.Format("In SteeringAgent.cs AgentStep, the cumulative reward is: {0}", CumulativeReward));
        // Movement logic
    }

    public override void AgentReset() {
        Debug.Log(string.Format("AgentReset called in SteeringAgent.cs, the cumulative reward is: {0}", CumulativeReward));
        // Position / rotation / velocity reset logic
    }

Thank you very much and sorry for the code dump.

vincentpierre commented 6 years ago

@batu OK, help me with this one (edited). In Agent.cs, can you try replacing this block at line 198:

    /// Do not modify : Is used by the brain to reset the agent.
    public void Reset()
    {
        memory = new float[brain.brainParameters.memorySize];
        CumulativeReward = 0f;
        stepCounter = 0;
        AgentReset();
    }

with

    public void Reset()
    {
        memory = new float[brain.brainParameters.memorySize];
        // CumulativeReward = 0f; <-- This is a modification
        stepCounter = 0;
        AgentReset();
    }

And also line 242:

    /// Do not modify : Is used by the brain to reset the Reward.
    public void ResetReward()
    {
        reward = 0;
    }

with

    /// Do not modify : Is used by the brain to reset the Reward.
    public void ResetReward()
    {
        reward = 0;
        CumulativeReward = 0f;
    }

Then tell me if the behavior is closer to what you expect; that would help me. I think it makes the most sense to look at the cumulative reward in the CollectState() function in your agent class, since that is where the information and rewards are collected. Thank you! Also, can you use an external brain and tell me whether you get the expected reward on the Python side? Sorry for the multiple edits.
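A minimal sketch of that suggestion, assuming the v0.2-era Agent API where the agent subclass overrides CollectState() to return its observation list:

    // Inside the agent subclass (requires using System.Collections.Generic;).
    public override List<float> CollectState()
    {
        // CollectState() runs at the point where observations and rewards are
        // gathered, so the value logged here is what the brain will see.
        Debug.Log(string.Format("In CollectState, the cumulative reward is: {0}", CumulativeReward));

        List<float> state = new List<float>();
        state.Add(transform.position.x);  // placeholder observations; replace
        state.Add(transform.position.y);  // with whatever the agent actually
        state.Add(transform.position.z);  // observes
        return state;
    }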

batu commented 6 years ago

Thank you for looking into this @vincentpierre! I now believe there is no "bug", but rather a misunderstanding about what behavior is expected.

This is the behavior BEFORE making the changes: (GIF: before changes)

This is the behavior AFTER making the changes: (GIF: after changes)

The changes make it fit my expectations: the collision happens, the cumulative reward is incremented to 1 for that frame, the agent resets, and the cumulative reward goes back to 0, unlike in my earlier experience.

Here is a link to my project snapshot repo if you want to try replicating it. And here are the results for the training in my slightly modified PPO.py and the actual SteeringAgent.cs

I will look into getting the individual reward in python now.

batu commented 6 years ago

@vincentpierre I was playing around with the environment a bit more and added a negative reward at every step. I realized that in CollectState, reward and cumulative reward seem to be identical. Can you confirm whether this is the expected behavior?

(GIF: reward vs. cumulative reward)

Here is the corresponding debug output: (screenshot)

Here is the snapshot if you want to replicate the issue.

batu commented 6 years ago

@vincentpierre sorry for the many pings, but just as a note, this is not affecting the training. The training is going smoothly on the PPO side.

vincentpierre commented 6 years ago

@batu This is definitely not expected behavior. I am looking into the code; the problem is that CumulativeReward is not used by the external Brain. Revert the changes I told you to make; they do not work at all. I am testing something out. It is kind of a hack, but I think it solves the problem. Try this at line 193:

    /// Do not modify : Is used by the brain to reset the agent.
    public void Reset()
    {
        memory = new float[brain.brainParameters.memorySize];
        stepCounter = 0;
        CumulativeReward = 0f;
        AgentReset();
        CumulativeReward = -reward; // <-- This is the modification
    }

The issue is the order in which the reset process happens:

batu commented 6 years ago

@vincentpierre, in the end, I just ended up tracking my own cumulativeReward (and decrementing the reward at the reset call, similar to what you are suggesting), but overall the training is going smoothly! About to test the curriculum stuff. I will give you feedback on how it goes.
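For reference, a sketch of what that self-tracked cumulative reward could look like (the field name is hypothetical, and the decrement at reset mirrors the hack above, since the terminal step's reward has already been counted by the time AgentReset() runs):

    private float myCumulativeReward = 0f;

    public override void AgentStep(float[] act)
    {
        // Movement logic that sets `reward` goes here.
        myCumulativeReward += reward;
    }

    public override void AgentReset()
    {
        // "Decrement at reset": remove the stale terminal reward so it is not
        // double-counted across the episode boundary (same idea as the hack above).
        myCumulativeReward -= reward;
        Debug.Log(string.Format("Episode cumulative reward: {0}", myCumulativeReward));
        myCumulativeReward = 0f;
        // Position / rotation / velocity reset logic goes here.
    }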

vincentpierre commented 6 years ago

I am sorry to hear you implemented your own, but at least I am glad that training is going smoothly. Could you try to see whether your own implementation of cumulative reward is consistent with the one in Agent.cs with the fix above? That would help. Thank you.

batu commented 6 years ago

@vincentpierre No worries, it is fun to tinker with a tool I use a fair bit. I will implement the changes and get back to you tomorrow.

stalkermustang commented 6 years ago

Hey @awjuliani @vincentpierre, I have a similar problem. Is there any timeline for fixing this in the master repo? Any ideas for solving the problem locally (to train for the Unity ML-Agents Challenge)?

jglazman commented 6 years ago

This could be totally unrelated, but you should be careful to mind the order of events if you set reward and done inside OnCollisionEnter(). Remember that AgentStep() is called from FixedUpdate(), which happens immediately BEFORE Unity executes physics updates.

Just to clarify: looking at the v0.2 code, the order goes like this (when using resetOnDone = true):

  1. Academy.FixedUpdate()
    1. Brain.SendState() // sends reward to the brain
    2. if done then
      1. Reset()
    3. else
      1. agent.reward = 0
      2. agent.done = false
      3. Brain.DecideAction()
      4. AcademyStep()
      5. if !agent.done then
        1. AgentStep()
  2. Agent.OnCollisionEnter()

So you have to be careful not to overwrite your reward and done values in OnCollisionEnter if you are also setting them from AgentStep. Also note that these values aren't processed until the next frame.
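In code, the defensive version of this (a sketch with a hypothetical per-step penalty, not something prescribed by the toolkit) is to always accumulate with += so that a value set in AgentStep() and one set later in the same frame by OnCollisionEnter() both survive:

    public override void AgentStep(float[] act)
    {
        reward += -0.01f;   // hypothetical per-step time penalty
        // Movement logic goes here.
    }

    private void OnCollisionEnter(Collision collision)
    {
        if (collision.collider.CompareTag("Target"))
        {
            reward += 1f;   // adds to, rather than replaces, this frame's reward
            done = true;
        }
    }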

The only part I don't understand is how reward is used when you set done=true, since it appears to be discarded in Reset() and never sent to the brain. [edit: nevermind, in external mode it sends reward to the brain before resetting]

vincentpierre commented 6 years ago

For those still looking into the cumulative reward, a couple of notes: the cumulative reward is not used by Python (it is not even sent); it is just meant to help developers. We are still trying to figure out how to make the cumulative reward as accurate as possible. The method above seems to be working, but we implemented a cleaner version in the development-0.3 branch that consists of modifying the SetCumulativeReward method:

    public void SetCumulativeReward()
    {
        if (!done)
        {
            CumulativeReward += reward;
        }
        else
        {
            CumulativeReward = 0f;
        }
    }

and removing modifications of the CumulativeReward in the Reset() method.

Thank you @jglazman for the clarification on OnCollisionEnter. I suppose that if you set done to true in this method, the agent will reset on the next FixedUpdate, not the current one. This is important to keep in mind; a guard for this case is sketched below.
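Since that reset is deferred, the agent can register further contacts in the meantime. A simple guard (a sketch, reusing the hypothetical "Target" tag from above) is to ignore collisions once done has been set, so nothing is double-counted while waiting for the reset:

    private void OnCollisionEnter(Collision collision)
    {
        if (done) return;   // episode already over; ignore further contacts
        if (collision.collider.CompareTag("Target"))
        {
            reward += 1f;
            done = true;
        }
    }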

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.