Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Different training results when normalizing inputs #5454

Closed kkalera closed 2 years ago

kkalera commented 3 years ago

I've been testing some theories I had and stumbled on something that I don't really understand and that might be a bug. In the docs, it is stated that for the best performance, one should normalize the observations to between -1 and 1 or between 0 and 1. However, doing that is making the agent train slower. More information below:

I'll write more details about the environment, inputs, etc. below, but here are the results: input-results

The environment:

Inputs:

Every training session was done using the same hyperparameters and the same run-seed in the same environment. The model was trained using only a single agent for 500k steps.

Here's what I changed between runs:

note: All normalization was done using the same function, which normalizes between 0 and 1. To get the normalization between -10 and 10, for example, the inputs were normalized between 0 and 1, then multiplied by 20 and subtracted from 10.
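That rescaling can be sketched in Python (helper names are mine, not from the project): values are first mapped to [0, 1], then the [0, 1] value is scaled and subtracted from the target bound, so for [-10, 10] that is `10 - 20 * n`. Note that this form also reverses direction: the minimum input maps to +10 and the maximum to -10.

```python
def normalize(val, lo, hi):
    """Map val from [lo, hi] onto [0, 1], clamping out-of-range values."""
    return min(max((val - lo) / (hi - lo), 0.0), 1.0)

def rescale(val, lo, hi, target):
    """Map val from [lo, hi] onto [-target, target] via the [0, 1] form.

    Mirrors the thread's `target - 2 * target * n` pattern, which maps
    lo -> +target and hi -> -target (the direction is flipped).
    """
    n = normalize(val, lo, hi)
    return target - n * 2 * target

# e.g. rescaling a ball offset in [-3, 3] to the [-10, 10] range:
# rescale(-3, -3, 3, 10) -> 10.0, rescale(3, -3, 3, 10) -> -10.0
```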

Package information:

kkalera commented 3 years ago

Update:

I've tried the same scenario using the ML-Agents Release 18 with Python package 0.27.0 and am getting similar results.

This time I used one of the provided example environments, 3DBall. Below are the results:

3DBall results

While the normalized values between 0 and 1 don't have such a dramatic effect on learning as in my custom environment, there is a clear difference in learning speed.

note: All training runs were performed with the provided config file "config/ppo/3DBall.yaml". The only alteration was in the processing of the inputs.

Processing of the inputs looks like this:

    public override void CollectObservations(VectorSensor sensor)
    {
        if (useVecObs)
        {
            // Baseline: raw, unnormalized observations.
            /*sensor.AddObservation(gameObject.transform.rotation.z);
            sensor.AddObservation(gameObject.transform.rotation.x);
            sensor.AddObservation(ball.transform.position - gameObject.transform.position);
            sensor.AddObservation(m_BallRb.velocity);*/

            // Normalized to [0, 1].
            /*sensor.AddObservation(Normalize(gameObject.transform.rotation.z, 0, 360));
            sensor.AddObservation(Normalize(gameObject.transform.rotation.x, 0, 360));
            sensor.AddObservation(NormalizeVector3(ball.transform.position - gameObject.transform.position, -3, 3));
            sensor.AddObservation(NormalizeVector3(m_BallRb.velocity, -10, 10));*/

            // Normalized to [0, 1], then scaled to [0, 10].
            /*sensor.AddObservation(Normalize(gameObject.transform.rotation.z, 0, 360) * 10);
            sensor.AddObservation(Normalize(gameObject.transform.rotation.x, 0, 360) * 10);
            sensor.AddObservation(NormalizeVector3(ball.transform.position - gameObject.transform.position, -3, 3) * 10);
            sensor.AddObservation(NormalizeVector3(m_BallRb.velocity, -10, 10) * 10);*/

            // Rescaled to [-1, 1] via 1 - 2n.
            /*sensor.AddObservation(1 - Normalize(gameObject.transform.rotation.z, 0, 360) * 2);
            sensor.AddObservation(1 - Normalize(gameObject.transform.rotation.x, 0, 360) * 2);
            sensor.AddObservation(Vector3.one - NormalizeVector3(ball.transform.position - gameObject.transform.position, -3, 3) * 2);
            sensor.AddObservation(Vector3.one - NormalizeVector3(m_BallRb.velocity, -10, 10) * 2);*/

            // Rescaled to [-10, 10] via 10 - 20n (the variant active for this run).
            sensor.AddObservation(10 - Normalize(gameObject.transform.rotation.z, 0, 360) * 20);
            sensor.AddObservation(10 - Normalize(gameObject.transform.rotation.x, 0, 360) * 20);
            sensor.AddObservation(Vector3.one * 10 - NormalizeVector3(ball.transform.position - gameObject.transform.position, -3, 3) * 20);
            sensor.AddObservation(Vector3.one * 10 - NormalizeVector3(m_BallRb.velocity, -10, 10) * 20);
        }
    }

    // Maps val from [min, max] onto [0, 1], clamping out-of-range values.
    public static float Normalize(float val, float min, float max)
    {
        return Mathf.Clamp((val - min) / (max - min), 0, 1);
    }

    // Applies the same [0, 1] mapping to each component of a Vector3.
    public static Vector3 NormalizeVector3(Vector3 val, float min, float max)
    {
        val.x = Mathf.Clamp((val.x - min) / (max - min), 0, 1);
        val.y = Mathf.Clamp((val.y - min) / (max - min), 0, 1);
        val.z = Mathf.Clamp((val.z - min) / (max - min), 0, 1);
        return val;
    }

ervteng commented 3 years ago

This is interesting. I noticed that you clamp everything to between 0 and 1. This will usually cause some issues with data being clipped away. For instance, in 3DBall the observations can be both positive and negative; if you "normalize" between 0 and 1, you can still have negative values, which will be clipped away by the Mathf.Clamp function.
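The clipping effect can be illustrated with a small Python sketch (my own toy example, not ML-Agents code): clamping raw signed observations to [0, 1] discards everything below zero, whereas normalizing with the true min/max of the range first keeps the values distinct.

```python
def normalize(val, lo, hi):
    """Map val from [lo, hi] onto [0, 1], clamping out-of-range values."""
    return min(max((val - lo) / (hi - lo), 0.0), 1.0)

raw = [-3.0, -1.5, 0.0, 1.5, 3.0]  # signed observations, e.g. ball offsets

# Clamping the raw values directly wipes out the negative half:
clipped = [min(max(v, 0.0), 1.0) for v in raw]   # [0.0, 0.0, 0.0, 1.0, 1.0]

# Normalizing with the true range (-3 to 3) first preserves the ordering:
scaled = [normalize(v, -3.0, 3.0) for v in raw]  # [0.0, 0.25, 0.5, 0.75, 1.0]
```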

kkalera commented 3 years ago

Thanks for your reply @ervteng. I see what you're saying. I did normalize the positions using the full range available to the agent, including the negative positions. I checked the maximum positions possible for the ball, which were -3 and 3, and then normalized between those values, so -3 would result in 0 and 3 would result in 1.

Like in the code above: sensor.AddObservation(NormalizeVector3(ball.transform.position - gameObject.transform.position, -3, 3));

I'll redo my tests without clamping and report back with the results.

kkalera commented 3 years ago

I did some retesting. The clamping did seem to have some effect on performance, though it's unclear to me why, since I normalized using the negative values. That said, there is still a performance loss when normalizing the values.

Here are the results: new testing

To be honest, I have no clue what's causing this difference in performance; it might be something inherent to the algorithm. My knowledge of the subject is currently not enough to speculate about why I'm getting these results.

Something that does seem interesting to me: when normalized between -1 and 1, entropy dropped more significantly than the baseline, but training took about 15% longer to converge. Would the lower entropy result in a more stable model?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in the last 28 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

This issue has been automatically closed because it has not had activity in the last 42 days. If this issue is still valid, please ping a maintainer. Thank you for your contributions.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.