Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Reward from training changes when I use --load to continue training #2932

Closed Phong13 closed 4 years ago

Phong13 commented 4 years ago

Describe the bug
If I interrupt training and then attempt to resume using the --load parameter, there is a dip of random size in the mean reward. This dip was not there in version 0.8. It is there in versions 0.10 and 0.11.

The dip seems to happen if I switch between using a built environment with --num-envs=XX and training using the Unity Editor.

To Reproduce
Steps to reproduce the behavior:

1. Download and clean-install ml-agents v0.11.0.
2. Build the 3DBallHard scene.
3. Train it for a couple of minutes using the build:
   python ml-agents\mlagents\trainers\learn.py config\trainer_config.yaml --env="Builds/3DHardBall/Unity Environment.exe" --run-id=3DBallTest --save-freq=5000 --num-envs=16 --train
4. After a few minutes, wait for a "Saved Model" message, then kill the training with [Ctrl-C].
5. Start the learn.py script and --load the model for use with the Unity Editor (I reduced the timeScale to 1 so I can see the behaviour):
   python ml-agents\mlagents\trainers\learn.py config\trainer_config.yaml --run-id=3DBallTest --save-freq=5000 --train --load
6. Let this run for a few minutes. Observe that there is a dip in the reward when training starts. The graph seems to recover from the dip fairly quickly, and the size of the dip seems random. Wait until the console says "Saved Model", then stop Play in the Unity Editor to kill the training.
7. Start the training again using the build with --num-envs and --load:
   python ml-agents\mlagents\trainers\learn.py config\trainer_config.yaml --env="Builds/3DHardBall/Unity Environment.exe" --run-id=3DBallTest --save-freq=5000 --num-envs=16 --train --load
   Observe there is a huge dip this time in the mean reward, although again it recovers fairly quickly. The size of the dip seems random.

These dips did not happen in v0.8.0. In that version of ml-agents, the mean reward would resume in an almost continuous way.

Screenshots
The green lines show where I restarted training with --load. Notice the dip; its depth seems random, and the second dip is very large. [image: mean reward graph, restarts marked with green lines]

ervteng commented 4 years ago

Hi @Phong13, we did change how the number of steps is logged when using --num-envs=XX in 0.9. They're now logged as the total steps across environments, rather than assuming one environment (e.g. if you have 8 environments, one recorded "step" corresponds to 8 environment steps in v0.8, and 1 step in v0.9). This means you'll be recording summaries at a higher frequency and may see dips you wouldn't see in v0.8.
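
For illustration only (this is not the actual trainer code), here's a rough sketch of that bookkeeping difference; num_envs and env_steps are made-up example values:

```python
num_envs = 8          # environments running in parallel (example value)
env_steps = 1000      # steps taken by each environment (example value)

# v0.8: one recorded "step" per simultaneous step of all environments,
# so parallel environments do not inflate the recorded step count.
recorded_v08 = env_steps                # 1000

# v0.9+: steps are totalled across environments.
recorded_v09 = env_steps * num_envs     # 8000

# With the same summary interval (measured in recorded steps), v0.9+ writes
# summaries roughly num_envs times as often, so short-lived dips that were
# averaged into a longer window in v0.8 show up as separate data points.
print(recorded_v08, recorded_v09)
```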

Does the dip happen with --num-envs=1?

Phong13 commented 4 years ago

Thanks for the response. I will try experimenting with --num-envs=1 to see if the dip only happens when switching between --num-envs=1 and --num-envs>1.

Will this affect the training algorithm? Will it confuse PPO if I switch --num-envs and resume training? The training seems to recover from the dip fairly quickly (faster than it would take to re-learn the lost knowledge), but slower than I would expect. I would expect the dip to clear itself in one or two resets of each agent; it seems to take MUCH longer than that to recover from the dip.

The dip seems to happen even if I resume training using the same --num-envs. Here is some example output where I do NOT switch the --num-envs. The reward drops from 11-20 to 5-8 when I resume. The reward stays in the lower range for over 30k steps and counting.

FIRST RUN:

```
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12217743. Time Elapsed: 1253.471 s Mean Reward: 13.209. Std of Reward: 23.204. Training.
INFO:mlagents.envs:Saved Model
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12218743. Time Elapsed: 1274.913 s Mean Reward: 20.419. Std of Reward: 33.086. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12219743. Time Elapsed: 1284.694 s Mean Reward: 11.160. Std of Reward: 20.645. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10095847. Time Elapsed: 1284.711 s Mean Reward: 58.667. Std of Reward: 13.603. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12220744. Time Elapsed: 1305.315 s Mean Reward: 17.581. Std of Reward: 28.757. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12221743. Time Elapsed: 1325.829 s Mean Reward: 14.007. Std of Reward: 26.339. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12222743. Time Elapsed: 1335.527 s Mean Reward: 17.353. Std of Reward: 29.131. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10098847. Time Elapsed: 1335.536 s Mean Reward: 58.433. Std of Reward: 12.810. Training.
INFO:mlagents.envs:Saved Model
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12223743. Time Elapsed: 1356.651 s Mean Reward: 15.037. Std of Reward: 26.640. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12224743. Time Elapsed: 1366.723 s Mean Reward: 16.577. Std of Reward: 27.787. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12225743. Time Elapsed: 1387.484 s Mean Reward: 18.457. Std of Reward: 31.844. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10101847. Time Elapsed: 1387.494 s Mean Reward: 58.037. Std of Reward: 13.871. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12226743. Time Elapsed: 1397.364 s Mean Reward: 16.226. Std of Reward: 29.070. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12227743. Time Elapsed: 1417.817 s Mean Reward: 14.887. Std of Reward: 25.974. Training.
INFO:mlagents.envs:Saved Model
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12228743. Time Elapsed: 1438.794 s Mean Reward: 16.025. Std of Reward: 26.960. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10104847. Time Elapsed: 1438.801 s Mean Reward: 58.154. Std of Reward: 14.002. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12229743. Time Elapsed: 1448.802 s Mean Reward: 17.322. Std of Reward: 31.237. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12230745. Time Elapsed: 1469.318 s Mean Reward: 15.368. Std of Reward: 25.114. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12231743. Time Elapsed: 1479.463 s Mean Reward: 13.405. Std of Reward: 25.155. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10107847. Time Elapsed: 1479.480 s Mean Reward: 56.817. Std of Reward: 14.140. Training.
```

Now I stop [Ctrl-C] and restart with exactly the same command line (using the same --num-envs). Notice that the reward drops from the 11-20 range to the 5-8 range and stays there for many thousands of steps.

```
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12233054. Time Elapsed: 12.496 s Mean Reward: 1.495. Std of Reward: 1.575. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12234054. Time Elapsed: 23.014 s Mean Reward: 5.634. Std of Reward: 19.058. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12235054. Time Elapsed: 47.078 s Mean Reward: 5.837. Std of Reward: 18.223. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10111156. Time Elapsed: 47.090 s Mean Reward: 56.606. Std of Reward: 14.773. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12236054. Time Elapsed: 58.198 s Mean Reward: 7.404. Std of Reward: 25.398. Training.
INFO:mlagents.envs:Saved Model
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12237054. Time Elapsed: 81.132 s Mean Reward: 6.309. Std of Reward: 18.655. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12238054. Time Elapsed: 103.025 s Mean Reward: 6.318. Std of Reward: 22.586. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10114156. Time Elapsed: 103.034 s Mean Reward: 56.810. Std of Reward: 14.473. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12239054. Time Elapsed: 113.836 s Mean Reward: 5.784. Std of Reward: 15.890. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12240055. Time Elapsed: 134.933 s Mean Reward: 5.468. Std of Reward: 20.378. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12241054. Time Elapsed: 155.485 s Mean Reward: 7.066. Std of Reward: 22.630. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10117156. Time Elapsed: 155.500 s Mean Reward: 55.891. Std of Reward: 14.088. Training.
INFO:mlagents.envs:Saved Model
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12242054. Time Elapsed: 177.000 s Mean Reward: 6.768. Std of Reward: 19.976. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12243054. Time Elapsed: 197.970 s Mean Reward: 6.917. Std of Reward: 20.901. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12244054. Time Elapsed: 208.210 s Mean Reward: 7.074. Std of Reward: 21.412. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10120156. Time Elapsed: 208.226 s Mean Reward: 57.065. Std of Reward: 12.698. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12245054. Time Elapsed: 229.102 s Mean Reward: 7.418. Std of Reward: 24.697. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12246054. Time Elapsed: 239.235 s Mean Reward: 8.265. Std of Reward: 24.186. Training.
INFO:mlagents.envs:Saved Model
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12247054. Time Elapsed: 260.504 s Mean Reward: 6.809. Std of Reward: 19.344. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10123156. Time Elapsed: 260.517 s Mean Reward: 57.928. Std of Reward: 12.869. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12248054. Time Elapsed: 281.370 s Mean Reward: 8.070. Std of Reward: 25.212. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12249054. Time Elapsed: 291.562 s Mean Reward: 6.238. Std of Reward: 20.732. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacter: Step: 12250054. Time Elapsed: 312.365 s Mean Reward: 7.452. Std of Reward: 22.609. Training.
INFO:mlagents.trainers: HumanoidWalk31: HumanoidCharacterPhasePrediction: Step: 10126156. Time Elapsed: 312.375 s Mean Reward: 58.415. Std of Reward: 12.721. Training.
```

ervteng commented 4 years ago

This shouldn't affect the training - I think it's either a reporting issue or randomness in the training. I think you're still OK with using these models. I've logged the issue with internal reference MLA-365. Please keep us posted with what happens in your experiments so we can diagnose the issue.

Phong13 commented 4 years ago

I did some experiments. Here are the findings.

I tried switching to the "Walker" example because it takes longer to train and is more similar to my environments. However, the bug does NOT happen with the "Walker" environment. When I stop and restart, the reward resumes perfectly regardless of what I do with --num-envs. I tried using 1 env, 16 envs, and switching multiple times between 1 env and 16 envs. No bug. However, the Walker environment has a very small standard deviation in the reward. I think that might be related.

The bug does happen with the 3DBallHard environment, even when I use 1 environment (although it recovers very quickly). Here is the TensorBoard graph. I didn't change the length of the episode for this run; I just restarted multiple times, always using --num-envs=1. You can see the downward spikes where the reward drops. Notice that the spikes correspond to "short" episode lengths. I think the episode length might be related.

[image: TensorBoard graph, multiple restarts with --num-envs=1]

The bug is worse if I use --num-envs=16. Here is the same graph, but using --num-envs=16 with multiple restarts. I never changed the num-envs. You can see the dips, but they are deeper and clearly take longer to recover from.

[image: the same graph with --num-envs=16, multiple restarts]

My current hypothesis is that this may be related to the structure of the reward in the environments and the "short" episode lengths. In my own environments I start a bunch of agents in random poses. Some of these poses are hard to recover from, and those agents reset many times very quickly with low rewards. The high-reward, successful agents take some time to return their rewards. I wonder if, when --load is used, the first few episodes contain only low rewards.
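
A minimal sketch of that hypothesis (all episode lengths, rewards, and the summary interval below are made-up values, not data from the actual environments): if only the quickly-failing agents have finished an episode by the time the first summary after a restart is written, the reported mean is dragged down even though the policy itself has not changed.

```python
import statistics

# Hypothetical episode outcomes right after a --load restart:
# (episode_length_in_steps, episode_reward) -- example values only.
episodes = [
    (50, 1.0), (60, 0.8), (55, 1.2),         # agents that fail and reset quickly
    (900, 45.0), (950, 52.0), (1000, 60.0),  # successful, long-running agents
]

first_summary_step = 200  # assumed summary interval after the restart

# Only episodes that have already finished contribute to the first summary.
early_rewards = [r for length, r in episodes if length <= first_summary_step]
all_rewards = [r for _, r in episodes]

print("mean reward in the first summary:", statistics.mean(early_rewards))    # 1.0
print("mean reward once long episodes finish:", statistics.mean(all_rewards)) # ~26.7
```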

Phong13 commented 4 years ago

I just ran a test with the 3DBallHard environment using v0.8 and the spikes are far less severe. There are some spikes, but they appear to be limited to the first episode, and then the training fully recovers immediately. This is very different from v0.11, where it feels like there is a significant loss of training progress that needs to be re-learned with each --load.

ervteng commented 4 years ago

The spikes are smaller in v0.8 because of the recording-interval change - they're still there, but averaged over many runs. I'm thinking it has something to do, as you said, with the reward structure, and the fact that all of the 3DBalls start at the same time on initialization (but in subsequent runs they'll randomize), so if the first couple of episodes have low rewards, the first log is low.

In any case, as long as your agents recover after some time, I wouldn't worry too much - it's just an artifact of reporting, and won't affect your model weights.

Phong13 commented 4 years ago

I finally discovered the cause of this!

One of the differences between v0.8 and v0.10 is the sampling file. I discovered that AcademyReset was only being called once in v0.10, whereas in v0.8 it was called frequently. This explains the dramatic drop when I use --load, since the model has over-fitted to my initial random start conditions (which never change until I --load). When I add a sampling file, AcademyReset is called frequently again and the agent generalizes to different random start conditions. When I restart with --load, there is only a very short drop in the reward.

I think you can close this if you like, although you might want to add some example sampling .yaml files for the examples; there are none as of v0.11. Any example that randomizes start conditions in AcademyReset will have this problem: it will over-fit to one set of random start conditions, and if restarted, there will be a huge loss in reward progress that needs to be re-learned.
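
For anyone hitting the same thing, here is a rough sketch of the workaround. The resampling interval, parameter name, and ranges below are illustrative assumptions (not the contents of any file shipped with ML-Agents), and the file name is made up; check your environment's reset parameters before copying them.

```python
# Illustrative workaround sketch: write a minimal sampler config so that
# AcademyReset is triggered periodically during training. All values are
# assumptions for a 3DBall-style environment, not an official config.
sampler_yaml = """\
resampling-interval: 5000   # assumed: re-sample the parameters every 5000 steps

mass:
  sampler-type: uniform
  min_value: 0.5
  max_value: 10
"""

with open("config/3dball_sampler_example.yaml", "w") as f:
    f.write(sampler_yaml)

# Then pass the file to the trainer with the --sampler flag, e.g.:
#   python ml-agents\mlagents\trainers\learn.py config\trainer_config.yaml
#       --sampler=config/3dball_sampler_example.yaml
#       --run-id=3DBallTest --save-freq=5000 --train --load
```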

Phong13 commented 4 years ago

If I can suggest a few easy improvements to help other people avoid this problem:

Add a comment in the source code for the method Academy.AcademyReset mentioning the sampler file, so that anyone who overrides it to randomize start conditions knows they likely need to use the --sampler parameter.

Also, the section on resetting the environment in this documentation page needs to mention the --sampler parameter:

https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Design-Academy.md

It should also have a link to this page about generalization.

https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-Generalized-Reinforcement-Learning-Agents.md

ervteng commented 4 years ago

Hi @Phong13, thanks for the feedback! Just so I understand correctly, you're using a sampling configuration, right? This bug doesn't happen if you're not using --sampler?

We do have one sampler config 3dball_generalize.yaml included with the repo - but you're right it's not obvious that it's a sampler config just from looking at it!

Phong13 commented 4 years ago

Not exactly:

The bug is that without a sampler file, agents will over-fit and the symptoms are as described.

Previously I was using ML-Agents v0.8. In that version the sampler file did not exist. It was not necessary because ML-Agents would automatically call AcademyReset periodically (I think after each batch; the batch was the episode). There was no bug because AcademyReset was being called regularly, which randomized my initial conditions, and the brain would generalize. When I restarted with --load, the reward would recover to its previous level very quickly.

Then I upgraded to v0.10, BUT DID NOT ADD the sampler file (I did not initially understand that it was needed). The bug would happen in this scenario. Without the sampler file, AcademyReset is only called once at the beginning of training, or when there is a switch to a new curriculum stage. There is only one episode per curriculum stage. My agents would train on the same start conditions for many, many hours, and they would over-fit. If I stopped and restarted with --load, AcademyReset would be called and the initial conditions would change. My over-fitted agent would have a hard time with the different start conditions, reward would be low, and training would take a long time to recover. The depth of the loss in reward was random because it depended on how close the new random start conditions were to what had been trained previously. The final trained agent wasn't great because it was over-fitted to one start condition (not generalized). This is the bug.

Now I have added a sampler file, and AcademyReset (new episode) is being called regularly again. My agents are training nicely on a wide variety of start conditions. If I restart with --load, the reward recovers very quickly and resumes as expected.

If there is no sampler file, then there is only one episode per curriculum stage and you will get an over-fitted agent. The symptoms are as described in this bug.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in the last 14 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had activity in the last 28 days. If this issue is still valid, please ping a maintainer. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.