Closed: KristianLN closed this issue 5 years ago
I can see multiple apps being useful in a distributed computing environment.
@Nanocentury Thank you for commenting! Do you mind elaborating, preferably in relation to the use case mentioned above?
From what I understand, you are correct that each instance of the application in which your NN is training adds some overhead. If you had N computers, then running N instances of the app would make sense to me.
If, like me, you are training on your personal machine, then running a single app with as many environments as you can fit would seem the most efficient.
However, I can imagine there are some environment designs that are not well suited to running concurrently within the same application, for instance if you were teaching your NN to work with your app's local file system.
I see... I think you are absolutely right. Still, it would be interesting to know whether there is a difference between using --num-envs and multiple training areas within one application, or whether the ability to run multiple environments can speed up the process even more.
@KristianLN there isn't one right answer for this. In many games, it's challenging (or impossible) to create multiple training areas, hence the --num-envs feature. But generally, if you're on a CPU-limited single machine, creating multiple training areas will be more efficient. Furthermore, adding more envs or training areas when your computer is already running at full throttle will just slow all of them down and will have little benefit.
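For reference, what --num-envs does (as far as I understand) is launch that many copies of the built executable as separate OS processes. With the low-level mlagents_envs Python API, the equivalent would look roughly like the sketch below; the build name is made up, and this is only an illustration, not the actual trainer code:

```python
from mlagents_envs.environment import UnityEnvironment

# Launch three separate copies of the build, roughly what --num-envs=3 does.
# Each copy is its own Unity process, so engine/physics startup cost is paid per instance.
envs = []
for i in range(3):
    env = UnityEnvironment(
        file_name="SoccerBuild",  # hypothetical build name
        worker_id=i,              # distinct worker_id -> distinct communication port
        no_graphics=True,
    )
    env.reset()
    envs.append(env)

# ... experience from all instances would feed the same trainer/policy ...

for env in envs:
    env.close()
```

Extra training areas inside a single build avoid that per-process cost, which is why they tend to be cheaper on a single machine.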
@ervteng I am still not sure how to determine the right number of training areas within one Unity scene. My i7 6700K with 4 cores / 8 HT is not maxing out even at 15 areas, but I don't see it training faster than with 9 areas. It seems something else is the bottleneck (a quick way to check is sketched after the two questions below).
1. Should I just test different numbers of training areas and see which one converges faster? For example, in the picture below the orange line is 15 areas vs. the grey line with 9 areas. How come orange is lagging even though it has 60% more training areas?
2. Will switching from 4 cores / 8 HT to 12 cores / 24 HT (AMD Ryzen) speed up the training? In other words, if I run several instances of the Unity scene, will they all run in parallel on the 12 cores, so that I can run three times as many training areas in parallel as I can now with 4 cores?
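One way to check whether the CPU is really the limit while a run is in progress is to sample per-core load (a rough sketch; it assumes psutil is installed, but any system monitor gives the same information):

```python
import psutil  # assumed available: pip install psutil

# Sample per-core load for ten seconds while training runs, to see whether
# the CPU overall, or a single saturated core, is the actual bottleneck.
for _ in range(10):
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    print(f"per-core load: {per_core}  average: {sum(per_core) / len(per_core):.0f}%")
```

If no core is pinned at 100%, the bottleneck is likely elsewhere (e.g. the Unity/Python communication or the trainer itself).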
@ervteng Thank you so much for confirming my suspicion! As I remember from the test I did, my computer did not max out with multiple areas within one application, compared to running with --num-envs > 1.
@roboserg First of all, remember that the benefit of additional training areas is not linear in the number of areas, as noted in this article. Furthermore, and I could be wrong, but I think what you see is the feedback-loop effect, which is increasingly likely to affect your training, especially early on, as you increase the number of training areas. The reason one could suspect that to be the cause is that, from the sparse information in the graph, it looks like you get smoother training with 15 training areas. How does it look with fewer training environments? Let's say 3?
@KristianLN 3 and 6 areas perform very poorly, as expected. 12 and 9 still rise faster than 15. The environment is soccer without the defender: https://puu.sh/DWmZf/d11e8e0b33.mp4 Rewards: 0.01 for touching the ball, 1 for scoring, plus a small reward every time step if the ball rolls towards the net and if the agent moves towards the ball.
What do you mean by the feedback-loop effect in the context of reinforcement learning?
PS: I have no clue why the blue graph with 12 areas suddenly dropped to 0. I didn't change anything during the training. I know DQN suffers from catastrophic forgetting, but I didn't think PPO does as well?
Hmm... Does anything result in the agent receiving a penalty? Furthermore, have you specified a maximum number of steps the agent is allowed to take?
If your environment contains a certain degree of complexity and/or randomness, and there is no threshold (or too high a threshold) to prevent the agent from taking unfavorable paths, you can end up with some agents searching forever while others continuously improve. The agents that search forever will not benefit from the parallel environments until they are reset, which amounts to a lack of feedback and can make the overall training noisy. The lack of feedback from some agents effectively reduces the quality of the shared experience.
Yeah, I forgot to say the max steps is 3000, so it's time-limited and the agent has to score as much as possible. Also, each time step the agent receives a small negative reward (-1/3000).
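Spelled out, that reward scheme amounts to something like the sketch below (illustrative Python pseudocode, not the actual Unity/C# agent code; the shaping scales and function names are made up):

```python
MAX_STEPS = 3000

def step_reward(touched_ball, scored, ball_speed_towards_goal, agent_speed_towards_ball):
    """Illustrative sketch of the reward scheme described above."""
    reward = -1.0 / MAX_STEPS                   # small existential penalty every step
    if touched_ball:
        reward += 0.01                          # touching the ball
    if scored:
        reward += 1.0                           # scoring
    reward += 0.001 * ball_speed_towards_goal   # shaping: ball rolling towards the net (scale assumed)
    reward += 0.001 * agent_speed_towards_ball  # shaping: agent moving towards the ball (scale assumed)
    return reward
```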
Okay, that makes the drop (at 12) even stranger. Unfortunately, I do not have a good guess as to why the difference is so significant between the 6- and 9-area runs. Have you tried running each number of parallel environments over multiple runs, to rule out randomness? Perhaps just 9 and 6 as a start, to limit the number of curves on the graph.
I will stick with 12 areas for now, since each experiment takes around 1 hour and I spent the whole day yesterday trying out different setups. I think the sudden drop for 12 is some kind of anomaly.
But I still don't understand why 15 areas would not rise faster. For the same number of steps, it gathers more experience with 15 areas, so it should learn faster.
I'll close this issue, as I got my question answered. Thank you @ervteng !
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hi,
After reading this blog post, I came to wonder whether there is a difference, and what it is, between using --num-envs and multiple training areas within one application.
In the following, I'm assuming training is done using curriculum learning.
As I see it, assuming the initialization, and the changing complexity, of the environments is handled correctly, there should be no difference, except less drag on the CPU, because only one application is launched when running with multiple training areas within one application.
The environments: if initialization and changes to the multiple environments within one application are done randomly/independently of each other, the results should be equivalent to running with --num-envs, right?
Either way, the agents are linked to one brain and gather an accumulated pool of experience, resulting in faster learning and therefore less training time, right?
The reason I'm asking is that I have limited computational resources available, and I'm looking for ways to improve my training. My CPU runs at 100% with --num-envs = 2-3, depending on the specific configuration of the training session, and I don't know how much further to push it.
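To make the pooled-experience point concrete: from the trainer's perspective, experience from many areas in one application and experience from several application instances should be interchangeable, as long as each area/instance is initialized independently. A toy sketch (plain NumPy, nothing ML-Agents specific; all names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def collect(num_parallel, steps_per_env):
    """Pretend each parallel area/instance yields (observation, reward) pairs
    that all land in the same buffer feeding a single policy update."""
    buffer = []
    for _ in range(num_parallel):
        for _ in range(steps_per_env):
            obs = rng.normal(size=4)   # stand-in for an agent observation
            reward = rng.normal()      # stand-in for a reward signal
            buffer.append((obs, reward))
    return buffer

# Same total amount of experience per update, whether it comes from
# one application with 12 areas or four applications with 3 areas each.
assert len(collect(12, 100)) == 4 * len(collect(3, 100))
```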