Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Cumulative reward decreased dramatically at some point. #2342

Closed gzrjzcx closed 4 years ago

gzrjzcx commented 5 years ago

Hi,

I am using curriculum training for my agent. At first everything looked fine. However, at some point during training the cumulative reward of my agents dropped significantly, as in the screenshot below:

(Screenshot, 2019-07-25: cumulative reward drops sharply partway through training)

I have 24 environments training together and am using the default PPO training config. I have normalized the rewards to the range [-1, 1] and the observations to [0, 1]. I can't figure out where the problem is. I have tried several times, and every run suffers from the same weird problem. Is it related to the PPO algorithm?
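For context, a minimal Python sketch of the kind of reward/observation normalization described above. The bounds `OBS_MIN`, `OBS_MAX`, and `REWARD_SCALE` are hypothetical placeholders, not values taken from the actual project.

```python
# Minimal normalization sketch; OBS_MIN, OBS_MAX and REWARD_SCALE are
# assumed placeholder bounds, not values from the original environment.
OBS_MIN, OBS_MAX = 0.0, 10.0   # assumed raw observation range
REWARD_SCALE = 5.0             # assumed largest absolute raw reward

def normalize_observation(x: float) -> float:
    """Map a raw observation into [0, 1]."""
    scaled = (x - OBS_MIN) / (OBS_MAX - OBS_MIN)
    return min(max(scaled, 0.0), 1.0)

def normalize_reward(r: float) -> float:
    """Map a raw reward into [-1, 1]."""
    return min(max(r / REWARD_SCALE, -1.0), 1.0)
```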

roboserg commented 5 years ago

I have something similar happening, but in my case it always recovers https://puu.sh/DXQ1M/20af243bdc.png

ScriptBono commented 5 years ago

Which version of ml-agents are you using? I had a similar issue but it got fixed during the most recent update to v0.8.2.

gzrjzcx commented 5 years ago

> Which version of ml-agents are you using? I had a similar issue but it got fixed during the most recent update to v0.8.2.

I am using 0.8.1 now. Which version were you on before?

gzrjzcx commented 5 years ago

> I have something similar happening, but in my case it always recovers https://puu.sh/DXQ1M/20af243bdc.png

I have tried training for longer, and the reward did increase again, but very slowly. You can see that I have already trained for 10M steps, which takes about a whole day... And I am not sure whether, if I train for longer (20M steps), it will exceed its previous level.

(Screenshot, 2019-07-27: cumulative reward slowly recovering after about 10M steps)

harperj commented 5 years ago

Hi @gzrjzcx -- this sort of instability is relatively common with reinforcement learning algorithms (including PPO). You might be able to avoid it by exploring different hyperparameters, though it's very dependent on your environment so it's hard to give specific advice. One thing you might consider is that if you increase the number of parallel environments we don't automatically increase the buffer size.

gzrjzcx commented 5 years ago

Hi @harperj, thanks for your reply. I am a little confused about the meaning of parallel environments. In fact, I have duplicated 24 agents in one scene, just like your example environments. Does that count as parallel environments? Should I change the buffer size for this?

I have also set --num-envs=4 to run concurrent Unity instances. The log shows that the Academy started four times. Are these the parallel envs? Should I change the buffer size for this? And the batch size?

By the way, what is the difference between --num-envs and --num-runs? As I understand it, --num-runs cannot increase performance, right? Because each run is an independent session.

Also, is it possible to continue training from a checkpoint? For example, if my training is interrupted due to a time limit, can I use the --load flag to continue training from the last checkpoint?

harperj commented 5 years ago

Hi @gzrjzcx -- with parallel environments I was specifically referring to --num-envs, which means additional Unity instances running in parallel. That said, the same consideration applies to multiple independent agents running within the same Unity instance/scene, as you mentioned.

You're correct about --num-runs. This feature was simply intended to allow a way to run independent trials of an experiment in parallel for the purpose of measuring how consistent training performance is.

RE: continuing training, yes, the --load flag will continue training from the most recent checkpoint.

gzrjzcx commented 5 years ago

Thanks @harperj. So, both when running multiple agents within one scene and when using parallel environments, I need to adjust the buffer size, right?

harperj commented 5 years ago

Hi @gzrjzcx -- I'm not totally sure what you mean by reset the buffer size exactly -- but here's an example of what I mean:

Buffer size is 1024. Average episode length is 512 steps. Time horizon is 64 steps.

- 1 agent -- ~2 full episodes per buffer (1 agent, 2 episodes)
- 2 agents -- ~2 full episodes per buffer (2 agents, 1 episode each)
- 4 agents -- ~4 half episodes per buffer (2 agents, 0.5 episode each)

You can see how this progresses until you may only have small parts of an episode in each buffer. So there is some relationship between the number of agents you are using and your buffer size.
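To make the arithmetic above concrete, a small Python sketch using the same example numbers (1024-step buffer, 512-step episodes) shows how the experience each agent contributes per buffer shrinks as the agent count grows:

```python
# Example numbers from the comment above: 1024-step buffer, 512-step episodes.
BUFFER_SIZE = 1024
EPISODE_LENGTH = 512

for num_agents in (1, 2, 4, 8):
    steps_per_agent = BUFFER_SIZE / num_agents
    episodes_per_agent = steps_per_agent / EPISODE_LENGTH
    print(f"{num_agents} agent(s): ~{steps_per_agent:.0f} steps "
          f"(~{episodes_per_agent:.2f} episodes) each per buffer")
```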

gzrjzcx commented 5 years ago

Hi @harperj, in terms of your example, does "time horizon is 64 steps" mean that every 64 steps the experiences from those 64 steps are collected into the buffer? Then in the 4-agent case, the 4 agents generate observations at the same time, so once the buffer is full each agent has only contributed 0.5 episode, right? If so, should it be: 4 agents -- ~4 half episodes per buffer (4 agents, 0.5 episode each)?

And in that case, do we need to increase the buffer size to 2048 so that each agent can contribute one full episode by the time the buffer is full?

harperj commented 5 years ago

@gzrjzcx -- you're correct, it should be 4 agents with 0.5 episode each. I would agree that in that case it's probably best to increase your buffer size so that you are seeing experiences from all parts of an episode, though you can really only determine the best parameters via exploration :-)
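A quick check of that suggestion with the same illustrative numbers (4 agents, 512-step episodes): doubling the buffer from 1024 to 2048 restores roughly one full episode per agent per buffer.

```python
EPISODE_LENGTH = 512  # assumed average episode length from the example above
NUM_AGENTS = 4

for buffer_size in (1024, 2048):
    episodes_per_agent = (buffer_size / NUM_AGENTS) / EPISODE_LENGTH
    print(f"buffer_size={buffer_size}: ~{episodes_per_agent:.1f} episode(s) per agent")
```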

gzrjzcx commented 5 years ago

Thanks @harperj , clear now!

Sorry to disturb you again. The PPO algorithm uses a neural network to parameterize the policy and performs policy-gradient updates, right? Is it possible to know the specific structure of the network? For example, DQN consists of several convolutional layers plus a fully connected layer. I am not familiar with TensorFlow, and I have checked the source code, but I'm not sure where the layers are stacked.

And in the LSTM case, if I use the LSTM (i.e. set use_recurrent = true), does that mean I am using an RNN? What is the specific structure of the resulting network? Is it like DRQN, just replacing the fully connected layer with a recurrent layer?

Also, what is the difference between stacked vector observations and an LSTM?

harperj commented 5 years ago

Hi @gzrjzcx -- unfortunately the code is the only full reference for the model at this time. To your specific question, the use_recurrent option adds an LSTM after the input / observation encoding (a CNN in the case of camera observations). I think what you've described with DRQN should be similar to this approach. With PPO we have both the policy and value network, which will each have their own LSTM.

Stacked vector observations mean we're stacking a history of the vector observations before inputting them into the network, but there is no actual recurrent component in the network.
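To illustrate the structure described above, here is a minimal sketch in PyTorch (not the actual ml-agents TensorFlow code; all names and sizes are illustrative): an observation encoder, an optional LSTM when use_recurrent is enabled, and policy/value heads. For brevity the sketch shares one encoder and LSTM, whereas in the description above the policy and value networks each get their own LSTM.

```python
import torch
import torch.nn as nn

class SketchActorCritic(nn.Module):
    """Illustrative only -- not the ml-agents implementation."""

    def __init__(self, obs_size, num_actions, hidden=128, use_recurrent=False):
        super().__init__()
        self.use_recurrent = use_recurrent
        # Vector observations pass through a small fully connected encoder
        # (camera observations would go through a CNN encoder instead).
        self.encoder = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        if use_recurrent:
            # use_recurrent adds an LSTM after the observation encoding.
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)  # action logits
        self.value_head = nn.Linear(hidden, 1)              # state-value estimate

    def forward(self, obs, memory=None):
        x = self.encoder(obs)
        if self.use_recurrent:
            x, memory = self.lstm(x.unsqueeze(1), memory)
            x = x.squeeze(1)
        return self.policy_head(x), self.value_head(x), memory

# Stacked vector observations, by contrast, simply concatenate the last N
# observations into one larger input vector (obs_size * N) with no memory.
```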

cloudjay commented 5 years ago

Hi! I'm not an expert, but maybe your agent is exploring unwanted regions. How about lowering beta to reduce random exploration? By default, UnityML PPO lowers beta from 1e-3 to 1e-5 in models.py.
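In the spirit of that suggestion, a minimal sketch of a linear decay schedule for an entropy coefficient (beta). This is illustrative only, not the actual models.py implementation; the start and end values are just the ones mentioned above.

```python
def decayed_beta(step: int, max_steps: int,
                 beta_start: float = 1e-3, beta_end: float = 1e-5) -> float:
    """Linearly anneal beta from beta_start to beta_end over max_steps."""
    frac = min(step / max_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)

# e.g. decayed_beta(5_000_000, 10_000_000) -> ~5.05e-4 (halfway through training)
```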

harperj commented 4 years ago

Thanks for the discussion. We are closing this issue due to inactivity. Feel free to reopen it if you’d like to continue the discussion.

cloudjay commented 4 years ago

Recently, in my case, I reduced the learning rate and that helped. It's not a real solution, so I'm just adding a comment. :)

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.