danijar / dreamer

Dream to Control: Learning Behaviors by Latent Imagination
https://danijar.com/dreamer
MIT License

How to tune the hyperparameters of new RL algorithms? #12

Closed xlnwel closed 4 years ago

xlnwel commented 4 years ago

Hi,

I've just finished my implementation of Dreamer. May I ask you several questions that have been bothering me for a long time? When you design an agent from scratch and things go south, how can you tell whether the cause is wrong hyperparameters/networks or the idea itself? How do you find the right values when you design an agent like Dreamer, which has tens of hyperparameters? As the simplest example, how do you decide that it's time to tune the network architecture or the loss weights? Currently, I only know random/grid search, but these methods quickly become frustrating for a large project such as Dreamer, which has many hyperparameters and requires a lot of compute to train.
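To show what I mean, here is roughly all I do today (a generic random-search sketch; the parameter names and ranges are made up for illustration, not Dreamer's actual search space):

```python
import random

# Hypothetical search space; the names and ranges are only illustrative.
SEARCH_SPACE = {
    "kl_scale": lambda: 10 ** random.uniform(-2, 0),       # log-uniform in [0.01, 1]
    "deter_size": lambda: random.choice([100, 200, 400]),  # discrete choices
    "actor_lr": lambda: 10 ** random.uniform(-5, -3),      # log-uniform in [1e-5, 1e-3]
}

def sample_config():
    """Draw one random configuration; each draw means a full training run."""
    return {name: draw() for name, draw in SEARCH_SPACE.items()}

for trial in range(3):
    print(trial, sample_config())
```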

I know that these questions are not directly related to your project, but I really hope you can share some of your experience with me. Thanks in advance :-)

danijar commented 4 years ago

That's a good question, and it depends on the project. When developing a completely new RL algorithm, tuning it will likely require a good amount of time and compute. I also think that people often give up on ideas too early because of this. It helps to log a lot of metrics to see what might be going wrong, but it's still a lot of trial and error. For ideas that are compatible with previous algorithms, you can start from an existing implementation and hope that its hyperparameters also work for your modification.

xlnwel commented 4 years ago

Hi, @danijar

In many cases I know the basic meaning of a hyperparameter, but I have no clue when and how to tune it. For example, in this issue you suggested tuning kl_scale and deter_size for Atari games. But what makes you think so?

Another example involves some papers from DeepMind, which prefer RMSprop to Adam and use a different epsilon from the default. I know the underlying mechanism of these optimizers, but I have no idea in which situations one should be preferred over the other. Here are some resources I collected about optimizers, along with some of my own thoughts:

Most RL papers use either RMSprop or Adam as the optimizer. From this discussion, I summarize several cases in which RMSprop may be preferable to Adam:

  1. The effect of momentum on RL objectives is unclear, and RMSprop does not use it (see the sketch after this list).
  2. RMSprop is more stable on non-stationary problems and with RNNs.
  3. RMSprop is more suitable for sparse problems.
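For concreteness, here is how the two setups typically look (a minimal sketch in TensorFlow 2; the learning rate and other values are only illustrative, not taken from any specific paper):

```python
import tensorflow as tf

# RMSprop as configured in many DeepMind RL papers: no momentum term,
# only a running average of squared gradients.
rmsprop = tf.keras.optimizers.RMSprop(
    learning_rate=1e-4, rho=0.99, momentum=0.0, epsilon=1e-5)

# Adam additionally keeps a first-moment (momentum-like) estimate,
# whose benefit on non-stationary RL objectives is less clear.
adam = tf.keras.optimizers.Adam(
    learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-5)
```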

𝝐 is generally chosen from 1e-8 to 1e-4. 𝝐 affects the step size: a large 𝝐 corresponds to a small step size, stable training, and slow training progress. For small problems (e.g., MuJoCo environments), setting 𝝐 to 1e-8 can speed up training and help escape local optima. For large problems, 𝝐 is usually set between 1e-5 and 1 for stable training.
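A toy calculation of Adam's update magnitude (my own sketch, nothing to do with the Dreamer code) shows the effect: with a constant gradient g, the bias-corrected moments give a step of lr·g/(|g| + 𝝐), so a large 𝝐 damps the update heavily.

```python
import numpy as np

lr, g = 1e-4, 1e-3  # learning rate and a small, constant gradient
for eps in [1e-8, 1e-5, 1e-1]:
    m_hat, v_hat = g, g ** 2                # moments for a constant gradient
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    print(f"eps={eps:g}: step ~ {step:.1e}")
# eps=1e-8 gives the full step ~1e-4, while eps=1e-1 shrinks it to ~1e-6.
```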

Do these make sense to you?

danijar commented 4 years ago

There is barely any theory behind this. Most of the time, it's just that people have tuned many parameters and over time have found which of them are the most sensitive for a particular algorithm. I also think Adam tends to work better than RMSProp even for reinforcement learning, but again this is only from experience and from seeing what more recent papers are using.

xlnwel commented 4 years ago

Hi, @danijar

Thanks, I see. Then why would you suggest tuning kl_scale and deter_size for Atari games? Are you indicating that Atari games are more complicated to model, and therefore kl_scale and deter_size should be larger than they are for DeepMind Control?

danijar commented 4 years ago

Yes, deter_size should be larger because there is more for the model to keep track of, and kl_scale should be smaller to allow the model to incorporate more information from each image than in DMC tasks. I've actually run those experiments, so I know that it helps. I will update the repository here at some point, but it's not ready yet.
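Roughly, kl_scale enters the world model objective like this (a simplified sketch, not the exact code in this repo), which is why lowering it lets the posterior keep more information from each image:

```python
import tensorflow as tf
import tensorflow_probability as tfp

def world_model_loss(image_pred, image, post_dist, prior_dist,
                     kl_scale=1.0, free_nats=3.0):
    """Simplified world-model objective: reconstruction plus scaled KL."""
    recon = -tf.reduce_mean(image_pred.log_prob(image))
    kl = tf.reduce_mean(tfp.distributions.kl_divergence(post_dist, prior_dist))
    # A larger kl_scale pulls posterior and prior together but lets the
    # posterior encode less from each image; a smaller value does the opposite.
    return recon + kl_scale * tf.maximum(kl, free_nats)
```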

xlnwel commented 4 years ago

Hi, @danijar

Thanks for your insights. It's unexpected to me that kl_scale should be smaller. I thought it was supposed to be larger, because the actor is trained on imagined features derived from the prior, so the closer the prior is to the posterior, the better the actor should perform. What I overlooked is that, as you said, when kl_scale is larger the posterior loses more information during encoding, which makes it harder for the actor to come up with the right actions. I think there is a tradeoff between these two effects.