IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Question: batch learning with continuous action spaces (DDPG/TD3) #401

Closed · george-mathews closed this issue 5 years ago

george-mathews commented 5 years ago

Hi team, love the work on Coach! However, I'm trying to implement batch learning for a problem with a continuous action space, and it's not entirely clear whether this can be done.

Ideally I'd like to load an existing experience replay buffer (e.g. from pickle/csv), train for a while in offline mode, then continue training while interacting with an environment. The example notebook covering batch learning is pretty clear, but looking at the code, it seems BatchRLGraphManager only supports DQN and NEC agents.

Is BatchRLGraphManager actually needed to achieve the initial phase of offline learning?

I'm guessing something similar can be put together without BatchRLGraphManager, by loading the memory buffer directly and scheduling the training with BasicRLGraphManager.

Is this approach going to work, or is there something I've overlooked?

Thanks in advance.

gal-leibovich commented 5 years ago

Hi George,

Thanks a lot, I'm really happy to hear that you like Coach!

Batch RL in Coach consists of both training and evaluation from offline data. It is mainly geared toward cases where no environment is available, so we have to make do with the data at hand.

The main problem with using DDPG and TD3 with the BatchRLGraphManager is that both predict deterministic policies (i.e. there is no sampling from a Gaussian distribution at the actor's output layer), so no action probabilities are available, and those are needed by most of the off-policy evaluators. One algorithm that might fit your goal is SAC, which is off-policy, handles continuous actions, and is stochastic, so it should fit your needs. We haven't tested it with the BatchRLGraphManager, though, so I'm sure some adjustments will be needed to make it play nicely.
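
If you do want to try that route, a very rough, untested sketch of the wiring might look like the following, with the SAC agent parameters dropped into the structure of the batch RL tutorial notebook. The class names (SoftActorCriticAgentParameters, CsvDataset), the placeholder dataset path, and the exact BatchRLGraphManager arguments are assumptions to be checked against the Coach version you're using:

```python
# Untested sketch: SAC in the batch RL flow, following the structure of the
# batch RL tutorial notebook. Names below are assumptions for a recent Coach
# version and may need adjusting.
from rl_coach.agents.soft_actor_critic_agent import SoftActorCriticAgentParameters
from rl_coach.base_parameters import VisualizationParameters
from rl_coach.core_types import CsvDataset, EnvironmentEpisodes, EnvironmentSteps, TrainingSteps
from rl_coach.graph_managers.batch_rl_graph_manager import BatchRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters

agent_params = SoftActorCriticAgentParameters()
# Train purely from a stored dataset ('my_dataset.csv' is a placeholder path).
agent_params.memory.load_memory_from_file_path = CsvDataset('my_dataset.csv', is_episodic=True)

schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(10000)
schedule_params.steps_between_evaluation_periods = TrainingSteps(1000)
schedule_params.evaluation_steps = EnvironmentEpisodes(0)  # rely on the off-policy evaluators
schedule_params.heatup_steps = EnvironmentSteps(0)

# With env_params=None the graph manager trains from the dataset only; depending
# on the version it may also need a spaces definition and off-policy-evaluation
# settings, as shown in the tutorial.
graph_manager = BatchRLGraphManager(agent_params=agent_params,
                                    env_params=None,
                                    schedule_params=schedule_params,
                                    vis_params=VisualizationParameters())
```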

If you just want to do the learning (without off-policy evaluation, while evaluating against an environment from time to time), then you might consider using the BasicRLGraphManager, loading the memory from a pickle/csv, and allowing no, or very little, interaction with the environment by adjusting the agent_params.algorithm.num_consecutive_playing_steps and agent_params.algorithm.num_consecutive_training_steps parameters. Setting the first to EnvironmentSteps(0) will make the GraphManager not interact with the environment at all, whereas setting it to EnvironmentSteps(1), for instance, and setting num_consecutive_training_steps to a large number, will make the GraphManager run that many training iterations back to back before interacting with the environment again.
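
For concreteness, here is a rough, untested sketch of that second setup, written as a custom preset. The replay-buffer loading line is an assumption (the value expected by load_memory_from_file_path differs between Coach versions), and the environment and step counts are just placeholders:

```python
# Untested sketch: DDPG with the BasicRLGraphManager, training mostly from a
# pre-loaded replay buffer with little or no environment interaction.
from rl_coach.agents.ddpg_agent import DDPGAgentParameters
from rl_coach.base_parameters import VisualizationParameters
from rl_coach.core_types import EnvironmentEpisodes, EnvironmentSteps, TrainingSteps
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters

agent_params = DDPGAgentParameters()

# Assumption: load a previously saved replay buffer. Depending on the Coach
# version this is a plain file path or a small wrapper object describing the
# stored dataset.
agent_params.memory.load_memory_from_file_path = 'replay_buffer.p'

# Pure offline training: never interact with the environment ...
agent_params.algorithm.num_consecutive_playing_steps = EnvironmentSteps(0)
# ... or mostly offline: take 1 environment step per 1000 training iterations.
# agent_params.algorithm.num_consecutive_playing_steps = EnvironmentSteps(1)
# agent_params.algorithm.num_consecutive_training_steps = 1000

env_params = GymVectorEnvironment(level='Pendulum-v0')  # any continuous-control task

schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(100000)
schedule_params.steps_between_evaluation_periods = TrainingSteps(1000)
schedule_params.evaluation_steps = EnvironmentEpisodes(1)
schedule_params.heatup_steps = EnvironmentSteps(0)

graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params,
                                    vis_params=VisualizationParameters())
```

You could then run this through the coach launcher as a custom preset, or build the graph yourself and call graph_manager.improve().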

george-mathews commented 5 years ago

Thanks Gal, super useful. One last question: will this suggested approach also work with v0.11 (the one AWS currently makes available)? This is a rather old version, and it seems quite a few changes have been made since it came out.

gal-leibovich commented 5 years ago

I think SAC was first officially supported in release 0.12.0.

You can just pip install rl-coach-slim --upgrade to get to 1.0.0. There have been some tiny updates since 1.0.0, but I think nothing that should impact your work.

george-mathews commented 5 years ago

Thanks. All good.