alexis-jacq / Pytorch-DPPO

Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286
MIT License

Old policy? #2

Closed Kaixhin closed 7 years ago

Kaixhin commented 7 years ago

Great work! I'm also working on a PPO implementation, but I'm completely missing where π and π_old come from. Here you store the policy output when actually acting on the environment - if you stored this and retrieved it from the memory, wouldn't it be the same as calculating them again in a batch like you do here?

You then construct a new policy, and calculate the new policy output here. I see that it is different because you load weights that have been updated by other processes, but in a synchronous setting, the weights wouldn't have been updated, and hence the policy outputs wouldn't be any different?
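To make the question concrete, here is a toy sketch of what I mean by storing the output at rollout time versus recomputing it in a batch (a hypothetical two-layer categorical policy, not the model from this repo):

```python
import torch
import torch.nn as nn

# Toy two-layer categorical policy, just for illustration.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def act(state):
    """Sample an action and return its log-prob under the current weights."""
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    return action, dist.log_prob(action)

# Option A: store the log-prob at the moment of acting and keep it in memory.
memory = []
for _ in range(8):                        # pretend rollout of 8 steps
    state = torch.randn(4)                # dummy observation
    action, log_prob = act(state)
    memory.append((state, action, log_prob.detach()))

# Option B: recompute the same quantities in one batch afterwards.
states = torch.stack([s for s, _, _ in memory])
actions = torch.stack([a for _, a, _ in memory])
recomputed = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)

# As long as the weights have not changed in between, the two coincide.
stored = torch.stack([lp for _, _, lp in memory])
print(torch.allclose(stored, recomputed))  # True in a synchronous setting
```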

alexis-jacq commented 7 years ago

Thanks! Yes, this is why with synchronous updates the workers have to wait for the update in order to continue the loop of policy updates. Otherwise you're right, the new policy would not be updated. The trick is to find a way to send a message allowing workers to go back to work when the update is done.
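For instance, this is the kind of message I mean (a toy sketch using a torch.multiprocessing Event as the signal; the rollouts and the update are just sleeps standing in for the real work):

```python
import time
import torch.multiprocessing as mp

def worker(update_done):
    """Each worker blocks until the chief signals that the update is finished."""
    for step in range(3):
        time.sleep(0.05)          # ... collect a rollout / send gradients (placeholder) ...
        update_done.wait()        # go back to work only once the update is done

def chief(update_done):
    """The chief performs the update, then 'sends the message' by setting the event."""
    for step in range(3):
        update_done.clear()
        time.sleep(0.1)           # ... apply the batched update here (placeholder) ...
        update_done.set()         # message to the workers: the new policy is ready

if __name__ == '__main__':
    update_done = mp.Event()
    update_done.set()             # no update pending at the start
    procs = [mp.Process(target=worker, args=(update_done,)) for _ in range(4)]
    procs.append(mp.Process(target=chief, args=(update_done,)))
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```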

Kaixhin commented 7 years ago

Sorry, I'm still missing where π and π_old come from when calculating the update, i.e. where does π1 come from below?

π0 -> sync workers collect data -> update via A * π1/π0 ? -> send π1 (π2?) to workers, repeat

As for how to stop the workers running until an update is over, you can use locks, as I did for my async code. Create a lock around something that tells you which stage (collecting data/updating) you are at, and don't run the other stage until you (acquire the lock) and find that the previous stage is over.
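Roughly what I have in mind, as a bare sketch (the rollout and the update are just sleeps standing in for the real work; nothing here is taken from this repo):

```python
import time
import torch.multiprocessing as mp

COLLECTING, UPDATING = 0, 1

def worker(stage, lock):
    """Only collect data while the shared flag says we are in the collecting stage."""
    for _ in range(3):
        while True:
            with lock:                      # acquire the lock to read the stage safely
                if stage.value == COLLECTING:
                    break
            time.sleep(0.01)                # the updater is still busy, wait and retry
        time.sleep(0.05)                    # ... collect one rollout here (placeholder) ...

def updater(stage, lock):
    """Flip the flag, run the (placeholder) update, then let the workers resume."""
    for _ in range(3):
        time.sleep(0.1)                     # pretend the workers have filled the memory
        with lock:
            stage.value = UPDATING
        time.sleep(0.05)                    # ... run the batched update here (placeholder) ...
        with lock:
            stage.value = COLLECTING

if __name__ == '__main__':
    lock = mp.Lock()
    stage = mp.Value('i', COLLECTING)       # the "something that tells you which stage you are at"
    procs = [mp.Process(target=worker, args=(stage, lock)) for _ in range(4)]
    procs.append(mp.Process(target=updater, args=(stage, lock)))
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```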

ethancaballero commented 7 years ago

One issue is that each of your (4) train threads performs the size 10^4 batch update asynchronously with respect to the others. This is normally not an issue when each update consists of 1 rollout, but it could become problematic in your implementation because it uses larger updates based on batches of 10^6 rollouts. My suggestion to remedy this would be to use a single replay memory that is shared across all train threads, and then perform a single update (maybe on a GPU, and with all the forward rollouts paused during this update) when that single shared replay memory contains number_of_rollouts=batch_size.
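Something along these lines (a toy sketch with a shared queue standing in for the replay memory; the rollout and the update are placeholders, not code from this repo):

```python
import torch
import torch.multiprocessing as mp

def collect_rollout():
    """Placeholder rollout: a plain list standing in for (states, actions, rewards, ...)."""
    return torch.randn(5).tolist()

def worker(queue, n_rollouts):
    """Each train process pushes its rollouts into the single shared memory."""
    for _ in range(n_rollouts):
        queue.put(collect_rollout())

def learner(queue, batch_size):
    """One process drains the shared memory and performs one big update at a time."""
    batch = torch.tensor([queue.get() for _ in range(batch_size)])  # blocks until full
    # ... a single (possibly GPU) update on `batch` would go here ...
    print('one update on a batch of shape', tuple(batch.shape))

if __name__ == '__main__':
    batch_size, n_workers = 8, 4
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(queue, batch_size // n_workers))
             for _ in range(n_workers)]
    procs.append(mp.Process(target=learner, args=(queue, batch_size)))
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Pausing the forward rollouts during the update would additionally need something like the stage flag above; I left that out for brevity.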

alexis-jacq commented 7 years ago

@Kaixhin, if I understand the paper correctly, p_old is the policy before the round of updates and remains unchanged during that round, while p_new is the policy after the last update. So you have something more like the scheme below (sketched in code right after it):

p_0 -> workers collect data -> set p_old=p_0 ->
1st update with A * p_0/p_old , get p_1 -> 
2nd update with A * p_1/p_old , get p_2, ->
repeat
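In code, a round of updates against a frozen p_old would look roughly like this (a single-process toy sketch with a categorical policy and a dummy batch; nothing here is taken verbatim from this repo):

```python
import copy
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # toy policy
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Dummy batch standing in for the data collected by the workers under p_0.
states = torch.randn(64, 4)
actions = torch.randint(0, 2, (64,))
advantages = torch.randn(64)

# p_old = p_0: a frozen copy, unchanged for the whole round of updates.
old_policy = copy.deepcopy(policy)
with torch.no_grad():
    old_log_probs = torch.distributions.Categorical(
        logits=old_policy(states)).log_prob(actions)

for k in range(4):   # several updates against the same fixed p_old
    dist = torch.distributions.Categorical(logits=policy(states))
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)   # p_k / p_old
    loss = -(ratio * advantages).mean()                         # surrogate A * p_k / p_old
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                # p_k -> p_{k+1}, while p_old stays fixed
```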

> Create a lock around something that tells you which stage (collecting data/updating) you are at, and don't run the other stage until you (acquire the lock) and find that the previous stage is over.

Thanks, that's a good idea. I have seen this lock trick in some other codebases; I will try something based on it.

@ethancaballero, the 10^6 is a typo, thanks! You are right, it should be exactly the batch size, otherwise we lose the advantage of parallelism. The idea of a shared replay memory looks great, but it would mean one batch loss computation at a time, while the advantage of parallelism is doing n batch loss computations simultaneously.

Kaixhin commented 7 years ago

I suspected that π_old was simply bootstrapped at the first round of updates, so thanks for confirming! Instead of keeping it fixed, what do you think of the following, which only differs from what you wrote from the 3rd update onwards? (A code tweak is sketched after the scheme.)

π_0 -> workers collect data -> set π_old=π_0 ->
1st update with A * π_0/π_old , get π_1 -> 
2nd update with A * π_1/π_0 , get π_2, ->
3rd update with A * π_2/π_1 , get π_3, ->
repeat
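In terms of your toy sketch above, this just means refreshing the reference log-probs after every step instead of keeping them frozen (same hypothetical setup, repeated here so it runs on its own):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # toy policy
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
states, actions = torch.randn(64, 4), torch.randint(0, 2, (64,))        # dummy batch
advantages = torch.randn(64)

with torch.no_grad():   # reference log-probs, initially those of pi_0
    old_log_probs = torch.distributions.Categorical(
        logits=policy(states)).log_prob(actions)

for k in range(4):
    dist = torch.distributions.Categorical(logits=policy(states))
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)   # pi_k / pi_{k-1}
    loss = -(ratio * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():   # refresh the reference: the next step divides by pi_k
        old_log_probs = torch.distributions.Categorical(
            logits=policy(states)).log_prob(actions)
```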

I'm working on the original/non-distributed version of PPO, or at least the version in OpenAI's blog, and basing it on my A2C code. Still thinking of a way to do some nice batched updates for standard A2C, but once I get that done I'll turn my attention to PPO.