Stable-Baselines-Team / stable-baselines3-contrib

Contrib package for Stable-Baselines3 - Experimental reinforcement learning (RL) code
https://sb3-contrib.readthedocs.io
MIT License
504 stars 175 forks

Implement PPG #26

Open janEbert opened 3 years ago

janEbert commented 3 years ago

As discussed in sb3#346, I'd like to merge an existing implementation of the PPG algorithm.

I'm unsure about the two SDE-related calls here and here; I simply followed what PPO does, calling it before the policy is used for its forward pass (sadly, I haven't gotten around to reading your nice paper yet ;) ). Two other things worth pointing out:

  1. I added a use_paper_parameters flag.
  2. It's possible to initialize the policy weights the same way the paper did. I could put this behind a policy_kwargs entry and overwrite the init_weights method at runtime (a sketch follows below).
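A rough sketch of what I mean for point 2, applied at runtime (the exact gain is an assumption on my part, and model stands for an already-constructed instance):

```python
import numpy as np
from torch import nn


def paper_init(module: nn.Module, gain: float = float(np.sqrt(2))) -> None:
    # Orthogonal weights and zero biases for linear/conv layers; the sqrt(2) gain
    # is an assumption here, not necessarily what the paper used everywhere.
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight, gain=gain)
        if module.bias is not None:
            module.bias.data.fill_(0.0)


# model.policy.apply(paper_init)  # overwrite the default initialization in place
```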

I also changed the line length to be PEP 8-compliant instead of keeping your desired line length of 127; I hope make format resets my changes. Finally, I will add some comments clearly separating the different phases of the algorithm.

Other than that, there are

I can't reproduce the paper results due to space issues on my laptop, sadly. I'll see if I can get to it on the weekend; if it's not possible for me, I hope a helpful someone can finish this for me. I will at least provide a code baseline for this as well.

araffin commented 3 years ago

I'm unsure about the two SDE-related calls here and here; I simply followed what PPO does, calling it before the policy is used for its forward pass (sadly, I haven't gotten around to reading your nice paper yet ;) ).

For now, let's try to have something without SDE.

It's possible to initialize the policy weights the same way the paper did. I could put this behind a policy_kwargs entry and overwrite the init_weights method at runtime.

I need to think about it and we need to see the results. Usually the best place to keep the tuned hyperparams is the RL Zoo.

btw, this

self.policy_kwargs["activation_fn"] = th.nn.Identity

is weird. But I guess this is because you assume PPG is using a CNN policy? Make sure to check the updated SB3 docs that explain feature extraction vs net arch: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html
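Roughly, something like this is enough (net_arch=[] is already the CnnPolicy default; the custom extractor name is hypothetical):

```python
# With an image observation space, the CNN feature extractor already provides the
# non-linearities; net_arch=[] just puts linear action/value heads on top, so forcing
# activation_fn to th.nn.Identity is unnecessary.
policy_kwargs = dict(
    net_arch=[],  # explicit here only for clarity; this is the CnnPolicy default
    # features_extractor_class=ImpalaCNN,  # hypothetical custom extractor
)
# model = PPO("CnnPolicy", env, policy_kwargs=policy_kwargs)  # env: any image-based env
```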

I also changed the line length to be PEP 8-compliant instead of keeping your desired line length of 127; I hope make format resets my changes.

I'm not sure what you did. You should keep the max line length as it is, but yes, make format will reformat your lines if needed.

I can't reproduce the paper results due to space issues on my laptop,

For the evaluation, we obviously need to compare to the original results (I guess it was on CoinRun only), and then I think it would be nice to have some results on classic control problems (simpler to run) and compare them to PPO.

I can also later run some experiments on pybullet/Atari envs.

EDIT: regarding the implementation, it would be better at first to have it inherit from OnPolicyAlgorithm rather than PPO
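i.e. roughly a skeleton like this (illustrative only, not a full implementation):

```python
from stable_baselines3.common.on_policy_algorithm import OnPolicyAlgorithm


class PPG(OnPolicyAlgorithm):
    def train(self) -> None:
        # Policy phase: PPO-style clipped updates on self.rollout_buffer,
        # plus (every N_pi iterations) the auxiliary/distillation phase.
        ...
```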

janEbert commented 3 years ago

For now, let's try to have something without SDE.

Sure, fine. :)

th.nn.Identity is weird. But I guess this is because you assume PPG is using a CNN policy?

That's also part of the paper parameters, and they used an IMPALA encoder, so yeah, I guess that's the assumption. What should I look out for in the docs regarding feature extraction vs net architecture?

I'm not sure what you did [regarding PEP compliance].

That's a max line length of 79, just in case you were actually wondering. :D

For the evaluation, we obviously need to compare to the original results (I guess it was on CoinRun only), and then I think it would be nice to have some results on classic control problems (simpler to run) and compare them to PPO.

Sounds good to compare to PPO. The PPG paper used the whole Procgen suite.

I can also later run some experiments on pybullet/Atari envs.

That would be really great, thank you!

regarding the implementation, it would be better at first to have it inherit from OnPolicyAlgorithm rather than PPO

I absolutely understand why you'd want this; it was actually how I first programmed it as well. However, as PPG heavily reuses PPO, it just leads to a lot of copied code. As the code has already been used in practical experiments, where it outperformed PPO, I'd suggest keeping it the way it is for now. In case the reproduction experiments fail, we could still change the code to the more verbose version for debugging. What do you think?

araffin commented 3 years ago

I absolutely understand why you'd want this; it was actually how I first programmed it as well. However, as PPG heavily reuses PPO, it just leads to a lot of copied code.

This is the contrib repo, so a bit of code duplication is OK (see #16 and TQC, for instance), especially if it makes it clearer to see what is happening.

Sounds good to compare to PPO. The PPG paper used the whole Procgen suite.

In fact, depending on how hard it is to reproduce the results, I would first try to get results on simpler envs (the classic control ones).

That's also part of the paper parameters, and they used an IMPALA encoder, so yeah, I guess that's the assumption. What should I look out for in the docs regarding feature extraction vs net architecture?

I see. I think an Impala CNN feature extractor would be a nice addition. So, when using a CNN, net_arch=[] by default (just a linear layer), which means you don't need to specify a different activation fn.
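A rough sketch of what an Impala-style extractor could look like with the BaseFeaturesExtractor API (channel depths and sizes are from memory of the IMPALA paper, so treat them as assumptions):

```python
import gym
import torch as th
from torch import nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: th.Tensor) -> th.Tensor:
        out = self.conv0(th.relu(x))
        out = self.conv1(th.relu(out))
        return x + out


class ImpalaCNN(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 256):
        super().__init__(observation_space, features_dim)
        in_channels = observation_space.shape[0]  # SB3 transposes image obs to channel-first
        blocks = []
        for out_channels in (16, 32, 32):  # IMPALA depths (assumed)
            blocks += [
                nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
                ResidualBlock(out_channels),
                ResidualBlock(out_channels),
            ]
            in_channels = out_channels
        self.cnn = nn.Sequential(*blocks)
        with th.no_grad():  # infer the flattened size with a dummy forward pass
            sample = th.as_tensor(observation_space.sample()[None]).float()
            n_flatten = self.cnn(sample).flatten(1).shape[1]
        self.linear = nn.Sequential(
            nn.Flatten(), nn.ReLU(), nn.Linear(n_flatten, features_dim), nn.ReLU()
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))
```

It could then be plugged in via policy_kwargs=dict(features_extractor_class=ImpalaCNN, features_extractor_kwargs=dict(features_dim=256)).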

For now, let's try to have something without SDE. Sure, fine. :)

I may add it later depending on the results for continuous actions.

janEbert commented 3 years ago

I'm sorry this is taking so long again. Regarding your comment about code duplication, I thought the issue was mostly that fixes and improvements to PPO would not be incorporated into PPG.

In fact, depending on how hard it is to reproduce the results, I would first try to get results on simpler envs (the classic control ones).

Sounds good.

I see. I think an Impala CNN feature extractor would be a nice addition. So, when using a CNN, net_arch=[] by default (just a linear layer), which means you don't need to specify a different activation fn.

Ah, that's great! I'll remove that line, then.

0xprofessooor commented 3 years ago

Hey guys, cool stuff here! Maybe I've misunderstood, but I'm not entirely sure this implementation of PPG matches what is described in the paper. Again, correct me if I'm wrong, but I don't think n_epochs is the same as the N_pi parameter in the paper: n_epochs defines how many times to use the same samples for training, whereas we want to train the policy N_pi times, each time with a fresh set of environment experiences (sample reuse hurts on-policy!). I'd argue the n_epochs variable instead represents E_pi in the paper, so in essence we're only performing one policy phase update per auxiliary phase (i.e. N_pi = 1).
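To make the distinction concrete, here's a rough sketch of the phase structure as I read the paper (the callables and names are placeholders, not SB3 or repo APIs):

```python
from typing import Callable, List


def ppg_iteration(
    n_pi: int,    # N_pi: policy phases per auxiliary phase, each on FRESH rollouts
    e_pi: int,    # E_pi: policy epochs per rollout (what n_epochs maps to here)
    e_aux: int,   # E_aux: auxiliary epochs over all stored rollouts
    collect_rollout: Callable[[], object],
    policy_update: Callable[[object], None],   # PPO-style clipped update (plus E_V value epochs)
    aux_update: Callable[[List[object]], None],
) -> None:
    """Schematic outline of one PPG iteration; all callables are placeholders."""
    stored_rollouts: List[object] = []
    for _ in range(n_pi):              # policy phase, repeated N_pi times
        rollout = collect_rollout()    # fresh on-policy experience each time
        for _ in range(e_pi):          # sample reuse only within this single rollout
            policy_update(rollout)
        stored_rollouts.append(rollout)
    for _ in range(e_aux):             # auxiliary phase: value distillation over all stored data
        aux_update(stored_rollouts)
```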

I've also had a stab at an implementation, although admittedly I'm having some performance issues with the auxiliary value head in the auxiliary loss function for some reason (maybe something we can work on together, if you agree with what I've said above?). Hopefully it at least illustrates what I mean in principle, though: https://github.com/09tangriro/brain-PUBLIC-VERSION/tree/dev/brain/ppg
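For reference, the auxiliary objective I'm aiming for is roughly the paper's joint loss (auxiliary value regression plus a behavioral-cloning KL term back to the pre-aux-phase policy); a rough sketch, with placeholder tensors/distributions:

```python
import torch as th
from torch.distributions import Distribution, kl_divergence


def aux_phase_loss(
    aux_values: th.Tensor,      # predictions of the auxiliary value head
    value_targets: th.Tensor,   # value targets computed during the policy phase
    old_dist: Distribution,     # policy distribution snapshotted before the aux phase
    new_dist: Distribution,     # current policy distribution
    beta_clone: float = 1.0,
) -> th.Tensor:
    # L_joint = L_aux + beta_clone * KL(pi_old || pi), following the PPG paper
    value_loss = 0.5 * (aux_values - value_targets).pow(2).mean()
    clone_loss = kl_divergence(old_dist, new_dist).mean()
    return value_loss + beta_clone * clone_loss
```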

I also made it so that the string 'MlpPolicy' can be used rather than 'AuxMlpPolicy', and it extends OnPolicyAlgorithm rather than PPO. Let me know your thoughts, and again, super cool work :)

janEbert commented 3 years ago

Hey! Thank you so much for the review and for bringing this issue up! I absolutely agree with everything you said; n_epochs would be more suitable as E_pi = E_V (assuming we don't create a new variable so they can be controlled separately).

Since I haven't put any work into this since opening the issue over a month back, would you like to take over, as you have a working, correct (fingers crossed ;) ) implementation?

0xprofessooor commented 3 years ago

Yeah sure I can start working on a pr 👍

araffin commented 3 years ago

Yeah sure I can start working on a pr 👍

PR to @janEbert's fork or PR to SB3 contrib? (For the latter, make sure to read the contributing guide carefully first.)

0xprofessooor commented 3 years ago

I'll do it for SB3 contrib. I've given it a look over, and for now I'll just create the feature branch until it's ready :)

leleogere commented 2 years ago

Any progress on this algorithm?