Sohojoe / MarathonEnvsBaselines

Experimental - using OpenAI baselines with MarathonEnvs (ML-Agents)
Apache License 2.0

Bad performance with PPO stable-baselines #6

Open araffin opened 5 years ago

araffin commented 5 years ago

Hello, we recently fixed a bug in the PPO2 implementation that should close the observed performance gap ;) So I recommend updating to the latest version. By the way, I'm quite interested in your benchmark results if you run the same tests again.

See https://github.com/hill-a/stable-baselines/issues/75 Fixed in: https://github.com/hill-a/stable-baselines/pull/76
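For a quick sanity check after upgrading (pip install -U stable-baselines), a minimal PPO2 run like the sketch below should be enough; Pendulum-v0 is only a stand-in here for a MarathonEnvs executable, and the hyperparameters are just defaults.

```python
# Minimal PPO2 sanity check after upgrading stable-baselines.
# Pendulum-v0 is a stand-in for a MarathonEnvs executable.
import gym

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
env = VecNormalize(env)  # normalize observations and rewards

model = PPO2("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
```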

Sohojoe commented 5 years ago

@araffin - that is great to hear. I will merge with the latest and re-run the tests

Sohojoe commented 5 years ago

@araffin

I got it training using the same hyperparams that I used with openai.baselines

The good news is that hopper trains well.

I also trained walker2d and got good results.

A couple of bugs I’m struggling with:

1) Loading / running the trained model is not working well. Are you able to load / run saved models?

2) TensorBoard output is huge - almost 3 GB for one training run of 1M steps. I don't see anything close to that with OpenAI.Baselines or with ML-Agents

araffin commented 5 years ago

Good news =)

Loading / running the trained model is not working well

What do you mean by "not working well"? While training the RL zoo (https://github.com/araffin/rl-baselines-zoo, 70+ agents), I did not have any problems

TensorBoard output is huge - almost 3 GB for one training run of 1M steps

Yes, we log many more things than OpenAI, which also explains why training is a bit slower. To switch to legacy TensorBoard logging, the instructions are here: https://stable-baselines.readthedocs.io/en/master/guide/tensorboard.html#legacy-integration

EDIT: verbose and tensorboard_log are two different things; verbose is for terminal output.
EDIT 2: don't forget to update the stable_baselines version in the README ;) (to avoid misleading users)
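A rough sketch of that legacy route, assuming the stable-baselines logger keeps the OpenAI baselines environment-variable convention (see the linked doc for the exact setup; the log directory below is just a placeholder):

```python
# Sketch: use the lighter, baselines-style logging instead of the full
# TensorBoard integration. Assumes the stable-baselines logger follows the
# OpenAI baselines convention; check the linked documentation for details.
import os

os.environ['OPENAI_LOG_FORMAT'] = 'stdout,tensorboard'  # comma-separated formats
os.environ['OPENAI_LOGDIR'] = './logs/hopper_ppo2'       # placeholder directory

from stable_baselines.logger import configure
configure()

# Then create the model WITHOUT passing tensorboard_log=..., so only the
# legacy logger writes (much smaller) TensorBoard event files.
```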

Sohojoe commented 5 years ago

Hmm - very strange; I thought it could be normalization, but I see that you are using that. Maybe I'm doing something dumb - I'll try again by building a script closer to what you have in the zoo and see if that fixes it

I fixed the version number and once I get load/run fixed, I'll push a release

Sohojoe commented 5 years ago

@araffin I fixed it - the problem was with the save / load of the running average; basing my code on the zoo code fixed it.
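For reference, a minimal sketch of that zoo-style pattern (method names as in the VecNormalize wrapper of that stable-baselines version; model names and paths are placeholders):

```python
# Save/load the VecNormalize running averages alongside the model; without
# this, the reloaded policy sees un-normalized observations.
import os

import gym

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
env = VecNormalize(env)
model = PPO2("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

# saving: policy weights + normalization statistics
model.save("ppo2_hopper")
os.makedirs("./normalize", exist_ok=True)
env.save_running_average("./normalize")

# loading for evaluation: freeze the statistics and skip reward normalization
eval_env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
eval_env = VecNormalize(eval_env, training=False, norm_reward=False)
eval_env.load_running_average("./normalize")
model = PPO2.load("ppo2_hopper", env=eval_env)
```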

I'll try some more algorithms tomorrow

araffin commented 5 years ago

Perfect, I think I will link your repo once the new results are published ;)

Sohojoe commented 5 years ago

That would be great re the link! To give some context on the project:

I was not able to get other algorithms training.

It looks like you are further ahead with discrete control vs continuous control. So I think I will push a release tomorrow and I can update it as you get more features online. The main thing I'm hoping for is more multi-agent support. My next focus is to see if I can get HER working on a simple test environment.

araffin commented 5 years ago

@Sohojoe thanks for the clarification =)

It looks like you are further ahead with discrete control vs continuous control.

That's true, most of the algorithms were implemented for Atari only at first. But we plan to improve that in the future (we will soon release an implementation of SAC; I'm currently checking the performance before releasing it)

a2c - runs but does not train

Looks like a bug :/ (I had the same experience: A2C works well with discrete actions, but I could not make it work with continuous actions yet). I'll open an issue

acer - NotImplementedError: WIP: Acer does not support Continuous actions yet.
acktr - NotImplementedError: WIP: ACKTR does not support Continuous actions yet.

Yep, those two are on the roadmap (for ACKTR, it is mainly refactoring; for ACER, it is not implemented), but that will depend on the amount of free time we have...

My next focus is to see if I can get HER working on a simple test environment.

HER is also on our roadmap (the refactoring is 70% done).

DDPG - does not support multi-agent environments yet and does not support normalization.

In fact, DDPG has its own normalization mechanism (this is legacy code); you just have to pass normalize_observations: True and normalize_rewards: True. What type of noise did you use? And did it work with OpenAI baselines?
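For instance, a minimal DDPG setup with its built-in normalization and Ornstein-Uhlenbeck action noise might look like the sketch below. The keyword names are from the DDPG constructor of that era; the sentence above says normalize_rewards, while the argument I know of is normalize_returns, so double-check against the installed version.

```python
# Sketch: DDPG with its own running normalization plus action noise.
# Pendulum-v0 stands in for a single-agent continuous-control task.
import gym
import numpy as np

from stable_baselines import DDPG
from stable_baselines.ddpg.noise import OrnsteinUhlenbeckActionNoise

env = gym.make("Pendulum-v0")
n_actions = env.action_space.shape[-1]
action_noise = OrnsteinUhlenbeckActionNoise(
    mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions))

model = DDPG(
    "MlpPolicy",
    env,
    action_noise=action_noise,
    normalize_observations=True,  # DDPG's own running normalization
    normalize_returns=True,       # prose above says normalize_rewards; verify locally
    verbose=1,
)
model.learn(total_timesteps=10000)
```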

The thing is, because we did a big refactoring to simplify the interface, some bugs may have been introduced, so I'm constantly checking performance to be sure we did not mess anything up.

Sohojoe commented 5 years ago

@araffin - I've been working on folding this and other experimental code back into Marathon Environments, and it's taken longer as Unity 2018.3 was a major physics update and improvement. I've also been adding features such as the ability to specify the number of concurrent agents. I also updated the observations and rewards to be normalized, as it does not make sense to enforce that on the algorithms.

Ideally, I would like to ship a pip package that includes the executables for Windows, Mac, and Linux (so it can be a replacement for MuJoCo or Bullet) - but I'm not sure how to include executables in a pip package (if you have any pointers, that would be great).

araffin commented 5 years ago

I've been working on folding this and other experimental code back into Marathon Environments

Cool! Btw, we recently released v2.4.0, which ships with Soft Actor-Critic (SAC) and policy customization at model creation. SAC is particularly suited for environments with continuous actions, like Marathon Envs ;)
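A minimal SAC sketch with the new policy customization (Pendulum-v0 as a stand-in again; the layer sizes are illustrative, not tuned for Marathon Envs):

```python
# Sketch: SAC on a continuous-control task, with a custom network
# architecture passed via policy_kwargs at model creation (v2.4.0 feature).
import gym

from stable_baselines import SAC

env = gym.make("Pendulum-v0")
model = SAC(
    "MlpPolicy",
    env,
    policy_kwargs=dict(layers=[256, 256]),  # custom hidden-layer sizes
    verbose=1,
)
model.learn(total_timesteps=50000)
model.save("sac_pendulum")
```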

Ideally, I would like to ship a pip package that includes the executables for Windows, Mac, and Linux (so it can be a replacement for MuJoCo or Bullet) - but I'm not sure how to include executables in a pip package (if you have any pointers, that would be great)

I'm afraid a pip package does not really allow that. You can do it with anaconda, though. I don't know if it is possible, but for PyPI, you could download the corresponding binary during installation of the package and show a warning for systems like ARM where you don't have a corresponding binary.
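A rough sketch of that download-on-first-use idea (all URLs, file names, and paths below are hypothetical placeholders):

```python
# Sketch: fetch the right MarathonEnvs executable for the current platform
# the first time it is needed, instead of bundling all binaries in the wheel.
import os
import platform
import stat
import urllib.request

# hypothetical release URLs, one per supported platform
BINARY_URLS = {
    "Windows": "https://example.com/marathon_envs/win/MarathonEnvs.zip",
    "Darwin": "https://example.com/marathon_envs/mac/MarathonEnvs.zip",
    "Linux": "https://example.com/marathon_envs/linux/MarathonEnvs.zip",
}

def ensure_binary(cache_dir="~/.marathon_envs"):
    """Download the platform binary if it is not cached yet; return its path."""
    system = platform.system()
    url = BINARY_URLS.get(system)
    if url is None:
        raise RuntimeError("No MarathonEnvs binary available for %s (e.g. ARM)" % system)
    cache_dir = os.path.expanduser(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)
        os.chmod(target, os.stat(target).st_mode | stat.S_IEXEC)
    return target
```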