Sohojoe / MarathonEnvsBaselines

Experimental - using OpenAI baselines with MarathonEnvs (ML-Agents)
Apache License 2.0

Bad performance with PPO stable-baselines #6

Open araffin opened 5 years ago

araffin commented 5 years ago

Hello, we recently fixed a bug in the PPO2 implementation that should close the observed performance gap ;) So I recommend updating to the latest version. By the way, I'm quite interested in your benchmark results if you run the same tests again.

See https://github.com/hill-a/stable-baselines/issues/75 Fixed in: https://github.com/hill-a/stable-baselines/pull/76
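For a quick sanity check after upgrading (pip install -U stable-baselines), a minimal PPO2 run like the sketch below should be enough; Pendulum-v0 is only a stand-in here for a MarathonEnvs executable, and the hyperparameters are just defaults.

```python
# Minimal PPO2 sanity check after upgrading stable-baselines.
# Pendulum-v0 is a stand-in for a MarathonEnvs executable.
import gym

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
env = VecNormalize(env)  # normalize observations and rewards

model = PPO2("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
```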

Sohojoe commented 5 years ago

@araffin - that is great to hear. I will merge with the latest and re-run the tests

Sohojoe commented 5 years ago

@araffin

I got it training using the same hyperparams that I used with openai.baselines

The good news is that hopper trains well.

I also trained walker2d and got good results.

A couple of bugs I’m struggling with:

1) Loading / running the trained model is not working well. Are you able to load / run saved models?

2) TensorBoard output is huge - almost 3 GB for one training run of 1M steps. I don't see anything close to that with OpenAI.Baselines or with ML-Agents

araffin commented 5 years ago

Good news =)

Loading / running the trained model is not working well

What do you mean by "not working well"? While training the RL zoo (https://github.com/araffin/rl-baselines-zoo, 70+ agents), I did not have any problems

TensorBoard output is huge - almost 3 GB for one training run of 1M steps

Yes, we log many more things than OpenAI, which also explains why training is a bit slower. To switch to legacy TensorBoard logging, the instructions are here: https://stable-baselines.readthedocs.io/en/master/guide/tensorboard.html#legacy-integration

EDIT: verbose and tensorboard_log are two different things; verbose is for terminal output.
EDIT 2: don't forget to update the stable_baselines version in the README ;) (to avoid misleading users)
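A rough sketch of that legacy route, assuming the stable-baselines logger keeps the OpenAI baselines environment-variable convention (see the linked doc for the exact setup; the log directory below is just a placeholder):

```python
# Sketch: use the lighter, baselines-style logging instead of the full
# TensorBoard integration. Assumes the stable-baselines logger follows the
# OpenAI baselines convention; check the linked documentation for details.
import os

os.environ['OPENAI_LOG_FORMAT'] = 'stdout,tensorboard'  # comma-separated formats
os.environ['OPENAI_LOGDIR'] = './logs/hopper_ppo2'       # placeholder directory

from stable_baselines.logger import configure
configure()

# Then create the model WITHOUT passing tensorboard_log=..., so only the
# legacy logger writes (much smaller) TensorBoard event files.
```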

Sohojoe commented 5 years ago

Hmm - very strange; I thought it could be normalization, but I see that you are using that. Maybe I'm doing something dumb - I'll try again by building a script closer to what you have in the zoo and see if that fixes it

I fixed the version number and once I get load/run fixed, I'll push a release

Sohojoe commented 5 years ago

@araffin I fixed it - the problem was with the save / load of the running average; basing my code on the zoo code fixed it.
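For reference, a minimal sketch of that zoo-style pattern (method names as in the VecNormalize wrapper of that stable-baselines version; model names and paths are placeholders):

```python
# Save/load the VecNormalize running averages alongside the model; without
# this, the reloaded policy sees un-normalized observations.
import os

import gym

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
env = VecNormalize(env)
model = PPO2("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

# saving: policy weights + normalization statistics
model.save("ppo2_hopper")
os.makedirs("./normalize", exist_ok=True)
env.save_running_average("./normalize")

# loading for evaluation: freeze the statistics and skip reward normalization
eval_env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
eval_env = VecNormalize(eval_env, training=False, norm_reward=False)
eval_env.load_running_average("./normalize")
model = PPO2.load("ppo2_hopper", env=eval_env)
```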

I'll try some more algorithms tomorrow

araffin commented 5 years ago

Perfect, I think I will link your repo once the new results are published ;)

Sohojoe commented 5 years ago

That would be great re the link! To give some context on the project:

I was not able to get other algorithms training.

It looks like you are further ahead with discrete control vs continuous control. So I think I will push a release tomorrow and I can update it as you get more features online. The main thing I'm hoping for is more multi-agent support. My next focus is to see if I can get HER working on a simple test environment.

araffin commented 5 years ago

@Sohojoe thanks for the clarification =)

It looks like you are further ahead with discrete control vs continuous control.

That's true, most of the algorithms were implemented for Atari only at first. But we plan to improve that in the future (we will soon release an implementation of SAC; I'm currently checking the performance before releasing it)

a2c - runs but does not train

Looks like a bug :/ (I had the same experience: A2C works well with discrete actions, but I could not make it work with continuous actions yet). I'll open an issue

acer - NotImplementedError: WIP: Acer does not support Continuous actions yet.
acktr - NotImplementedError: WIP: ACKTR does not support Continuous actions yet.

Yep, those two are on the roadmap (for ACKTR, it is mainly refactoring; for ACER, it is not implemented), but that will depend on the amount of free time we have...

My next focus is to see if I can get HER working on a simple test environment.

HER is also on our roadmap (the refactoring is 70% done).

DDPG - does not support multi-agent environments yet and does not support normalization.

In fact, DDPG has its own normalization mechanism (this is legacy code); you just have to pass normalize_observations: True and normalize_rewards: True. What type of noise did you use? And did it work with OpenAI baselines?
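For instance, a minimal DDPG setup with its built-in normalization and Ornstein-Uhlenbeck action noise might look like the sketch below. The keyword names are from the DDPG constructor of that era; the sentence above says normalize_rewards, while the argument I know of is normalize_returns, so double-check against the installed version.

```python
# Sketch: DDPG with its own running normalization plus action noise.
# Pendulum-v0 stands in for a single-agent continuous-control task.
import gym
import numpy as np

from stable_baselines import DDPG
from stable_baselines.ddpg.noise import OrnsteinUhlenbeckActionNoise

env = gym.make("Pendulum-v0")
n_actions = env.action_space.shape[-1]
action_noise = OrnsteinUhlenbeckActionNoise(
    mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions))

model = DDPG(
    "MlpPolicy",
    env,
    action_noise=action_noise,
    normalize_observations=True,  # DDPG's own running normalization
    normalize_returns=True,       # prose above says normalize_rewards; verify locally
    verbose=1,
)
model.learn(total_timesteps=10000)
```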

The thing is, because we did a big refactoring to simplify the interface, some bugs may have been introduced, so I'm constantly checking performance to be sure we did not mess anything up.

Sohojoe commented 5 years ago

@araffin - I've been working on folding this and other experimental code back into Marathon Environments, and it's taken longer as Unity 2018.3 was a major physics update and improvement. I've also been adding features such as the ability to specify the number of concurrent agents. I also updated the observations and rewards to be normalized, as it does not make sense to enforce that on the algorithms.

Ideally, I would like to ship a pip package that includes the executables for Windows, Mac, and Linux (so it can be a replacement for MuJoCo or Bullet) - but I'm not sure how to include executables in a pip package (if you have any pointers, that would be great).

araffin commented 5 years ago

I've been working on folding this and other experimental code back into Marathon Environments

Cool! Btw, we recently released v2.4.0, which ships with Soft Actor-Critic (SAC) and policy customization at model creation. SAC is particularly suited for environments with continuous actions, like Marathon Envs ;)
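A minimal SAC sketch with the new policy customization (Pendulum-v0 as a stand-in again; the layer sizes are illustrative, not tuned for Marathon Envs):

```python
# Sketch: SAC on a continuous-control task, with a custom network
# architecture passed via policy_kwargs at model creation (v2.4.0 feature).
import gym

from stable_baselines import SAC

env = gym.make("Pendulum-v0")
model = SAC(
    "MlpPolicy",
    env,
    policy_kwargs=dict(layers=[256, 256]),  # custom hidden-layer sizes
    verbose=1,
)
model.learn(total_timesteps=50000)
model.save("sac_pendulum")
```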

Ideally, I would like to ship a pip package that includes the executables for Windows, Mac, and Linux (so it can be a replacement for MuJoCo or Bullet) - but I'm not sure how to include executables in a pip package (if you have any pointers, that would be great)

I'm afraid a pip package does not really allow that. You can do it with anaconda, though. I don't know if it is possible, but for PyPI, you could download the corresponding binary during installation of the package and show a warning for systems like ARM where you don't have a corresponding binary.
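A rough sketch of that download-on-first-use idea (all URLs, file names, and paths below are hypothetical placeholders):

```python
# Sketch: fetch the right MarathonEnvs executable for the current platform
# the first time it is needed, instead of bundling all binaries in the wheel.
import os
import platform
import stat
import urllib.request

# hypothetical release URLs, one per supported platform
BINARY_URLS = {
    "Windows": "https://example.com/marathon_envs/win/MarathonEnvs.zip",
    "Darwin": "https://example.com/marathon_envs/mac/MarathonEnvs.zip",
    "Linux": "https://example.com/marathon_envs/linux/MarathonEnvs.zip",
}

def ensure_binary(cache_dir="~/.marathon_envs"):
    """Download the platform binary if it is not cached yet; return its path."""
    system = platform.system()
    url = BINARY_URLS.get(system)
    if url is None:
        raise RuntimeError("No MarathonEnvs binary available for %s (e.g. ARM)" % system)
    cache_dir = os.path.expanduser(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)
        os.chmod(target, os.stat(target).st_mode | stat.S_IEXEC)
    return target
```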