Stanford-ILIAD / PantheonRL

PantheonRL is a package for training and testing multi-agent reinforcement learning environments. PantheonRL supports cross-play, fine-tuning, ad-hoc coordination, and more.
MIT License

Other than PPO, the other Stable Baselines3 methods either can't be used in the Overcooked environment or produce an extremely bad strategy. #4

Closed momo-xiaoyi closed 2 years ago

momo-xiaoyi commented 2 years ago

With methods such as DQN and A2C, the result is extremely bad, with a reward of nearly 0, which confuses me. The other methods in Stable Baselines3 (e.g. SAC, TD3, ...) do not support a discrete action space.

bsarkar321 commented 2 years ago

Hi! Can you let us know what command/script you are using to train these agents? Using a modified version of trainer.py (which just includes the other SB3 algorithms in the training script), I was able to get 16.6 for A2C, 11.9 for PPO, and 9.79 for DQN in 10000 timesteps. As you pointed out, the other methods in SB3 have limited action/observation spaces, so they probably cannot help for this environment.

For reference, here is the command for training PPO:

 python3 trainer.py OvercookedMultiEnv-v0 PPO PPO --env-config '{"layout_name":"simple"}' --seed 10 --preset 1 --total-timesteps 10000
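
Roughly, the modification amounts to something like the sketch below. This is illustrative only: the ALGORITHMS registry and make_ego_algorithm helper are made-up names for this example, not the actual structure of trainer.py; only the Stable Baselines3 imports are standard.

```python
# Illustrative sketch only -- not the actual trainer.py code.
# A2C, DQN, and PPO are standard Stable Baselines3 classes.
from stable_baselines3 import A2C, DQN, PPO

# Hypothetical registry mapping command-line algorithm names to SB3 classes.
ALGORITHMS = {
    "PPO": PPO,   # on-policy, does well in Overcooked with default parameters
    "A2C": A2C,   # on-policy, supports discrete actions
    "DQN": DQN,   # off-policy, discrete actions only
}

def make_ego_algorithm(name, env, **kwargs):
    """Look up an SB3 algorithm class by name and construct it on the env."""
    if name not in ALGORITHMS:
        raise ValueError(f"Unsupported algorithm: {name}")
    return ALGORITHMS[name]("MlpPolicy", env, **kwargs)
```

With a registry like this, the algorithm name given on the command line just needs to be looked up before constructing the ego (and partner) learners.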

momo-xiaoyi commented 2 years ago

Thanks for your reply. Since the source code does not register the A2C and DQN methods, I simply added the A2C and DQN registration myself. In my test, A2C initially got seemingly normal results, but once training reached 300,000 timesteps the reward slowly decayed to zero, which confused me.

For reference, here is my command for training A2C:

 python3 trainer.py OvercookedMultiEnv-v0 A2C A2C --env-config '{"layout_name":"simple"}' --seed 5 --preset 1 -t 1000000 

DQN is similar to A2C.

bsarkar321 commented 2 years ago

I ran your command on my computer, and I'm getting similar results for seed 5 (peaking at 9 and then going back down to 0 for A2C). I think hyperparameter tuning (via the --ego-config command-line argument) might be needed for these models to perform at their peak, or perhaps parameter sharing might help them learn better.
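
For example (untuned, just to illustrate the interface, and assuming --ego-config accepts a JSON dictionary of SB3 keyword arguments in the same way --env-config does), a run that lowers A2C's learning rate and adds a bit of entropy regularization could look like:

 python3 trainer.py OvercookedMultiEnv-v0 A2C A2C --env-config '{"layout_name":"simple"}' --ego-config '{"learning_rate":0.0003,"ent_coef":0.01}' --seed 5 --preset 1 -t 1000000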

It is likely that most of these algorithms will simply not perform well in Overcooked, since they assume they are acting in a single-agent environment where only their own actions impact the reward. However, this is a cooperative setting, so that assumption is broken, which is why the rewards crash even though the policy/value losses are very low.

PPO with default parameters seems to do pretty well across most seeds that I've tested (>100 reward in general). Can you verify that PPO works?

momo-xiaoyi commented 2 years ago

Overcooked is just one representative multi-agent environment (PettingZoo is similar). As I understand it, obtaining a reasonable strategy under partial observability is one of the challenges of multi-agent research. From my investigation, existing research on Overcooked mainly uses the TensorFlow framework, so thanks to your team for implementing benchmarks under the PyTorch framework. However, PPO alone as a baseline might not be enough for my research.

PPO did pretty well in my test. I'm curious whether the current package works well with Stable-Baselines3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

bsarkar321 commented 2 years ago

Ok, I think I can close this issue now, since the decline in rewards likely comes from SB3's learning algorithms themselves. As a side note, if there are other packages you would like to see integrated into PantheonRL, we would be happy to look into them. Currently, SB3 focuses only on single-agent RL algorithms, so all of our training is fully decentralized. We are working on adding centralized training with decentralized execution via the MAPPO algorithm, and we will also keep an eye out for any changes to SB3.

We do not currently plan to add support for the stable-baselines3-contrib package, but from a quick glance it looks like most of the Agent code could still be maintained, since these new algorithms still build off of OnPolicyAlgorithm and OffPolicyAlgorithm. However, they change 'MlpPolicy' to different policy types that are not currently compatible with our codebase.
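
As a concrete illustration of that policy-type difference, here is a standalone sketch on a simple single-agent environment (not something we have tested inside PantheonRL's Agent wrappers; it assumes sb3-contrib and Gymnasium are installed):

```python
# Standalone illustration only -- not tested with PantheonRL's Agent wrappers.
import gymnasium as gym  # newer SB3 / sb3-contrib releases expect Gymnasium envs
from sb3_contrib import TRPO, RecurrentPPO

# A simple single-agent environment, just to construct the models.
env = gym.make("CartPole-v1")

# TRPO builds on OnPolicyAlgorithm and still accepts the usual "MlpPolicy".
trpo = TRPO("MlpPolicy", env)

# RecurrentPPO instead expects a recurrent policy ("MlpLstmPolicy"), which is
# the kind of policy-type change our codebase does not currently handle.
recurrent_ppo = RecurrentPPO("MlpLstmPolicy", env)
```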