Farama-Foundation / PettingZoo

An API standard for multi-agent reinforcement learning environments, with popular reference environments and related utilities
https://pettingzoo.farama.org

Integration between PettingZoo and Tensorboard #363

Closed p-veloso closed 3 years ago

p-veloso commented 3 years ago

I created a custom parallel environment using the API. However, when I run the training session, no rewards appear in tensorboard. Is there a tutorial or specification on how to do this? Should I manually add callbacks to my environment code?

from stable_baselines3.ppo import MlpPolicy
from stable_baselines3 import PPO
import supersuit as ss
from petting_bubble_env_continuous import PettingBubblesEnvironment

args = [3, 3, 5, 20]
env = PettingBubblesEnvironment(*args)
env = ss.black_death_v1(env)
env = ss.pettingzoo_env_to_vec_env_v0(env)
env = ss.concat_vec_envs_v0(env, 8, num_cpus=1, base_class='stable_baselines3')

model = PPO(MlpPolicy, env, verbose=2, gamma=0.999, n_steps=1000, ent_coef=0.01, learning_rate=0.00025, vf_coef=0.5, max_grad_norm=0.5, gae_lambda=0.95, n_epochs=4, clip_range=0.2, clip_range_vf=1, tensorboard_log="./ppo_test/")
model.learn(total_timesteps=1000000, tb_log_name="test",  reset_num_timesteps=True)
model.save("bubble_policy_test")
jkterry1 commented 3 years ago

So this is almost certainly a bug in stable baselines 3. Can you confirm that the tensorboard logging works as desired with that code with a single agent environment (e.g. LunarLander in Gym)? If it doesn't, you need to file an upstream issue there. If it does, then I'd be happy to take a look.
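
For reference, a minimal single-agent check could look something like this (assuming gym's LunarLander-v2 is installed; the log directory name is just an example):

import gym
from stable_baselines3 import PPO

# Single-agent sanity check: if rollout/ep_rew_mean shows up in tensorboard here,
# the SB3 logging pipeline works and the problem is in the multi-agent wrapping.
env = gym.make("LunarLander-v2")
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./ppo_single_agent_test/")
model.learn(total_timesteps=50000, tb_log_name="lunarlander_check")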

p-veloso commented 3 years ago

When I run a simple single agent gym environment, I can see the graphs for the reward.

p-veloso commented 3 years ago

Here are the 2 files. petting bubble rl.zip

p-veloso commented 3 years ago

@justinkterry I investigated the code today. In stable baselines 3, when an episode is done, the vectorized environment stores data under the 'episode' and 'terminal_observation' keys of the info dict. The Monitor wrapper is the component responsible for adding the episode information to info["episode"] at the end of the episode. Then the method collect_rollouts in OnPolicyAlgorithm and the method _update_info_buffer in its parent class (BaseAlgorithm) use the data under the 'episode' key to update the episode buffer (self.ep_info_buffer). Finally, the data in this buffer is used to update the rollout graphs in tensorboard during learning.
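
Roughly, the flow looks like this (a simplified sketch with placeholder values, not the actual SB3 code):

from collections import deque

ep_info_buffer = deque(maxlen=100)

# Monitor adds episode stats to info when an episode ends
# (r = episode return, l = episode length, t = elapsed wall time):
info = {"episode": {"r": 12.5, "l": 200, "t": 3.7}, "terminal_observation": None}

# _update_info_buffer only keeps infos that contain the 'episode' key:
maybe_ep_info = info.get("episode")
if maybe_ep_info is not None:
    ep_info_buffer.append(maybe_ep_info)

# During learning, rollout/ep_rew_mean is computed from this buffer; if no info
# ever contains 'episode', the buffer stays empty and nothing reaches tensorboard.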

The reason the problem might be related to PettingZoo or SuperSuit in my case is that supersuit.vector.sb3_vector_wrapper.SB3VecEnvWrapper only returns the 'terminal_observation' key and not 'episode', so stable baselines never sends the reward information to tensorboard.

benblack769 commented 3 years ago

Thanks for figuring this out. I'll try to track down where these infos are coming from exactly and how we can do something similar in supersuit.

benblack769 commented 3 years ago

Ok, so there is an active PR for stable baselines 3 to add support for this. https://github.com/DLR-RM/stable-baselines3/pull/311

Since this PR is already so far along, I don't think we want to reproduce this feature in supersuit at the moment. I posted a comment there noting that there is another request for this feature.

benblack769 commented 3 years ago

If you want to use this feature before the SB3 PR is merged, I think you should be able to install the PR's fork of stable baselines, and it will probably work as described in the PR.

p-veloso commented 3 years ago

Thanks for the replies. @weepingwillowben, I think the problem with using the files from that fork of SB3 is that those monitor wrappers (both Monitor and the new VecMonitor) are designed for gym environments, whereas I am working with a custom parallel environment based on the SuperSuit specifications because I have multiple agents (see the files above).

As all the simulations have the same number of steps, my current approach is to store the reward data in the info dict of agent 0 and then use a custom callback to update tensorboard. I know it is not robust, it does not look good, and it is far from ideal... but it seems to work.

import numpy as np
from stable_baselines3.common import logger
from stable_baselines3.common.callbacks import BaseCallback


class TensorboardCallback(BaseCallback):
    """
    Custom callback for plotting additional values in tensorboard.
    """
    def __init__(self, verbose=0):
        super(TensorboardCallback, self).__init__(verbose)

    def _on_step(self):
        # Peek at the info dict of the first agent in the first vectorized env copy.
        k = list(self.training_env.venv.vec_envs[0].par_env.aec_env.infos.keys())[0]
        d = self.training_env.venv.vec_envs[0].par_env.aec_env.infos[k]
        if "done" in d and d["done"]:
            # At the end of an episode, gather the custom reward stats that the
            # environment stores in the info dict of each vectorized copy.
            all_avg_rewards = np.zeros(len(self.training_env.venv.vec_envs))
            all_std_rewards = np.zeros(len(self.training_env.venv.vec_envs))
            for i in range(len(self.training_env.venv.vec_envs)):
                env = self.training_env.venv.vec_envs[i]
                all_avg_rewards[i] = env.par_env.aec_env.infos[k]["avg_rew"]
                all_std_rewards[i] = env.par_env.aec_env.infos[k]["std_rew"]
            logger.record('episode/avg_rewards', np.mean(all_avg_rewards))
            logger.record('episode/std_rewards', np.mean(all_std_rewards))
        return True
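
The callback is then passed to learn, something like:

model.learn(total_timesteps=1000000, tb_log_name="test", reset_num_timesteps=True, callback=TensorboardCallback())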
benblack769 commented 3 years ago

So the reason the VecMonitor wrapper should work is that supersuit converts a pettingzoo multi-agent environment into a gym single-agent vector environment. This vector environment treats each agent as if it were in a separate environment (although, of course, they are really in the same one). During training there is no explicit interaction between them, except that they share the same environment. Note that certain metrics, like the number of episodes, will be off by a factor of the number of agents. In other words, it is a neat hack for implementing one of the most popular multi-agent RL approaches: parameter sharing of a single-agent method.

To use the PR's code, you should be able to do this:

...
from stable_baselines3.common.vec_env import VecMonitor  # where VecMonitor lives once the PR is merged

env = ss.concat_vec_envs_v0(env, 8, num_cpus=1, base_class='stable_baselines3')
env = VecMonitor(venv=env)

model = PPO(MlpPolicy, env, verbose=2, gamma=0.999, n_steps=1000, ent_coef=0.01, learning_rate=0.00025, vf_coef=0.5, max_grad_norm=0.5, gae_lambda=0.95, n_epochs=4, clip_range=0.2, clip_range_vf=1, tensorboard_log="./ppo_test/")
...
p-veloso commented 3 years ago

@weepingwillowben, this is exactly what I need. Thanks again.

araffin commented 3 years ago

https://github.com/DLR-RM/stable-baselines3/pull/311 is now merged into master