facebookresearch / BenchMARL

A collection of MARL benchmarks based on TorchRL
https://benchmarl.readthedocs.io/
MIT License

Beginner questions #92

Closed · florin-pop closed this issue 3 months ago

florin-pop commented 3 months ago

Hello,

I'll start with a disclaimer: I am a novice when it comes to reinforcement learning and RL frameworks. My goal is to determine whether applying specific structural changes to the MultiAgentMLP and to the loss in various algorithms would lead to different outcomes for specific tasks. For this purpose I wanted to work with an environment where the agent reward has both a shared and an individual component, and to have the agents not share parameters, critics, or observations. I think I managed to set up such a baseline by using the simple_reference environment, setting share_policy_params: False and share_param_critic: False, and using the MADDPG/IDDPG algorithm.
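
For concreteness, this is roughly the setup I arrived at using the Python API from the README (the two sharing flags are the ones I mentioned above; apologies if this is not the intended way to set them):

    from benchmarl.algorithms import MaddpgConfig
    from benchmarl.environments import PettingZooTask
    from benchmarl.experiment import Experiment, ExperimentConfig
    from benchmarl.models.mlp import MlpConfig

    # Load the default configs and turn off parameter sharing.
    experiment_config = ExperimentConfig.get_from_yaml()
    experiment_config.share_policy_params = False  # independent policies per agent

    algorithm_config = MaddpgConfig.get_from_yaml()
    algorithm_config.share_param_critic = False  # independent critics per agent

    experiment = Experiment(
        task=PettingZooTask.SIMPLE_REFERENCE.get_from_yaml(),
        algorithm_config=algorithm_config,
        model_config=MlpConfig.get_from_yaml(),
        critic_model_config=MlpConfig.get_from_yaml(),
        seed=0,
        config=experiment_config,
    )
    experiment.run()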

However, I am confronted with two questions:

  1. What would be the best approach to implement logging per agent so that we can see individual losses and rewards even if the agents are part of the same group?
  2. Did I understand correctly that in the DDPG loss, the loss_actor for example is first computed per agent (as it has the shape batch_size x n_agents) and then reduced to an average, a single number that is used to calculate the gradients for both agents' MLPs? I was guessing that each agent would have its own loss component driving its MLP's gradients.

matteobettini commented 3 months ago

Hey!

Thanks for reaching out, feel free also to use the discord link in the readme if you want to PM me directly.

  1. What would be the best approach to implement logging per agent so that we can see individual losses and rewards even if the agents are part of the same group?

For rewards and other data that comes from collection, I normally use the callbacks to log custom stuff or per-agent things. For example, in on_batch_collected of a custom Callback you can do something like this:

    import torch
    from tensordict import TensorDictBase

    from benchmarl.experiment.callback import Callback


    class PerAgentLoggingCallback(Callback):
        def on_batch_collected(self, batch: TensorDictBase):
            for group in self.experiment.group_map.keys():
                # Per-agent entries to average and log; adapt this list to the
                # keys actually present in your task's collected batch.
                keys_to_log = [
                    (group, "logits"),
                    (group, "observation"),
                    (group, "reward"),
                    (group, "estimated_snd"),
                    (group, "scaling_ratio"),
                ]
                to_log = {}

                for key in keys_to_log:
                    value = batch.get(key)
                    # The agent dimension is the second to last: [..., n_agents, feature_dim]
                    for i, agent in enumerate(self.experiment.group_map[group]):
                        to_log.update(
                            {f"{key}_{agent}": torch.mean(value[..., i, :]).item()}
                        )
                self.experiment.logger.log(
                    to_log,
                    step=self.experiment.n_iters_performed,
                )
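
You can then pass an instance of this callback to your experiment via the callbacks argument of Experiment, e.g. Experiment(..., callbacks=[PerAgentLoggingCallback()]).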

For losses, there is no way to log individual losses (unless you modify the loss class), as they are aggregated within the loss class itself. My suggestion here, if you really want to log individual losses, is to put the agents in different groups.

  2. Did I understand correctly that in the DDPG loss, the loss_actor for example is first computed per agent (as it has the shape batch_size x n_agents) and then reduced to an average, a single number that is used to calculate the gradients for both agents' MLPs? I was guessing that each agent would have its own loss component driving its MLP's gradients.

You seem to understand correctly. However, averaging them together won't cause any mixing issues. This is because summing values coming from different networks and then taking the gradient of that sum with respect to each individual network discards the contributions from the other elements in the sum.

If you have $L = f_\theta(x) + g_\eta(x)$, then $\frac{\partial L}{\partial \theta} = \frac{\partial f_\theta}{\partial \theta}$ and $\frac{\partial L}{\partial \eta} = \frac{\partial g_\eta}{\partial \eta}$.
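
If you want to convince yourself, here is a tiny standalone check in plain PyTorch (not BenchMARL-specific) showing that aggregating losses from independent parameters leaves each parameter's gradient untouched by the other's loss term:

    import torch

    # Two independent parameters, standing in for two agents' networks.
    theta = torch.nn.Parameter(torch.tensor(2.0))
    eta = torch.nn.Parameter(torch.tensor(3.0))
    x = torch.tensor(5.0)

    loss_theta = theta * x      # depends only on theta
    loss_eta = eta * x ** 2     # depends only on eta
    loss = (loss_theta + loss_eta) / 2  # aggregated (mean) loss

    loss.backward()

    print(theta.grad)  # tensor(2.5000)  = x / 2, no contribution from loss_eta
    print(eta.grad)    # tensor(12.5000) = x**2 / 2, no contribution from loss_theta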

If you really want to use a different loss class for each agent, then put each agent in a different group.

florin-pop commented 3 months ago

Thank you Matteo, on_batch_collected worked beautifully, at least for logging episode_reward. The derivative explanation makes perfect sense. I realized that putting each agent in a different group, at least in the case of MADDPG, would be wrong because the agents wouldn't be able to share experience.

matteobettini commented 3 months ago

Yeah that is the catch... but if it is just for debugging it won't matter much. I'll close this now, feel free to reopen if you have further problems.