PKU-MARL / HARL

Official implementation of HARL algorithms based on PyTorch.

What is the difference between FP and EP? #49

Closed: lordemyj closed this issue 1 month ago

lordemyj commented 3 months ago
    if self.state_type == "EP":
        data = (
            share_obs[:, 0],  # (n_threads, share_obs_dim)
            obs.transpose(1, 0, 2),  # (n_agents, n_threads, obs_dim)
            actions.transpose(1, 0, 2),  # (n_agents, n_threads, action_dim)
            available_actions,  # None or (n_agents, n_threads, action_number)
            rewards[:, 0],  # (n_threads, 1)
            np.expand_dims(dones_env, axis=-1),  # (n_threads, 1)
            valid_transitions.transpose(1, 0, 2),  # (n_agents, n_threads, 1)
            terms,  # (n_threads, 1)
            next_share_obs[:, 0],  # (n_threads, next_share_obs_dim)
            next_obs.transpose(1, 0, 2),  # (n_agents, n_threads, next_obs_dim)
            next_available_actions,  # None or (n_agents, n_threads, next_action_number)
        )
    elif self.state_type == "FP":
        data = (
            share_obs,  # (n_threads, n_agents, share_obs_dim)
            obs.transpose(1, 0, 2),  # (n_agents, n_threads, obs_dim)
            actions.transpose(1, 0, 2),  # (n_agents, n_threads, action_dim)
            available_actions,  # None or (n_agents, n_threads, action_number)
            rewards,  # (n_threads, n_agents, 1)
            np.expand_dims(dones, axis=-1),  # (n_threads, n_agents, 1)
            valid_transitions.transpose(1, 0, 2),  # (n_agents, n_threads, 1)
            terms,  # (n_threads, n_agents, 1)
            next_share_obs,  # (n_threads, n_agents, next_share_obs_dim)
            next_obs.transpose(1, 0, 2),  # (n_agents, n_threads, next_obs_dim)
            next_available_actions,  # None or (n_agents, n_threads, next_action_number)
        )

When `self.state_type == "EP"`, why is only the first agent's reward taken (`rewards[:, 0]`), and why is the second agent's reward ignored?

Ivan-Zhong commented 2 months ago

Hi, sorry for the late reply. EP and FP were first introduced in the MAPPO paper (Figure 4). EP stands for the environment-provided global state: the critic receives the same global state input for all actors. FP stands for the (feature-pruned) agent-specific global state: the critic receives a different global state input for each actor. That is why, in FP, the critic-related data always carries an extra n_agents dimension to hold the per-agent inputs.

As for the rewards: since we consider fully cooperative scenarios, every agent receives the same total team reward. In EP we therefore save only the first agent's reward, as the others are identical copies, while in FP we save all agents' rewards for the convenience of data processing, so the shapes line up with the per-agent critic inputs.
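To make this concrete, here is a minimal NumPy sketch (all shapes, names, and values are hypothetical illustrations, not taken from the HARL runner) showing why `rewards[:, 0]` loses nothing under EP, and how the FP data keeps the extra n_agents dimension:

    import numpy as np

    n_threads, n_agents, share_obs_dim = 2, 3, 4

    # Fully cooperative setting: every agent in a thread receives the same
    # team reward, so the per-agent copies are redundant.
    team_reward = np.array([[1.5], [0.0]])  # (n_threads, 1)
    rewards = np.repeat(team_reward[:, None, :], n_agents, axis=1)  # (n_threads, n_agents, 1)
    assert np.all(rewards == rewards[:, :1])  # agent 0's reward equals everyone's

    # EP: one environment-provided global state per thread, so one reward
    # per thread is enough for the critic.
    share_obs = np.zeros((n_threads, n_agents, share_obs_dim))
    ep_state, ep_rewards = share_obs[:, 0], rewards[:, 0]

    # FP: an agent-specific global state per (thread, agent) pair, so the
    # rewards keep the n_agents dimension to line up with those inputs.
    fp_state, fp_rewards = share_obs, rewards

    print(ep_state.shape, ep_rewards.shape)  # (2, 4) (2, 1)
    print(fp_state.shape, fp_rewards.shape)  # (2, 3, 4) (2, 3, 1)

Either way the critic is trained against the same team reward; FP merely duplicates it per agent so that indexing stays uniform across the agent-specific critic inputs.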