PKU-MARL / HARL

Official implementation of HARL algorithms based on PyTorch.

What is the difference between FP and EP? #49

Closed: lordemyj closed this issue 1 month ago

lordemyj commented 3 months ago
    if self.state_type == "EP":
        data = (
            share_obs[:, 0],  # (n_threads, share_obs_dim)
            obs.transpose(1, 0, 2),  # (n_agents, n_threads, obs_dim)
            actions.transpose(1, 0, 2),  # (n_agents, n_threads, action_dim)
            available_actions,  # None or (n_agents, n_threads, action_number)
            rewards[:, 0],  # (n_threads, 1)
            np.expand_dims(dones_env, axis=-1),  # (n_threads, 1)
            valid_transitions.transpose(1, 0, 2),  # (n_agents, n_threads, 1)
            terms,  # (n_threads, 1)
            next_share_obs[:, 0],  # (n_threads, next_share_obs_dim)
            next_obs.transpose(1, 0, 2),  # (n_agents, n_threads, next_obs_dim)
            next_available_actions,  # None or (n_agents, n_threads, next_action_number)
        )
    elif self.state_type == "FP":
        data = (
            share_obs,  # (n_threads, n_agents, share_obs_dim)
            obs.transpose(1, 0, 2),  # (n_agents, n_threads, obs_dim)
            actions.transpose(1, 0, 2),  # (n_agents, n_threads, action_dim)
            available_actions,  # None or (n_agents, n_threads, action_number)
            rewards,  # (n_threads, n_agents, 1)
            np.expand_dims(dones, axis=-1),  # (n_threads, n_agents, 1)
            valid_transitions.transpose(1, 0, 2),  # (n_agents, n_threads, 1)
            terms,  # (n_threads, n_agents, 1)
            next_share_obs,  # (n_threads, n_agents, next_share_obs_dim)
            next_obs.transpose(1, 0, 2),  # (n_agents, n_threads, next_obs_dim)
            next_available_actions,  # None or (n_agents, n_threads, next_action_number)
        )

When `self.state_type == "EP"`, why is only the first agent's reward taken (`rewards[:, 0]`), and why is the second agent's reward ignored?

Ivan-Zhong commented 2 months ago

Hi, sorry for the late reply. EP and FP were first introduced in the MAPPO paper (Figure 4). EP stands for the environment-provided global state: the critic receives the same global state input for all actors. FP stands for the (feature-pruned) agent-specific global state: the critic receives a different global state input for each actor. That is why, in FP, the critic-related data always carries an extra n_agents dimension to hold the per-agent inputs.

As for the rewards: since we consider fully cooperative scenarios, every agent receives the same total team reward. In EP we therefore save only the first agent's reward, as the others are identical copies, while in FP we save all agents' rewards for the convenience of data processing, so the shapes line up with the per-agent critic inputs.
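To make this concrete, here is a minimal NumPy sketch (all shapes, names, and values are hypothetical illustrations, not taken from the HARL runner) showing why `rewards[:, 0]` loses nothing under EP, and how the FP data keeps the extra n_agents dimension:

    import numpy as np

    n_threads, n_agents, share_obs_dim = 2, 3, 4

    # Fully cooperative setting: every agent in a thread receives the same
    # team reward, so the per-agent copies are redundant.
    team_reward = np.array([[1.5], [0.0]])  # (n_threads, 1)
    rewards = np.repeat(team_reward[:, None, :], n_agents, axis=1)  # (n_threads, n_agents, 1)
    assert np.all(rewards == rewards[:, :1])  # agent 0's reward equals everyone's

    # EP: one environment-provided global state per thread, so one reward
    # per thread is enough for the critic.
    share_obs = np.zeros((n_threads, n_agents, share_obs_dim))
    ep_state, ep_rewards = share_obs[:, 0], rewards[:, 0]

    # FP: an agent-specific global state per (thread, agent) pair, so the
    # rewards keep the n_agents dimension to line up with those inputs.
    fp_state, fp_rewards = share_obs, rewards

    print(ep_state.shape, ep_rewards.shape)  # (2, 4) (2, 1)
    print(fp_state.shape, fp_rewards.shape)  # (2, 3, 4) (2, 3, 1)

Either way the critic is trained against the same team reward; FP merely duplicates it per agent so that indexing stays uniform across the agent-specific critic inputs.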