ermongroup / MA-AIRL

Multi-Agent Adversarial Inverse Reinforcement Learning, ICML 2019.

Why the advantage is computed as "reward - value" instead of "reward + gamma*next value - value" #7

Open pengzhenghao opened 1 year ago

See:

https://github.com/ermongroup/MA-AIRL/blob/master/multi-agent-irl/irl/mack/airl.py#L168

        def train(obs, states, rewards, masks, actions, values):
            # per-agent advantage: reward minus the value baseline
            advs = [rewards[k] - values[k] for k in range(num_agents)]

In most RL algorithms the advantage is computed as reward_t + gamma * value(state_t+1) - value(state_t). Why is this form used here instead?
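For concreteness, here is a small NumPy sketch of the two estimates being contrasted. The array names (`rewards`, `values`, `next_values`) are illustrative, not taken from the repo; note that in baselines-style A2C code the `rewards` argument to `train` is often already a discounted n-step return computed in the runner, in which case `rewards - values` is itself a standard advantage estimate, so whether the two forms differ depends on what the caller puts into `rewards`:

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 2.0, 1.0])   # r_t for one agent (made-up numbers)
values = np.array([0.5, 0.4, 1.5, 0.8])    # V(s_t)
next_values = np.append(values[1:], 0.0)   # V(s_{t+1}); 0 after the last step

# One-step TD advantage, as in most actor-critic implementations:
td_adv = rewards + gamma * next_values - values

# The form used in airl.py: reward minus value.
# If `rewards` actually holds discounted returns R_t = sum_k gamma^k r_{t+k},
# then R_t - V(s_t) is the usual Monte-Carlo-style advantage estimate.
simple_adv = rewards - values
```

The two arrays coincide only when `rewards` already folds in the bootstrapped future value, which is the key point the question turns on.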