pengzhenghao opened this issue 1 year ago
See:
https://github.com/ermongroup/MA-AIRL/blob/master/multi-agent-irl/irl/mack/airl.py#L168
```python
def train(obs, states, rewards, masks, actions, values):
    # Advantage for each agent: reward estimate minus value baseline.
    advs = [rewards[k] - values[k] for k in range(num_agents)]
```
In most RL algorithms, the advantage is computed as

`reward_t + gamma * value(state_t+1) - value(state_t)`

Why is the form `rewards[k] - values[k]` used here instead?
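For context, one plausible reading (an assumption on my part, not confirmed from the repo): in baselines-style A2C/ACKTR runners, the `rewards` array passed to `train` already holds discounted n-step returns, so `rewards[k] - values[k]` would be the return-to-go minus the value baseline, `G_t - V(s_t)`, rather than a raw one-step reward minus the value. Below is a minimal sketch contrasting the two advantage estimators; the function names and array shapes are hypothetical:

```python
import numpy as np

def td_advantage(rewards, values, next_values, dones, gamma=0.99):
    """One-step TD advantage: r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return rewards + gamma * next_values * (1.0 - dones) - values

def mc_advantage(rewards, values, dones, gamma=0.99):
    """Return-to-go minus baseline: G_t - V(s_t), where
    G_t = r_t + gamma * r_{t+1} + ... until the episode ends."""
    returns = np.zeros_like(rewards)
    running = 0.0
    # Accumulate discounted returns backwards, resetting at episode ends.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns - values

# Both are valid advantage estimators (they differ in bias/variance);
# if `rewards` already contains discounted returns, subtracting
# `values` is equivalent to mc_advantage.
```

If the runner does apply discounting before calling `train`, the code in airl.py would be computing the second form, and the apparent absence of `gamma * value(state_t+1)` is explained by the returns already carrying the discounted future rewards.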