PKU-MARL / HARL

Official implementation of HARL algorithms based on PyTorch.

question on the line 120 of off_policy_ha_runner.py #29

Closed zhangzhang80 closed 8 months ago

zhangzhang80 commented 9 months ago

As I understand it, line 120, "value_pred = self.critic.get_values(sp_share_obs, actions_t)", is used to estimate the Q value for each agent's current action. When self.state_type == "FP", I wonder whether the "sp_share_obs" in line 120 should instead be "sp_share_obs[agent_id, :]", since in the FP setting the agent-specific global state should be used?

guazimao commented 9 months ago

Hi. You can take a look at line 58 of off_policy_buffer_fp.py. We have transformed the shape of sp_share_obs from (n_agents, batch_size, dim) to (n_agents * batch_size, *dim), essentially treating each agent's distinct global state as a separate training sample.
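
Roughly, the reshape does something like the following sketch (the tensor names and sizes here are purely illustrative, not the repo's actual dimensions):

```python
import torch

# Illustrative sizes only.
n_agents, batch_size, obs_dim, act_dim = 3, 4, 5, 2

# In the FP setting, each agent has its own agent-specific global state,
# so a sampled batch carries one global state per agent per transition.
sp_share_obs = torch.randn(n_agents, batch_size, obs_dim)
actions_t = torch.randn(n_agents, batch_size, act_dim)

# Flatten the agent axis into the batch axis: each agent's own global state
# (with the corresponding actions) becomes an independent critic sample.
sp_share_obs_flat = sp_share_obs.reshape(n_agents * batch_size, obs_dim)
actions_flat = actions_t.reshape(n_agents * batch_size, act_dim)

print(sp_share_obs_flat.shape)  # torch.Size([12, 5])
print(actions_flat.shape)       # torch.Size([12, 2])
```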

zhangzhang80 commented 9 months ago

Thank you for your kind explanation. So, although the policy models are optimized agent by agent (line 99 in off_policy_ha_runner), the actor_loss for a given agent_id also depends on the other agents' value_pred. Is that right?
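
For concreteness, something like this simplified sketch of a sequential actor update is what I have in mind (the names actors, critic, obs, share_obs are my own illustration, not the repo's actual interfaces):

```python
import torch

def sequential_actor_update(actors, actor_optims, critic, obs, share_obs):
    """Update each agent's actor in turn against a shared critic.

    obs: list of per-agent observation tensors; share_obs: critic input.
    All interfaces here are assumptions for illustration only.
    """
    n_agents = len(actors)
    # Actions the actors currently produce, before any update.
    joint_actions = [actors[i](obs[i]) for i in range(n_agents)]

    for agent_id in range(n_agents):
        # Recompute only this agent's action with gradients enabled;
        # keep the (possibly already updated) actions of the others fixed.
        actions = [a.detach() for a in joint_actions]
        actions[agent_id] = actors[agent_id](obs[agent_id])

        # The critic scores the *joint* action, so this agent's actor loss
        # depends on what every other agent did.
        value_pred = critic(share_obs, torch.cat(actions, dim=-1))
        actor_loss = -value_pred.mean()

        actor_optims[agent_id].zero_grad()
        actor_loss.backward()
        actor_optims[agent_id].step()

        # Refresh this agent's action so agents updated after it see it.
        with torch.no_grad():
            joint_actions[agent_id] = actors[agent_id](obs[agent_id])
```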

guazimao commented 8 months ago

Yep.