Metro1998 / hppo-in-traffic-signal-control


some question about hybrid action space #3

Closed acezsq closed 9 months ago

acezsq commented 9 months ago

First of all, thank you very much for open-sourcing your code. I would like to apply the HPPO algorithm to my own environment, so I have a few questions about your code that I hope you can answer:

● I have some questions about the definition of the hybrid action space in the gym environment you used; I haven't worked with your traffic-related environment before. Based on my reading of the hybrid action space defined in environments/HybridIntersection_v1.py, the first layer consists of 8 discrete actions, and the second layer holds the continuous value each of those actions can take. In the env.step(action) function, I assume we first obtain the discrete action with stage_next = action[0], and then obtain the specific value for executing that action with action[1][stage_next]. Is my understanding correct (see the sketch after this comment)?

● I understand that the environment you used is a multi-agent task. In train_supplement_revision.py there is the line [self.push_history_hybrid(i, observations[i], actions[0][i], value_action_logp[i][1][1], logp_actions[0][i], logp_actions[1][i], values[i]) for i in range(self.num_agent) if agents_to_update[i] == 1]. If I want to change it to a single-agent setting without collecting data through multiple parallel environments, what should I pay attention to?

● Is the current code complete? train_supplement_revision.py defines the training process, and PPO_family.py defines HPPO.
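For reference, a minimal Python sketch of the action-parsing interpretation described in the first question, assuming the hybrid action is a pair of (discrete stage index, array of per-stage continuous values). The NUM_STAGES constant and the step function here are illustrative only, not the repository's actual env.step:

```python
# Hypothetical sketch of unpacking a hybrid action of the form
# (discrete stage, per-stage continuous values); not the repo's actual code.
import numpy as np

NUM_STAGES = 8  # assumed number of discrete stages, matching the question above

def step(action):
    stage_next = int(action[0])              # discrete part: which stage to run next
    duration = float(action[1][stage_next])  # continuous part: the value tied to that stage
    # ... a simulator would apply (stage_next, duration) and return obs, reward, done, info
    return stage_next, duration

# Example: a random stage index plus 8 candidate durations, one per stage
example_action = (np.random.randint(NUM_STAGES),
                  np.random.uniform(5.0, 30.0, size=NUM_STAGES))
print(step(example_action))
```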

Metro1998 commented 9 months ago

I think you are probably Chinese, so I'll answer in Chinese.

1. Your understanding is correct. The (continuous) actor's number of output heads equals the dimension of the discrete action space; the idea is that if the next stage chosen is stage_next, the corresponding head gives its duration. Digging a bit deeper, this design is open to criticism (there is room for improvement); HyAR, for example, does not approximate the entire action space this way.

2. In essence it makes little difference. For the observations, the dimensions can be (num_envs, num_agents, obs_dim); if you are single-agent, set the number of agents to 1, or simply drop that dimension. You may need to be careful when designing the buffer and when computing the advantage.

3. The code is not complete. I am also working on my graduation project at the moment, and I will later release a more stable main.py along with an envpool-based environment engine (after Chinese New Year?). Until then I recommend using your own engine (implemented yourself on top of SUMO or CityFlow). This repository is mainly meant as a reference for the HPPO implementation.
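To illustrate point 1, a minimal PyTorch sketch of an actor whose continuous heads match the discrete action dimension, so each discrete stage gets its own duration. The class and variable names (HybridActor, obs_dim, num_stages) are assumptions for this example and are not taken from PPO_family.py:

```python
import torch
import torch.nn as nn

class HybridActor(nn.Module):
    """Sketch only: one discrete head plus one continuous head per discrete action."""
    def __init__(self, obs_dim: int, num_stages: int = 8):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.discrete_head = nn.Linear(64, num_stages)        # logits over the 8 stages
        self.mu_head = nn.Linear(64, num_stages)               # one duration mean per stage
        self.log_std = nn.Parameter(torch.zeros(num_stages))   # state-independent std

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        stage_dist = torch.distributions.Categorical(logits=self.discrete_head(h))
        duration_dist = torch.distributions.Normal(self.mu_head(h), self.log_std.exp())
        return stage_dist, duration_dist

# Usage: sample a stage, then keep only the duration of the chosen stage.
actor = HybridActor(obs_dim=12)
obs = torch.randn(1, 12)
stage_dist, duration_dist = actor(obs)
stage = stage_dist.sample()                          # discrete action, shape (1,)
durations = duration_dist.sample()                   # (1, 8) continuous candidates
duration = durations.gather(1, stage.unsqueeze(1))   # duration of the selected stage
```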

acezsq commented 9 months ago

Haha, thanks for the reply. I'm building a controller for a robot and need to design a hybrid action space combined with a PPO algorithm for training. I'm indeed mainly looking at your code as a reference for the HPPO implementation, since there are almost no HPPO implementations available online. Thanks a lot!

Metro1998 commented 9 months ago

You're welcome. I'd also recommend HyAR, which is compatible with more algorithms and has a smaller action space.