LunarEngineer opened 3 years ago
One idea:
When we were building our own custom agent, we conceptually made this easier by breaking the action into its constituent components: the function ID had its own net, the location had its own net, and the radius had its own net. I think this takes care of flattening and symmetrization/normalization. We can bound the locations and radii by something reasonable like (0, 100) or (0, 100 * sqrt(2)), and the function-ID component is just a finite discrete one-hot vector, which should be no problem to handle.
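To make the split concrete, here's a minimal numpy sketch of encoding one action as the concatenation of a one-hot function ID plus the continuous components scaled into [0, 1] by the bounds above. The number of function IDs (`NUM_FUNCTION_IDS`) and the helper name are hypothetical, not something from our codebase:

```python
import numpy as np

LOC_MAX = 100.0                   # location bound discussed above
RADIUS_MAX = 100.0 * np.sqrt(2)   # diagonal of a 100 x 100 area
NUM_FUNCTION_IDS = 8              # assumption: finite set of function IDs

def encode_action(function_id, x, y, radius):
    """Flatten one action: one-hot function ID + normalized (x, y, radius)."""
    one_hot = np.zeros(NUM_FUNCTION_IDS, dtype=np.float32)
    one_hot[function_id] = 1.0
    continuous = np.array(
        [x / LOC_MAX, y / LOC_MAX, radius / RADIUS_MAX], dtype=np.float32
    )
    return np.concatenate([one_hot, continuous])
```

Each component stays bounded and normalized, so the flattened vector is friendly to whatever net consumes it.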
So in other words, one solution is to have three A2C agents (or two A2C and one DQN, if A2C doesn't handle the discrete action space well). The tradeoff here is that each action component will be conditioned on the current state (all actions dropped thus far) but computed independently of the other components, unless we cascade the output of one net into the input of the next, like we talked about.
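A rough sketch of what the cascading variant would look like. The three policy functions below are hypothetical stand-ins (in practice each would be its own SB3 model); the point is only the wiring, where each later component sees the earlier components appended to its observation:

```python
import numpy as np

rng = np.random.default_rng(0)

def function_id_policy(state):
    # Discrete head: pick one of 4 function IDs from the state alone (assumption).
    return int(np.argmax(state[:4]))

def location_policy(state, function_id):
    # Cascaded: the chosen function ID is appended to the observation.
    aug = np.concatenate([state, [function_id]])
    return np.clip(aug[:2] * 100.0, 0.0, 100.0)  # (x, y) bounded to [0, 100]

def radius_policy(state, function_id, location):
    # Cascaded again: sees the state, the function ID, and the location.
    aug = np.concatenate([state, [function_id], location])
    return float(np.clip(aug.mean() * 100.0 * np.sqrt(2), 0.0, 100.0 * np.sqrt(2)))

state = rng.random(8).astype(np.float32)
fid = function_id_policy(state)
loc = location_policy(state, fid)
rad = radius_policy(state, fid, loc)
```

Without the cascade we'd just call all three on `state` alone, which is simpler but loses the dependency between components.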
Since we're settling on Stable Baselines 3 as our agent framework, we should formalize a decision; Stable Baselines recommends that the action space be flattened, symmetric, and normalized. My thoughts are above, and I'd like some group input on what you all think.
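For the "symmetric and normalized" part, the usual trick is to let the policy act in [-1, 1] and rescale inside the environment. A minimal sketch (helper names are my own, not SB3 API):

```python
import numpy as np

def to_symmetric(value, low, high):
    """Rescale a value from [low, high] into the symmetric range [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0

def from_symmetric(action, low, high):
    """Map a policy output in [-1, 1] back to the native [low, high] range."""
    return low + (np.clip(action, -1.0, 1.0) + 1.0) * (high - low) / 2.0
```

So the agent would always see a [-1, 1] box per continuous component, and the env's `step()` would apply `from_symmetric` with the (0, 100) or (0, 100 * sqrt(2)) bounds before executing the action.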