This task has been given low priority, as other issues need to be addressed first.
The general structure is ready on the PPO branch and runnable with one gradient step. However, the single-agent policy seems to converge to extreme values, so nothing of much value is learned.
We should start working on a new DRL algorithm based on the multi-agent PPO (MAPPO) algorithm; it promises significant speed improvements and would resolve the criticism of the centralized-critic approach.
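
Not spelled out in this issue, but as a rough point of reference, here is a minimal MAPPO-style sketch in PyTorch (all class names, dimensions, and hyperparameters are hypothetical, not taken from the repo): decentralized per-agent actors, one centralized critic over the joint observation, and the clipped PPO surrogate for the actor update.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only
N_AGENTS, OBS_DIM, ACT_DIM = 2, 8, 3

class Actor(nn.Module):
    """Decentralized policy: each agent acts on its own observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, ACT_DIM)
        )

    def forward(self, obs):
        # Discrete action distribution over ACT_DIM actions
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralCritic(nn.Module):
    """Centralized value function: sees the joint observation of all agents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_AGENTS * OBS_DIM, 64), nn.Tanh(), nn.Linear(64, 1)
        )

    def forward(self, joint_obs):
        return self.net(joint_obs).squeeze(-1)

def ppo_loss(actor, obs, actions, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate for one agent's actor."""
    dist = actor(obs)
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (min) clipped objective, negated for gradient descent
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Example usage with a random batch (illustrative only):
actor = Actor()
obs = torch.randn(32, OBS_DIM)
actions = torch.randint(0, ACT_DIM, (32,))
old_log_probs = actor(obs).log_prob(actions).detach()
advantages = torch.randn(32)
loss = ppo_loss(actor, obs, actions, old_log_probs, advantages)
loss.backward()
```

If this matches the intended design, the ratio clipping is also what should keep a single gradient step from jumping to the extreme values observed above.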