[x] Should rewards be computed from observations or from states? Currently, rewards are derived from observations, but the formal MDP definition suggests states. Also, with multi-agent behaviour in mind, rewards should probably be computed with respect to the global state (racing example: ranking-based rewards).
[ ] Right now, tasks are assigned globally (every agent tries to solve the same task). It might be better to assign tasks to each agent individually.
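The ranking-based reward idea from the first item could be sketched as follows. This is a minimal illustration, not the project's implementation; the names `GlobalState` and `ranking_rewards` are hypothetical, and the linear rank-to-reward mapping is one arbitrary choice among many.

```python
# Hypothetical sketch: per-agent rewards derived from the global state
# rather than from individual observations (racing example: rank-based).
from dataclasses import dataclass

@dataclass
class GlobalState:
    # Distance each agent has covered along the track, keyed by agent id.
    progress: dict[str, float]

def ranking_rewards(state: GlobalState) -> dict[str, float]:
    """Reward each agent by current rank: leader gets 1.0, last gets 0.0."""
    order = sorted(state.progress, key=state.progress.get, reverse=True)
    n = len(order)
    if n == 1:
        return {order[0]: 1.0}
    return {agent: 1.0 - rank / (n - 1) for rank, agent in enumerate(order)}

state = GlobalState(progress={"a": 12.0, "b": 30.5, "c": 7.1})
rewards = ranking_rewards(state)
# b leads (1.0), a is second (0.5), c is last (0.0)
```

Because the ranking depends on every agent's progress, it cannot in general be recovered from a single agent's observation, which is the argument for computing such rewards from the global state.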
Issues to solve/consider: