FLAIROx / JaxMARL

Multi-Agent Reinforcement Learning with JAX
Apache License 2.0

Dimensions of `world_state` and `reward` do not match #85

Closed victor-qin closed 4 months ago

victor-qin commented 4 months ago

https://github.com/FLAIROx/JaxMARL/blob/d941770bb1bf6945412c70c59c509226d9d39628/baselines/MAPPO/mappo_rnn_mpe.py#L292C1-L295C33

Each agent receives (more or less) the same reward at each step of the environment. In the output of `_env_state`, the batched rewards are ordered agent-major: [(agent1, env1), (agent1, env2), ..., (agent2, env1), (agent2, env2), ...], but the batched `world_state`s are ordered env-major: [(agent1, env1), (agent2, env1), (agent3, env1), (agent1, env2), ...] (where (agent1, env1) = (agent2, env1) = (agent3, env1), since the world state is shared across agents).

This breaks the advantage calculation later: `last_val` inherits the ordering of `world_state`, so each reward is matched against the wrong value estimate.

To reproduce:

Solution: it should just be a matter of changing the `order` option in the `jnp.reshape` call so both tensors are flattened the same way.
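A minimal sketch of the mismatch, using hypothetical sizes and plain NumPy (whose `reshape` `order` semantics `jnp.reshape` follows): flattening a `(num_agents, num_envs)` array with `order="C"` gives the agent-major layout of the rewards, while `order="F"` gives the env-major layout the world states end up in.

```python
import numpy as np

num_agents, num_envs = 3, 2  # hypothetical sizes, for illustration only

# Encode each slot as agent_index * 10 + env_index so the orderings are visible.
grid = np.arange(num_agents)[:, None] * 10 + np.arange(num_envs)[None, :]

# Rewards are batchified agent-major (row-major, order="C"):
# (agent0, env0), (agent0, env1), (agent1, env0), ...
agent_major = grid.reshape(-1)            # [ 0,  1, 10, 11, 20, 21]

# world_state ends up env-major (column-major, order="F"):
# (agent0, env0), (agent1, env0), (agent2, env0), (agent0, env1), ...
env_major = grid.reshape(-1, order="F")   # [ 0, 10, 20,  1, 11, 21]

# Index i now refers to a different (agent, env) pair in each vector:
assert not np.array_equal(agent_major, env_major)

# Flattening both with the same order option restores the alignment:
assert np.array_equal(grid.reshape(-1, order="C"), agent_major)
```

With the sizes above, position 1 holds (agent0, env1)'s reward but (agent1, env0)'s value, which is exactly the misalignment that corrupts the advantage estimates.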

amacrutherford commented 4 months ago

cheers for spotting this! Fixed with #87