I ran the COMA, HATRPO, and MAPPO algorithms in the Simple Spread environment for 500,000 timesteps. None of them achieved a reward higher than -100, yet most rewards in the results folder are in the range of -30 to -40. After training, the reward is even lower than it was at the start. The model parameters I used are the same as those in the results folder.
from marllib import marl
# prepare env
env = marl.make_env(environment_name="mpe", map_name="simple_spread", force_coop=True)
# initialize algorithm with appointed hyper-parameters
coma = marl.algos.coma(hyperparam_source='mpe')
# build agent model based on env + algorithms + user preference
model = marl.build_model(env, coma, {"core_arch": "gru", "encode_layer": "128-256"})
# start training
coma.fit(env, model, stop={'timesteps_total': 500000}, share_policy='group', checkpoint_freq=100000, checkpoint_end=True)
Are you plotting episode_reward_mean or episode_reward_max? I suspect that the "reward" in the results CSV is ray/tune's episode_reward_max rather than the mean, but I may be wrong.
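If it helps, here is a minimal sketch for checking which metric you are looking at, assuming your trial directory contains Ray/Tune's standard progress.csv (the path below is a placeholder you would need to replace with your own run's path):
import pandas as pd
import matplotlib.pyplot as plt
# placeholder path -- point this at the progress.csv inside your trial directory
progress_csv = "exp_results/<your_trial>/progress.csv"
df = pd.read_csv(progress_csv)
# episode_reward_mean and episode_reward_max are standard Ray/Tune metrics;
# plotting both makes it clear which one the results folder is reporting
plt.plot(df["timesteps_total"], df["episode_reward_mean"], label="episode_reward_mean")
plt.plot(df["timesteps_total"], df["episode_reward_max"], label="episode_reward_max")
plt.xlabel("timesteps_total")
plt.ylabel("episode reward")
plt.legend()
plt.show()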