I ran the COMA, HATRPO, and MAPPO algorithms in the Simple Spread environment for 500,000 timesteps. None of them achieved a reward higher than -100, yet most rewards in the results folder are in the range of -30 to -40. After training, the reward is even lower than it was at the start. The model parameters I used are the same as those in the results folder.
from marllib import marl
# prepare env
env = marl.make_env(environment_name="mpe", map_name="simple_spread", force_coop=True)
# initialize algorithm with appointed hyper-parameters
coma = marl.algos.coma(hyperparam_source='mpe')
# build agent model based on env + algorithms + user preference
model = marl.build_model(env, coma, {"core_arch": "gru", "encode_layer": "128-256"})
# start training
coma.fit(env, model, stop={'timesteps_total': 500000}, share_policy='group', checkpoint_freq=100000, checkpoint_end=True)
Are you plotting episode_reward_mean or episode_reward_max? I suspect that the "reward" in the results CSV is ray/tune's episode_reward_max rather than the mean, but I may be wrong.
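If it helps, here is a minimal sketch for checking which metric you are looking at, assuming your trial directory contains Ray/Tune's standard progress.csv (the path below is a placeholder you would need to replace with your own run's path):
import pandas as pd
import matplotlib.pyplot as plt
# placeholder path -- point this at the progress.csv inside your trial directory
progress_csv = "exp_results/<your_trial>/progress.csv"
df = pd.read_csv(progress_csv)
# episode_reward_mean and episode_reward_max are standard Ray/Tune metrics;
# plotting both makes it clear which one the results folder is reporting
plt.plot(df["timesteps_total"], df["episode_reward_mean"], label="episode_reward_mean")
plt.plot(df["timesteps_total"], df["episode_reward_max"], label="episode_reward_max")
plt.xlabel("timesteps_total")
plt.ylabel("episode reward")
plt.legend()
plt.show()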