"MulBackward0 returned nan values" error when launching HATRPO after HAPPO

Hello, and thanks for your work. I am using MARLlib to train in my custom environment with different algorithms. If training with HATRPO is started immediately after training with HAPPO, the following error occurs: "RuntimeError: Function 'MulBackward0' returned nan values in its 0th output". The following code reproduces the error:

from marllib import marl

# First run: HAPPO
env = marl.make_env("gymnasium_mpe", "simple_spread")
algo = marl.algos.happo(hyperparam_source="common")
model = marl.build_model(env, algo, {"core_arch": "mlp"})
algo.fit(env, model, stop={'timesteps_total': 1000})

# Second run in the same script: HATRPO (this is where the RuntimeError is raised)
env = marl.make_env("gymnasium_mpe", "simple_spread")
algo = marl.algos.hatrpo(hyperparam_source="common")
model = marl.build_model(env, algo, {"core_arch": "mlp"})
algo.fit(env, model, stop={'timesteps_total': 1000})

I am also attaching the full log. MARLlib was installed with conda and no GPU was used. The error occurs when running in local mode and does not reproduce with local_mode=False.
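Since the error only shows up in local mode, one possible workaround is to run each algorithm in its own Python process. The sketch below is untested against MARLlib itself and assumes the failure comes from state shared inside a single process, which I have not confirmed. The helper names `make_script` and `run_isolated` are hypothetical and not part of MARLlib; the embedded script is the same as the reproduction code above, with one algorithm per process.

```python
import subprocess
import sys

# Template mirroring the reproduction script, parameterized by algorithm name.
# Doubled braces are literal braces after str.format().
SCRIPT_TEMPLATE = """
from marllib import marl

env = marl.make_env("gymnasium_mpe", "simple_spread")
algo = marl.algos.{algo_name}(hyperparam_source="common")
model = marl.build_model(env, algo, {{"core_arch": "mlp"}})
algo.fit(env, model, stop={{"timesteps_total": 1000}})
"""


def make_script(algo_name: str) -> str:
    """Return the training script for one algorithm (hypothetical helper)."""
    return SCRIPT_TEMPLATE.format(algo_name=algo_name)


def run_isolated(script: str) -> int:
    """Run a script in a fresh Python interpreter and return its exit code."""
    result = subprocess.run([sys.executable, "-c", script], check=False)
    return result.returncode


if __name__ == "__main__":
    # Each algorithm gets its own interpreter, so nothing carries over
    # from the HAPPO run into the HATRPO run.
    for name in ("happo", "hatrpo"):
        code = run_isolated(make_script(name))
        print(f"{name} exited with code {code}")
```

This only sidesteps the problem. It does not explain why the second run fails when both runs share one process.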
