❓ Question
Hi,
During training in a custom environment with MaskablePPO, the reward decreased and then converged. Is there a specific reason for this? Does it mean the algorithm found a better policy but is now outputting a different one?
My environment has two normalized rewards that are combined as a weighted sum to give the final reward. I have 19 timesteps and my gamma was set to 0.001.
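To make the reward setup concrete, here is a minimal sketch of the weighted sum. The function name `combined_reward` and the default weights of 0.5 are placeholders for illustration only, not the actual values from my environment:

```python
def combined_reward(r_a, r_b, w_a=0.5, w_b=0.5):
    """Weighted sum of two normalized reward terms.

    r_a, r_b: the two normalized rewards (assumed to lie in [0, 1]).
    w_a, w_b: hypothetical weights; the real values are not shown here.
    """
    return w_a * r_a + w_b * r_b


# Example: with equal weights, the final reward is the mean of the two terms.
final_reward = combined_reward(0.2, 0.8)
```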
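import gym
from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
from sb3_contrib.common.wrappers import ActionMasker
class customenv(gym.Env):
    ...  # environment definition omitted
env = customenv()
env = ActionMasker(env, mask_fn)  # mask_fn returns the valid-action mask (definition omitted)
model = MaskablePPO(MaskableActorCriticPolicy, env, gamma = 0.0001, verbose=0)
model.learn(4000000)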
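For context on the gamma setting, the sketch below shows how much weight each of the 19 steps gets in the discounted return when gamma is 0.001. It is plain Python and independent of MaskablePPO; the helper names `discount_weights` and `discounted_return` are made up for illustration:

```python
def discount_weights(gamma, n_steps):
    """Return the per-step discount factors [gamma**0, ..., gamma**(n_steps - 1)]."""
    return [gamma ** t for t in range(n_steps)]


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode."""
    weights = discount_weights(gamma, len(rewards))
    return sum(w * r for w, r in zip(weights, rewards))


weights = discount_weights(0.001, 19)
# Fraction of the total discount weight that falls on the immediate reward.
immediate_share = weights[0] / sum(weights)
```

With such a small gamma, almost all of the weight sits on the immediate reward, so the return the agent optimizes is close to the single-step reward rather than the episode total.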
Thank you!
Checklist