CVHvn / Mario_PPO_RND

Playing Super Mario Bros with Proximal Policy Optimization (PPO) and Random Network Distillation (RND)

Question about PPO #1

Closed asyua-ye closed 1 week ago

asyua-ye commented 1 week ago

Hi, I’ve been using PPO to train an agent, and I noticed that the agent’s performance fluctuates even when it seems to have found an optimal policy. Specifically, after 200 episodes, the success rate doesn’t stabilize at 100%. I thought reinforcement learning methods should eventually converge to an optimal policy. Previously, I got a relatively good agent using checkpoints, but I’m curious if this could be an issue with my implementation or hyperparameter settings. When you train with PPO, does your agent eventually stabilize at the optimal policy?

Thanks!

CVHvn commented 1 week ago

Unfortunately, I use early stopping (I stop training) as soon as my agent wins the game in test mode, because Mario has deterministic state transitions: in a given state, an action always yields the same next state, not a random next state with some probability. (Sometimes this environment bugs out, but we don't need to worry about that :( )

If you are talking about the success rate when testing (without any random or distribution sampling): because the transitions are deterministic, if your agent wins once, it will always win (100% win rate).
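
For illustration, a rough sketch of a greedy test rollout (hypothetical code, not my actual test loop; it assumes the model returns `(logits, value)` and the old gym step API). With a deterministic environment and argmax actions, the rollout is fully reproducible, so one win implies a 100% test win rate:

```python
import torch

def test_episode(env, model, device="cpu"):
    """Hypothetical greedy test rollout: deterministic env + argmax policy."""
    state = env.reset()
    done, total_reward, info = False, 0.0, {}
    while not done:
        obs = torch.as_tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
        with torch.no_grad():
            logits, _value = model(obs)        # assumes model returns (logits, value)
        action = logits.argmax(dim=-1).item()  # greedy: no sampling anywhere
        state, reward, done, info = env.step(action)  # old gym API assumed
        total_reward += reward
    # gym-super-mario-bros reports a win via info["flag_get"]
    return info.get("flag_get", False), total_reward
```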

If you are talking about the success rate during training: with my current knowledge, there are several reasons why the training success rate does not reach 100%. The most basic one is that PPO samples actions from the policy distribution during rollouts, so even a near-optimal policy sometimes takes suboptimal actions (see the sketch below).
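
A minimal sketch of that sampling difference (hypothetical helper, assuming a discrete action head):

```python
import torch
from torch.distributions import Categorical

def select_action(logits, deterministic=False):
    # Hypothetical helper for a discrete action head.
    # Training: sample from the policy -> success rate fluctuates.
    # Testing: argmax removes the sampling noise entirely.
    if deterministic:
        return logits.argmax(dim=-1)
    return Categorical(logits=logits).sample()
```

For intuition: a policy that puts 99% probability on the right action at each of roughly 300 steps completes a flawless run only about 0.99^300 ≈ 5% of the time if any single deviation is fatal, so sampled (training) success rates can stay well below greedy (test) performance.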

asyua-ye commented 1 week ago

Thank you for your reply. I modified the action space, which caused my agent to stop consistently selecting the optimal path during testing, and this has been troubling me. I will run some experiments to see how I can resolve it. Perhaps I need a method that ensures the agent selects the best action during testing instead of sampling. Thank you again for your response.

CVHvn commented 1 week ago

I think you only need to set deterministic = True when testing (set deterministic = True in the get_action, select_action, maybe_evaluate_and_print, and toTest functions). Then your agent will not sample from the binary distribution; it will just choose each action whose probability exceeds the threshold (prob > 0.5).
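
Something like this (a rough sketch with a hypothetical signature; I'm assuming a multi-binary action head with per-button sigmoid probabilities, so adapt it to your actual get_action/select_action):

```python
import torch
from torch.distributions import Bernoulli

def get_action(probs, deterministic=False):
    # Hypothetical signature; `probs` = per-button sigmoid outputs
    # of a multi-binary action head.
    if deterministic:
        # Test mode: press each button iff prob > 0.5 (no randomness).
        return (probs > 0.5).float()
    # Training mode: sample each button independently for exploration.
    return Bernoulli(probs=probs).sample()
```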