CVHvn / Mario_PPO_RND

Playing Super Mario Bros with Proximal Policy Optimization (PPO) and Random Network Distillation (RND)

Question about PPO #1

Closed asyua-ye closed 1 week ago

asyua-ye commented 1 week ago

Hi, I’ve been using PPO to train an agent, and I noticed that the agent’s performance fluctuates even when it seems to have found an optimal policy. Specifically, after 200 episodes, the success rate doesn’t stabilize at 100%. I thought reinforcement learning methods should eventually converge to an optimal policy. Previously, I got a relatively good agent using checkpoints, but I’m curious if this could be an issue with my implementation or hyperparameter settings. When you train with PPO, does your agent eventually stabilize at the optimal policy?

Thanks!

CVHvn commented 1 week ago

Unfortunately, I use early stopping (I stop training) as soon as my agent wins the game in test mode, because Mario has deterministic state transitions: in a given state, an action always yields the same next state, not a random next state with some probability. (Sometimes this environment bugs out, but we don't need to worry about that :( )

If you are talking about the success rate when testing (without any random or distribution sampling): because the transitions are deterministic, if your agent wins once, it will always win (100% win rate).
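
For illustration, a rough sketch of a greedy test rollout (hypothetical code, not my actual test loop; it assumes the model returns `(logits, value)` and the old gym step API). With a deterministic environment and argmax actions, the rollout is fully reproducible, so one win implies a 100% test win rate:

```python
import torch

def test_episode(env, model, device="cpu"):
    """Hypothetical greedy test rollout: deterministic env + argmax policy."""
    state = env.reset()
    done, total_reward, info = False, 0.0, {}
    while not done:
        obs = torch.as_tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
        with torch.no_grad():
            logits, _value = model(obs)        # assumes model returns (logits, value)
        action = logits.argmax(dim=-1).item()  # greedy: no sampling anywhere
        state, reward, done, info = env.step(action)  # old gym API assumed
        total_reward += reward
    # gym-super-mario-bros reports a win via info["flag_get"]
    return info.get("flag_get", False), total_reward
```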

If you are talking about the success rate during training: with my current knowledge, there are several reasons why the training success rate does not reach 100%. The most basic one is that PPO samples actions from the policy distribution during rollouts, so even a near-optimal policy sometimes takes suboptimal actions (see the sketch below).
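
A minimal sketch of that sampling difference (hypothetical helper, assuming a discrete action head):

```python
import torch
from torch.distributions import Categorical

def select_action(logits, deterministic=False):
    # Hypothetical helper for a discrete action head.
    # Training: sample from the policy -> success rate fluctuates.
    # Testing: argmax removes the sampling noise entirely.
    if deterministic:
        return logits.argmax(dim=-1)
    return Categorical(logits=logits).sample()
```

For intuition: a policy that puts 99% probability on the right action at each of roughly 300 steps completes a flawless run only about 0.99^300 ≈ 5% of the time if any single deviation is fatal, so sampled (training) success rates can stay well below greedy (test) performance.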

asyua-ye commented 1 week ago

Thank you for your reply. I modified the action space, which caused my agent to stop consistently selecting the optimal path during testing, and this has been troubling me. I will run some experiments to see how I can resolve it. Perhaps I need a method that ensures the agent selects the best action during testing instead of sampling. Thank you again for your response.

CVHvn commented 1 week ago

I think you only need to set deterministic = True when testing (set deterministic = True in the get_action, select_action, maybe_evaluate_and_print, and toTest functions). Then your agent will not sample from the binary distribution; it will just choose each action whose probability exceeds the threshold (prob > 0.5).
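
Something like this (a rough sketch with a hypothetical signature; I'm assuming a multi-binary action head with per-button sigmoid probabilities, so adapt it to your actual get_action/select_action):

```python
import torch
from torch.distributions import Bernoulli

def get_action(probs, deterministic=False):
    # Hypothetical signature; `probs` = per-button sigmoid outputs
    # of a multi-binary action head.
    if deterministic:
        # Test mode: press each button iff prob > 0.5 (no randomness).
        return (probs > 0.5).float()
    # Training mode: sample each button independently for exploration.
    return Bernoulli(probs=probs).sample()
```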