adik993 / ppo-pytorch

Proximal Policy Optimization (PPO) with Intrinsic Curiosity Module (ICM)

How can I get your result in tensorboard without early ending? #2

Open GodZarathustra opened 5 years ago

GodZarathustra commented 5 years ago

I tried your script in the MountainCar env and it seems that the game ends when the step length reaches 200 per episode, but in your TensorBoard plots, an episode didn't stop until it reached the final state (the top of the mountain). I wonder if there is any early ending mechanism in your code, but unfortunately I didn't find it. Could you give me some advice on how to reproduce the TensorBoard results you published?

adik993 commented 5 years ago

Hello @GodZarathustra There are a few things happening during training. I use the MultiEnv class to run multiple environments at the same time in different processes. Those environments don't stop when the environment is solved or the timestep limit is hit; they continue for the number of steps defined in Agent.learn/Agent.eval. In run_mountain_car.py it's set to 256. After all the environments have made that many steps, the epoch is considered finished, so in the case of run_mountain_car.py, since we run 16 parallel environments via MultiEnv and we run for n_steps=256, we make 256*16=4096 steps on the environments per epoch in total.
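For reference, here is a minimal sketch of that step accounting (this is not the repo's MultiEnv, which steps the environments in separate processes; it only shows where the 256*16=4096 figure comes from, and the random action is a stand-in for the PPO policy):

```python
# Sketch only, assuming the classic gym API: 16 environments each stepped
# for exactly n_steps=256 yield 256 * 16 = 4096 transitions per epoch,
# regardless of where individual episodes end.
import gym

n_envs, n_steps = 16, 256
envs = [gym.make('MountainCar-v0') for _ in range(n_envs)]
states = [env.reset() for env in envs]

transitions = 0
for _ in range(n_steps):                    # every env takes exactly n_steps
    for i, env in enumerate(envs):
        action = env.action_space.sample()  # stand-in for the PPO policy
        next_state, reward, done, _ = env.step(action)
        transitions += 1
        # a finished episode does not stop the rollout: reset and keep going
        states[i] = env.reset() if done else next_state

print(transitions)  # 4096 steps collected for this epoch
```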

In TensorBoard there are a few metrics, each reported at a different interval (for example, env/reward is updated every time an environment finishes an episode, whether by reaching the goal or hitting the time limit).

With that said, taking your questions one by one:

> I tried your script in the MountainCar env and it seems that the game ends when the step length reaches 200 per episode

Only partially true. As I mentioned, each environment runs for exactly 256 steps, so if the agent didn't solve the environment and hit the limit of 200, it will run the next episode for another 56 steps and then stop. That episode then continues during the next round of 256 steps in the next epoch, of course.
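A small sketch of what that looks like for a single environment inside one 256-step rollout (assuming the classic gym API, not the repo's actual code):

```python
# Sketch: MountainCar-v0's time limit ends the episode after 200 steps, the
# env is reset, and the next episode runs for the remaining 56 steps of the
# rollout; it only finishes during the next epoch's rollout.
import gym

env = gym.make('MountainCar-v0')   # registered with a 200-step time limit
env.reset()

episode_len = 0
for step in range(256):            # n_steps used by run_mountain_car.py
    _, _, done, _ = env.step(env.action_space.sample())
    episode_len += 1
    if done:                       # hit the 200-step limit (or reached the flag)
        print(f'episode of {episode_len} steps ended at rollout step {step + 1}')
        env.reset()
        episode_len = 0

print(f'{episode_len} steps of the next episode carry over into the next epoch')
```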

> but in your TensorBoard plots, an episode didn't stop until it reached the final state (the top of the mountain).

If you are referring to env/reward then, as I mentioned, all the environments report to it as they solve the environment or hit its limit, so the plot should start around -200 and slowly grow towards -110 as the agents learn.
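Conceptually the reporting works like the sketch below (this is not the repo's exact logging code, just the idea of adding the episode return to an env/reward scalar whenever an episode finishes; on_step is a hypothetical helper):

```python
# Sketch: each finished episode contributes one point to env/reward, so the
# MountainCar curve starts around -200 and climbs towards -110 over training.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
episode_return = 0.0
finished_episodes = 0

def on_step(reward, done):
    """Hypothetical helper: call after every environment step."""
    global episode_return, finished_episodes
    episode_return += reward
    if done:
        writer.add_scalar('env/reward', episode_return, finished_episodes)
        finished_episodes += 1
        episode_return = 0.0
```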

> I wonder if there is any early ending mechanism in your code, but unfortunately I didn't find it.

Unfortunately there is no early stopping, at least not yet :slightly_smiling_face:. Training just runs for the defined number of epochs: `agent.learn(epochs=50, n_steps=256)`

> Could you give me some advice on how to reproduce the TensorBoard results you published?

Sorry, I didn't quite get this question :disappointed:, but if you are referring to the results from the README.md, they were taken from the Pendulum-v0 env.

I hope this helps. If you have any more questions, or I totally missed your point, do not hesitate to ask me :smiley:

GodZarathustra commented 5 years ago

You cleared up my confusion between the two concepts, the update timestep limit and the episode timestep limit; thanks for your kind and detailed explanation. I just misunderstood the figure in the README and took it as MountainCar, haha :) So it's not because of the number of steps you run in each env, but because of the timestep limit set in the MountainCar env itself, and it can be defined here.
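A minimal sketch of that, assuming the classic gym API: MountainCar-v0 is registered with max_episode_steps=200, which is what ends each episode after 200 steps, and you can wrap the raw environment with a different limit if you want to experiment:

```python
# Sketch: the 200-step cap comes from MountainCar-v0's registration, not from
# the rollout length; re-wrapping the unwrapped env changes that cap.
import gym
from gym.wrappers import TimeLimit

default_env = gym.make('MountainCar-v0')            # episodes capped at 200 steps
longer_env = TimeLimit(gym.make('MountainCar-v0').unwrapped,
                       max_episode_steps=1000)      # episodes capped at 1000 steps
```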

adik993 commented 5 years ago

I'm glad I could help 🙂