GodZarathustra opened 5 years ago:

I tried your script in the MountainCar env and it seems that the game ends when the step length reaches 200 per episode, but in your TensorBoard plots an episode didn't stop until it reached the final state (the top of the mountain). I wonder if it's because there is an early ending mechanism in your code, but unfortunately I didn't find it. Could you give me some advice on how to get the TensorBoard results you published?
Hello @GodZarathustra
There are a few things happening during training. I use the `MultiEnv` class to run multiple environments at the same time in different processes. Those environments don't stop when the environment is solved or the timestep limit is hit; they continue for the number of steps defined in `Agent.learn`/`Agent.eval`. In `run_mountain_car.py` it's set to `256`. After all the environments have made that many steps, the epoch is considered finished, so in the case of `run_mountain_car.py`, since we run `16` parallel environments via `MultiEnv` with `n_steps=256`, we make `256 * 16 = 4096` steps on the environments in total per epoch.
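Spelled out (just the numbers above; the variable names are mine for illustration):

```python
n_envs = 16    # parallel environments run by MultiEnv
n_steps = 256  # steps each environment makes per epoch (Agent.learn)
epochs = 50    # from agent.learn(epochs=50, n_steps=256)

steps_per_epoch = n_envs * n_steps          # 16 * 256 = 4096
total_env_steps = steps_per_epoch * epochs  # 4096 * 50 = 204800
```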
In TensorBoard there are a few metrics that are reported at different intervals:

- `env/reward` is reported in `MultiEnv._report_steps`, and it reports the reward at game end, that is, either when the timestep limit is hit (`200` in the case of `MountainCar-v0`) or when the environment is solved (the car reached the top). For this metric you will probably see around 1k values recorded, because we have `16` envs running `256` steps for `50` epochs; if you divide that total by the average number of timesteps the agent took to solve the env, you get the 1k number :smile: (see the sketch right after this list).
- `icm/reward` is reported in `Agent.learn` at each episode end, so it's going to be `50 * n_optimization_epochs`, and for each minibatch.
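To make the 1k estimate concrete (the ~200-step average episode length is an assumption on my part; early in training almost every episode runs to the limit):

```python
n_envs, n_steps, epochs = 16, 256, 50
total_env_steps = n_envs * n_steps * epochs  # 204800 steps overall

# env/reward logs one value per finished episode, and a MountainCar-v0
# episode lasts at most 200 steps, so with an average episode length
# of roughly 200 steps:
avg_episode_len = 200                                 # assumed average
n_reward_points = total_env_steps // avg_episode_len  # 1024, i.e. ~1k
```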
Having that said:

> I tried your script in the MountainCar env and it seems that the game ends when the step length reaches 200 per episode
Only partially true. As I mentioned, each environment runs for exactly `256` steps, so if the agent didn't solve the environment and hit the limit of `200`, it will run the next episode for another `56` steps and then stop. The next round of `256` steps happens in the next epoch, of course.
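A minimal sketch of that fixed-horizon rollout using the classic gym API (my illustration of the mechanism, not the repo's `MultiEnv` code, and with random actions in place of the policy):

```python
import gym

env = gym.make("MountainCar-v0")
obs = env.reset()

# One epoch for one worker: exactly 256 steps, no matter how many
# episodes end in between.
for step in range(256):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        # Limit of 200 hit, or the car reached the top: start a new
        # episode and keep stepping (e.g. 200 steps + 56 steps here);
        # the unfinished episode resumes in the next epoch.
        obs = env.reset()
```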
> but in your TensorBoard plots, an episode didn't stop until it reached the final state (the top of the mountain).
If you refer to `env/reward` then, as I mentioned, all the environments report to it as they solve/hit the limit of the environment, so the plot should be around `-200` at the beginning and slowly grow towards `-110` as the agents learn.
> I wonder if it's because there is an early ending mechanism in your code, but unfortunately I didn't find it.
Unfortunately there is no early stopping, at least not yet :slightly_smiling_face:. It just runs for the defined number of epochs: `agent.learn(epochs=50, n_steps=256)`.
> Could you give me some advice on how to get the TensorBoard results you published?
Sorry, I didn't get this question :disappointed:, but if you are referring to the results from the README.md, those were taken from the `Pendulum-v0` env.
I hope it helped. If you have any more questions or I totally missed your point, do not hesitate to ask me :smiley:
You solved my confusion about the two concepts, the update timestep limit and the episode timestep limit; thanks for your kind & detailed explanation. I just misunderstood the figure in the README and took it as MountainCar.. haha : ) So it's not because of the number of steps you run in each env, but because of the timestep limit set in the MountainCar env itself, and it can be defined here.
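For reference, that limit comes from Gym's registry entry for MountainCar-v0, which looks roughly like this (paraphrased from gym's classic-control registration, shown for illustration only since the id is already registered when gym is imported):

```python
from gym.envs.registration import register

# max_episode_steps is the 200-step limit discussed above, and
# reward_threshold is the -110 score the reward plot climbs towards.
register(
    id="MountainCar-v0",
    entry_point="gym.envs.classic_control:MountainCarEnv",
    max_episode_steps=200,
    reward_threshold=-110.0,
)
```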
I'm glad I could help 🙂