cwbeitel opened 6 years ago
This is great.
Were you able to train a model that demonstrably works well? How long did the model take to train?
For TensorBoard, what is the TFJob manifest?
A couple of things:
Why are you piping the output of ks to kubectl?
ks show gke -c agents-ppo | kubectl create -f -
We'd like the example to be runnable from JupyterHub deployed on Kubeflow. Here are a few issues I think we will need to address, but these can be postponed until after we get a model working:
We need a solution for building the Docker images from Kubeflow.
Configuring ks & kubectl to work in-cluster might be an issue.
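For the in-cluster piece, one cheap sanity check could be run from the notebook itself before wiring up ks/kubectl. This is only a sketch and assumes the kubernetes Python client is installed in the notebook image, which isn't part of the example yet:

from kubernetes import client, config

try:
    # Inside a pod this picks up the mounted service-account token.
    config.load_incluster_config()
except config.ConfigException:
    # Fall back to ~/.kube/config when running outside the cluster.
    config.load_kube_config()

v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(namespace='kubeflow')
print([p.metadata.name for p in pods.items])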
I have a model training now with the full set of parameters @danijar suggested initially.
Committed a job YAML that didn't launch a TensorBoard deployment.
Not sure why yet, but ks apply gke -c agents-ppo wasn't creating a TFJob, whereas I could take the YAML produced by ks show gke -c agents-ppo, pass it to kubectl create -f -, and it would work fine. So that's there as a stopgap until I figure out why ks apply gke -c agents-ppo alone isn't working.
Acknowledged on the value of running from JupyterHub on Kubeflow and the issues that come with that.
So rendering is working now, but the duration of the render is very short; see example. I expected that increasing the max_length parameter would increase the episode length, given https://github.com/tensorflow/agents/blob/7d475974b417a7827bfb06dac1d094278815c1f7/agents/scripts/visualize.py#L54; maybe @danijar has a suggestion? I let training run with his provided parameters until it plateaued.
Currently the episode length is set to 1000 for training and 10000 during eval, but I'm wondering whether the episode length needs to be the same for both? Restarted training with an episode length of 10k, but this is obviously far slower.
When I trained PPO on this environment, it reached a score of >2000 after 2.5 M steps. Here is a video of the evaluation. But with a score of ~600, the ant should at least walk slowly for a while.
1000 steps is the correct maximum length of the episode. The control frequency is 20 Hz, resulting in an episode of 50 seconds. If the ant gets into a bad position, for example flips over, the episode terminates early.
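To spell out the arithmetic on episode duration (including the 10000-step eval length from the earlier comment, which explains why those runs feel so much slower):

control_frequency_hz = 20            # Bullet Ant control frequency
train_episode_steps = 1000           # episode length used for training
eval_episode_steps = 10000           # episode length used for eval above
print(train_episode_steps / float(control_frequency_hz))   # 50.0 seconds per full episode
print(eval_episode_steps / float(control_frequency_hz))    # 500.0 seconds, roughly 8.3 minutes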
In rare cases, I have experienced that Bullet's Ant environment initializes the ant in a position that almost immediately terminates the episode. Could you render a few more videos of the trained agent to see whether this was the case here?
Put a bunch more in gs://kubeflow-rl-k8s/renders, but they're all that short. Started runs with those original parameters; I'll let them run out to 2.5M steps.
Unless there's somewhere to tune the control frequency, I'd be curious whether the version of the rendering-related libraries might be the issue. Can you list the versions of all the agents dependencies (including ffmpeg and libav-tools) as well as your OS? (I'll note that I'm curious about this because ...) It might be good to pin the dependencies in setup.py to specific versions that are known to work correctly. It could also be a change in gym.
Oh, I was in the process of saying I'm curious about the versions of ffmpeg and libav-tools because, if you look closely at that first render, it has some visual artifacts (horizontal lines near the top)...
The horizontal line near the top stems from the running track in the environment, which is far away. One problem could come from rendering multiple episodes in parallel. Are you doing that? Gym's rendering wrapper doesn't cope with concurrent access well. To verify that your dependencies work correctly, you could try:
import gym
import pybullet_envs  # registers the Bullet environments with gym

env = gym.make('AntBulletEnv-v0')
# Record a video of every episode to the output directory.
env = gym.wrappers.Monitor(env, '/path/to/outdir', lambda _: True)
done = False
env.reset()
while not done:
    _, _, done, _ = env.step(env.action_space.sample())
Ah I see, thank you! One of the issues was that in the visualize call the logdir was being set to a subdirectory of the logdir, which didn't contain any checkpoints 😄
Now it is able to generate videos of 16+ seconds, e.g. these (more if increasing max_length). What looks like it's happening is that the agent gets stuck and barely moves. But I'm guessing it just hasn't trained long enough.
Below are TensorBoard scalar plots for training runs that ran for over 24h and both reached over 1.2M train-phase steps. One was trained with 10 agents and the other with 30, with pink and green being the train and eval curves for the 30-agent case, respectively.
I need a minute to get up to speed on interpreting the plots, but on first take it looks problematic that memory_size bottomed out in the 10-agent case; I don't know why it's so variable in the 30-agent case; and performance is increasing more stably in the 30-agent case. It would be cool to try with more than 30 agents on the rollouts (correct language?) and to experiment with other network parameters, as well as to train numerous independent replicas and pick the best. But it would be super cool to have a more intelligent way to optimize hyperparameters.
There have been some issues with runs getting evicted due to memory conflicts (including the 30-agent case above); we should be able to do long runs more reliably in the future by specifying resource requests and limits.
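As a sketch of what those requests and limits might look like, here's the equivalent expressed with the kubernetes Python client (the actual jobs are templated with ksonnet, so the same fields would go under the container's resources block; the numbers are placeholders, not recommendations):

from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={'cpu': '8', 'memory': '16Gi'},   # scheduler reserves at least this much
    limits={'cpu': '16', 'memory': '32Gi'},    # CPU is throttled and memory OOM-killed beyond this
)
print(resources)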
This doesn't look too bad. The mean reward (normalize_reward/summary/cond/mean) is continually growing, which means that the agent is training and hasn't converged yet. For me, it took 2.5M steps (which also varies a bit between runs), and the plots you show are at 1.6M steps so far.
However, I am surprised that this progress took 24 hours to train. How many CPU cores does the server have? The most important parameter for maximum throughput is num_agents. Could you try out some other environments, for example pendulum and cheetah, to see if this is caused by the environment or the agent, please?
The memory size should incrementally grow from 0 to update_every, after which the agent trains on the data and clears the memory. The more num_agents, the faster the memory is filled; it is quite possible that the logging frequency is too low to actually see the memory growing in the 30-agent case.
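A toy simulation of that logging effect (none of these numbers come from the agents code; they're purely for intuition): if the memory gains roughly num_agents episodes between summary writes and is cleared once it reaches update_every, then with 30 agents the logged memory_size can sit at zero even though it is cycling:

def logged_memory_sizes(num_agents, update_every=30, ticks=6):
    samples, memory = [], 0
    for _ in range(ticks):
        memory += num_agents            # episodes finished since the last summary write
        while memory >= update_every:   # a full batch triggers an update and clears the memory
            memory -= update_every
        samples.append(memory)          # the value the summary writer would record
    return samples

print(logged_memory_sizes(num_agents=10))  # [10, 20, 0, 10, 20, 0]
print(logged_memory_sizes(num_agents=30))  # [0, 0, 0, 0, 0, 0]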
@cwbeitel How big are the nodes in your cluster? It looks like you are using 8-CPU machines, which seems pretty wimpy.
Yeah I can change over to larger machines. I'll also switch on the autoscaler.
And yeah I can try some other environments as well, looks like those are included in pybullet.
Ah I see that makes sense about logging frequency vs. num_agents.
Changed over to n1-standard-32 nodes, enabled the autoscaler, and started runs of HalfCheetahBulletEnv-v0, InvertedPendulumBulletEnv-v0, KukaBulletEnv-v0, and a continuation of AntBulletEnv-v0 in gs://kubeflow-rl-k8s/jobs/20180112T052347-pybullet_ant_ff, all with num_agents=30. The latter covered 93k train steps in 43min, so I'm guessing the earlier delay was a combination of thrashing from 30 parallel envs on 8-core machines and memory evictions caused by the tasks themselves as well as by co-occupants.
Great to hear that you are using more cores now. 100k steps in 40 mins still sounds very slow to me though. On my local workstation with 6 CPU cores, HalfCheetah-v1 (the MuJoCo version which is faster, not the Bullet version) trains for 10M frames (20M for train and eval combined) in one hour, which is about 5500 steps per second.
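For a rough comparison of those two throughput figures (keeping in mind they aren't measured identically: the 93k is train-phase steps only, while the 20M counts train and eval frames combined):

cluster_steps_per_sec = 93000 / (43 * 60.0)   # ~36 steps/second on the cluster
workstation_steps_per_sec = 20e6 / 3600.0     # ~5555 steps/second on 6 local cores
print(cluster_steps_per_sec, workstation_steps_per_sec)
print(workstation_steps_per_sec / cluster_steps_per_sec)  # roughly a 150x gap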
Oh sweet that's pretty fast! 🏎🏎🏎
About the difference in training speed:
mean_score for the four envs over about 8h:
Observations and thoughts:
Renders are here:
Hey, I don't know whether people actually see comments in their notifications unless they're mentioned, so pinging @danijar @jlewi; otherwise no rush in responding ✌️
Simplified the demo notebook, with the presumption for now that (1) users will build containers elsewhere and (2) rendering will happen as part of a training job (which it doesn't yet) by running with the train_and_render --mode.
Logging to a different server might well be the cause of the slowdown. You could verify that by starting a run without specifying a logdir, so that no summaries or checkpoints are written.
Sounds good, I'll try that. One thing to note is that it looks like the count of global steps/second increases proportionally to the number of agents being used for rollouts, so if our num_agents params are different we'd see a different step rate.
In which case we'd also end up running eval and writing logs overly frequently. That might be, or also be, the issue.
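A rough sketch of that effect (all numbers are illustrative, and the trigger condition is an assumption about how eval/logging is scheduled): if eval fires every fixed number of global environment steps, then doubling num_agents doubles how often it fires per hour of wall-clock time.

def evals_per_hour(num_agents, steps_per_agent_per_sec=2.0, eval_every_steps=10000):
    global_steps_per_sec = num_agents * steps_per_agent_per_sec
    return global_steps_per_sec * 3600 / eval_every_steps

print(evals_per_hour(num_agents=15))   # 10.8 evals per hour
print(evals_per_hour(num_agents=30))   # 21.6 evals per hour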
This version of the demo notebook illustrates the doubling of the step rate with doubling num_agents (as well as the "not fetchable" error).
One thing to have in mind when setting the number of agents is that the algorithm will not use more agents than episodes needed for an update. For most environments, the number of episodes per update is update_every = 30. Ideally, the number of agents should divide this number to not discard the remaining data, e.g. 10, 15, or 30. You can also increase update_every, which will result in more stable but less data-efficient updates. This might allow scaling higher and using a larger learning rate (or the K-FAC optimizer!) to still benefit from the large batches.
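To see the divisibility point concretely (this assumes agents finish episodes in synchronized batches of num_agents, which is a simplification of the actual rollout behavior):

import math

def discarded_episodes(num_agents, update_every=30):
    # Rounds of episodes needed before at least update_every have been collected.
    batches = int(math.ceil(update_every / float(num_agents)))
    collected = batches * num_agents
    return collected - update_every   # episodes beyond what the update consumes

for n in (10, 15, 25, 30):
    print(n, discarded_episodes(n))   # 10 -> 0, 15 -> 0, 25 -> 20, 30 -> 0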
Regarding the short video clips, did this problem occur for all environments? I just had the same problems when trying to render videos for MuJoCo's Humanoid-v1 and AntBulletEnv-v0. It might be something broken in the newest version of Gym?
The problem resolved when specifying the correct log_dir (one containing model checkpoints). I think it came from attempting to render without any model parameters: ffmpeg would start writing the stream produced by the environment before the run crashed due to there being no checkpoints in the provided logdir?
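A cheap guard that might have surfaced this earlier (tf.train.latest_checkpoint returns None for a directory with no checkpoints; the path below is just a placeholder):

import tensorflow as tf

logdir = '/tmp/agents-ppo/logdir'   # hypothetical log directory for the render job
checkpoint = tf.train.latest_checkpoint(logdir)
if checkpoint is None:
    raise ValueError('No checkpoints found under {}, refusing to render.'.format(logdir))
print('Rendering from', checkpoint)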
I'm getting the problem even for valid log directories; filed https://github.com/openai/gym/issues/832.
Current draft of the notebook is here
Per @jlewi's suggestion, using jsonnet to template TFJobs instead of jinja2.
Depends on being able to render, which worked previously but is currently failing repeatedly with the following error: