cwbeitel / kubeflow-rl

Demonstrations of RL on kubeflow
Apache License 2.0

Demonstrate the agents_ppo example using a Jupyter notebook #2

Open cwbeitel opened 6 years ago

cwbeitel commented 6 years ago

Current draft of the notebook is here

Per @jlewi's suggestion, using jsonnet to template TfJobs instead of jinja2.

This depends on being able to render, which worked previously but is currently failing repeatedly with the following error:

2018-01-10 01:13:56.266849: I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation 
failed and will be automatically retried in 32.8255 seconds (attempt 8 out of 10), caused by: 
Unavailable: Error executing an HTTP request (HTTP response code 0, error code 42, error message 
'Callback aborted')
jlewi commented 6 years ago

This is great.

Were you able to train a good model, as demonstrated by the trained agent performing well? How long did the model take to train?

For TensorBoard, what is the TfJob manifest?

A couple of things:

Why are you piping the output of ks to kubectl?

ks show gke -c agents-ppo | kubectl create -f -

We'd like the example to be runnable from JupyterHub deployed on Kubeflow. Here's a bunch of issues I think we will need to address, but these can be postponed until after we get a model working.

cwbeitel commented 6 years ago

I have a model training now with the full set of parameters @danijar suggested initially.

Committed a job YAML that didn't launch a TensorBoard deployment.

Not sure why yet, but ks apply gke -c agents-ppo wasn't creating a TfJob, whereas taking the YAML produced by ks show gke -c agents-ppo and passing it to kubectl create -f - worked fine. So that pipe is there as a stopgap until I figure out why ks apply gke -c agents-ppo alone isn't working.

Acknowledged on the value of running on JupyterHub on Kubeflow and the issues that come with that.

cwbeitel commented 6 years ago

So rendering is working now, but the duration of the render is very short; see example. I expected that increasing the max_length parameter would increase the episode length, given https://github.com/tensorflow/agents/blob/7d475974b417a7827bfb06dac1d094278815c1f7/agents/scripts/visualize.py#L54; maybe @danijar has a suggestion? I let training run with his provided parameters until it plateaued.

[plateau plot]

Currently the episode length is set to 1000 for training and 10000 during eval, but I'm wondering whether the episode length needs to be the same for both. Restarted training with an episode length of 10k, but this is obviously far slower.

danijar commented 6 years ago

When I trained PPO on this environment, it reached a score of >2000 after 2.5 M steps. Here is a video of the evaluation. But with a score of ~600, the ant should at least walk slowly for a while.

1000 steps is the correct maximum length of the episode. The control frequency is 20 Hz, resulting in an episode of 50 seconds. If the ant gets into a bad position, for example flips over, the episode terminates early.
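A quick back-of-the-envelope check of those numbers (the 20 Hz control rate is from the comment above; the 10000-step eval length is from the earlier comment):

# Arithmetic sketch: wall-clock duration of a full episode at the stated
# control frequency, for the two episode lengths mentioned in this thread.
control_hz = 20
for max_length in (1000, 10000):
  print(max_length, 'steps ->', max_length / control_hz, 'seconds')
# 1000 steps -> 50.0 seconds; 10000 steps -> 500.0 seconds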

In rare cases, I have experienced that Bullet's Ant environment initializes the ant in a position that almost immediately terminates the episode. Could you render a few more videos of the trained agent to see whether this was the case here?

cwbeitel commented 6 years ago

Put a bunch more in gs://kubeflow-rl-k8s/renders, but they're all that short. Started runs with those original parameters that I'll let run out to 2.5M steps.

Unless there's somewhere to tune the control frequency, I'd be curious whether the versions of the rendering-related libraries might be the issue. Can you list the versions of all the agents dependencies (including ffmpeg and libav-tools) as well as your OS? (I'll note that I'm curious about this because ...) It might be good to pin the dependencies in setup.py to specific versions that are known to work correctly. It could also be a change in gym.

cwbeitel commented 6 years ago

Oh, I was in the process of saying I'm curious about the versions of ffmpeg and libav-tools because, if you look closely at that first render, it has some visual artifacts (horizontal lines near the top)...

danijar commented 6 years ago

The horizontal line near the top stems from the running track in the environment, which is far away. One problem could come from rendering multiple episodes in parallel. Are you doing that? Gym's rendering wrapper doesn't cope well with concurrent access. To verify that your dependencies work correctly, you could try:

import gym
import pybullet_envs  # registers the Bullet environments with gym

env = gym.make('AntBulletEnv-v0')
# Record a video of every episode to the given output directory.
env = gym.wrappers.Monitor(env, '/path/to/outdir', lambda _: True)
done = False
env.reset()
while not done:
  # Step with random actions until the episode terminates.
  _, _, done, _ = env.step(env.action_space.sample())
cwbeitel commented 6 years ago

Ah, I see, thank you! One of the issues was that in the visualize call the logdir was being set to a subdirectory of the logdir, which didn't contain any checkpoints 😄

Now it is able to generate videos of 16+ seconds, e.g. these (more if increasing max_length). What looks to be happening is that the agent gets stuck and barely moves, but I'm guessing it just hasn't trained long enough.

Below are TensorBoard scalar plots for training runs that ran for over 24h and both reached over 1.2M train-phase steps. One was trained with 10 agents and the other with 30, with pink and green being the 30-agent case's train and eval curves respectively.

[scalars1, scalars2 TensorBoard screenshots]

I need a minute to get up to speed on interpreting the plots, but on first take: it looks problematic that memory_size bottomed out in the 10-agent case, I don't know why it's so variable in the 30-agent case, and the performance is increasing more stably in the 30-agent case. It would be cool to try more than 30 agents on the rollouts (correct language?) and experiment with other network parameters, as well as to train numerous independent replicas and pick the best. But it would be super cool to have a more intelligent way to optimize hyperparameters.

There have been some issues with runs getting evicted due to memory conflicts (including the 30-agent case above); we should be able to do long runs more reliably in the future by specifying resource requests and limits.
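As a sketch of that idea (the actual change would go in this repo's job manifests, and the values below are placeholders rather than recommendations), this is how a requests/limits stanza looks when built with the official kubernetes Python client:

# Placeholder values; in practice these would be set on the TfJob's container spec.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={'cpu': '8', 'memory': '16Gi'},  # scheduler-guaranteed minimum
    limits={'cpu': '16', 'memory': '32Gi'},   # hard cap for the container
)
print(resources)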

danijar commented 6 years ago

This doesn't look too bad. The mean reward (normalize_reward/summary/cond/mean) is continually growing, which means that the agent trains and hasn't converged yet. For me, it took 2.5 M steps (which also varies a bit between runs), and the plots you show are at 1.6 M steps so far.

However, I am surprised that this progress took 24 hours to train. How many CPU cores does the server have? The most important parameter for maximum throughput is num_agents. Could you try out some other environments, for example pendulum and cheetah, to see if this is caused by the environment or the agent, please?

The memory size should incrementally grow from 0 to update_every, after which the agent trains on the data and clears the memory. The more agents (num_agents), the faster the memory is filled; it is quite possible that the logging frequency is too low to actually see the memory growing in the 30-agent case.
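A toy model of that explanation (the wave-based filling below is a simplification, not the actual agents implementation): with 30 agents the memory jumps straight from 0 to update_every and is cleared, so there are no intermediate values for an infrequent logger to catch.

# Simplified model: episodes finish in waves of num_agents and the memory is
# cleared once it reaches update_every episodes.
update_every = 30

for num_agents in (10, 30):
  memory, trace = 0, []
  for _ in range(6):                  # six waves of finished episodes
    memory += num_agents
    trace.append(memory)
    if memory >= update_every:        # policy update consumes the memory
      memory = 0
  print('num_agents=%d: memory_size over time = %s' % (num_agents, trace))
# num_agents=10: [10, 20, 30, 10, 20, 30]
# num_agents=30: [30, 30, 30, 30, 30, 30]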

jlewi commented 6 years ago

@cwbeitel How big are the nodes in your cluster? It looks like you are using 8-CPU machines, which seems pretty wimpy.

cwbeitel commented 6 years ago

Yeah I can change over to larger machines. I'll also switch on the autoscaler.

And yeah I can try some other environments as well, looks like those are included in pybullet.

Ah I see that makes sense about logging frequency vs. num_agents.

cwbeitel commented 6 years ago

Changed over to n1-standard-32 nodes, enabled the autoscaler, and started runs of HalfCheetahBulletEnv-v0, InvertedPendulumBulletEnv-v0, KukaBulletEnv-v0, and a continuation of AntBulletEnv-v0 in gs://kubeflow-rl-k8s/jobs/20180112T052347-pybullet_ant_ff, all with num_agents=30. The latter covered 93k train steps in 43 min, so I'm guessing the earlier slowness was a combination of thrashing from 30 environments on 8-core machines and memory evictions caused by the tasks themselves as well as their co-occupants.

danijar commented 6 years ago

Great to hear that you are using more cores now. 100k steps in 40 mins still sounds very slow to me though. On my local workstation with 6 CPU cores, HalfCheetah-v1 (the MuJoCo version which is faster, not the Bullet version) trains for 10M frames (20M for train and eval combined) in one hour, which is about 5500 steps per second.
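For comparison, the two throughput figures quoted in this thread work out roughly as follows (note the units differ slightly: one is train-phase steps, the other combined train and eval frames):

# Back-of-the-envelope comparison using the numbers reported above.
cluster_steps_per_sec = 93000 / (43 * 60)      # ~36 steps/s (Ant run on the cluster)
workstation_steps_per_sec = 20000000 / 3600    # ~5555 steps/s (HalfCheetah-v1 locally)
print(round(cluster_steps_per_sec), round(workstation_steps_per_sec))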

cwbeitel commented 6 years ago

Oh sweet that's pretty fast! 🏎🏎🏎

About the difference in training speed

mean_score for the four envs over about 8h (click to enlarge)

[mean_score_various_envs plot]

Observations and thoughts

Renders are here:

cwbeitel commented 6 years ago

Hey, so I don't know if people actually see comments in their notifications unless they're mentioned, so: @danijar @jlewi. Otherwise no rush in responding ✌️

cwbeitel commented 6 years ago

Simplified the demo notebook with the presumption, for now, that (1) users will build containers elsewhere and (2) rendering will happen as part of a training job (which it doesn't yet) by running in the train_and_render --mode.

danijar commented 6 years ago

Logging to a different server might well be the cause of the slowdown. You could verify that by starting a run without specifying a logdir, so that no summaries and checkpoints are written.

cwbeitel commented 6 years ago

Sounds good, I'll try that. One thing to note is that it looks like the global steps/second count increases proportionally with the number of agents being used for rollouts. So if our num_agents params are different, we'd see different step rates.

cwbeitel commented 6 years ago

In which case we'd also end up running eval and writing logs overly frequently. That might be (or also be) the issue.
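A rough sketch of that concern (all numbers below are assumed, purely for illustration): if the global step counter aggregates steps across agents, then any eval or logging interval expressed in global steps is hit more often in wall-clock time as num_agents grows.

# Illustration only; the interval and per-environment throughput are assumed.
eval_every_global_steps = 100000
per_env_steps_per_sec = 50

for num_agents in (10, 30):
  global_steps_per_sec = per_env_steps_per_sec * num_agents
  seconds_between_evals = eval_every_global_steps / global_steps_per_sec
  print('num_agents=%d: eval roughly every %.0f s of wall clock' % (num_agents, seconds_between_evals))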

cwbeitel commented 6 years ago

This version of the demo notebook illustrates the doubling of step rate with doubling num_agents (as well as the "not fetchable" error).

danijar commented 6 years ago

One thing to keep in mind when setting the number of agents is that the algorithm will not use more agents than the number of episodes needed for an update. For most environments, the number of episodes per update is update_every = 30. Ideally, the number of agents should divide this number so as not to discard the remaining data, e.g. 10, 15, or 30. You can also increase update_every, which will result in more stable but less data-efficient updates. This might allow scaling higher and using a larger learning rate (or the K-FAC optimizer!) to still benefit from the large batches.
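A small sketch of the divisibility point (a simplified model of episode collection, not the actual agents code): episodes finish in waves of num_agents, the update consumes update_every of them, and a non-divisor wastes the remainder.

# Simplified model of how many collected episodes go unused per update.
update_every = 30  # episodes consumed per policy update, as noted above

for num_agents in (10, 12, 15, 30, 40):
  waves = -(-update_every // num_agents)   # ceiling division
  collected = waves * num_agents           # episodes gathered before updating
  discarded = collected - update_every     # leftover episodes not used
  print('num_agents=%d: collected=%d, discarded=%d' % (num_agents, collected, discarded))
# 10 -> 0 discarded, 12 -> 6, 15 -> 0, 30 -> 0, 40 -> 10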

Regarding the short video clips, did this problem occur for all environments? I just had the same problems when trying to render videos for MuJoCo's Humanoid-v1 and AntBulletEnv-v0. It might be something broken in the newest version of Gym?

cwbeitel commented 6 years ago

The problem resolved when specifying the correct log_dir (containing model checkpoints). I think the problem was coming from attempting to render without any model parameters: ffmpeg would start writing the stream being produced by the environment before there was a crash due to there being no checkpoints in the provided logdir.
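A minimal guard one could add before kicking off rendering, assuming a TF 1.x-style checkpoint directory (tf.train.latest_checkpoint returns None when no checkpoint files are present; the path below is hypothetical):

# Fail fast if the logdir holds no checkpoints, instead of letting ffmpeg start
# writing a video stream that is then abandoned when restoring the policy fails.
import tensorflow as tf

logdir = '/path/to/train/logdir'  # hypothetical path; must contain the checkpoints
checkpoint = tf.train.latest_checkpoint(logdir)
if checkpoint is None:
  raise ValueError('No checkpoint found in %s' % logdir)
print('Rendering from checkpoint:', checkpoint)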

danijar commented 6 years ago

I'm getting the problem for valid log directories; created an issue: https://github.com/openai/gym/issues/832.