cwbeitel / kubeflow-rl

Demonstrations of RL on kubeflow
Apache License 2.0

Asynchronous rendering #3

Open jlewi opened 6 years ago

jlewi commented 6 years ago

Do GPUs help with movie rendering?

Is rendering a movie something we could do as a separate K8s job?

cwbeitel commented 6 years ago

Yes it could be done as a separate job.

I don't know whether GPUs will help with rendering. Rendering involves evaluating the model on the training environment and passing the resulting frames to ffmpeg to encode. So the question is whether the bottleneck is in the model eval or in the encoding step; if the former, then yes.
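
As a rough sketch of what that render step involves (hedged: the env and policy objects, frame size, and output path here are illustrative, not the actual agents code):

```python
# Minimal sketch of a render: roll out the policy in the environment and pipe
# raw RGB frames to ffmpeg for encoding. env/policy are stand-ins for the
# actual training environment and model.
import subprocess

import numpy as np

def render_episode(env, policy, out_path="render.mp4", fps=30):
    obs = env.reset()
    frame = env.render(mode="rgb_array")
    height, width, _ = frame.shape
    ffmpeg = subprocess.Popen(
        ["ffmpeg", "-y",
         "-f", "rawvideo", "-pix_fmt", "rgb24",
         "-s", "{}x{}".format(width, height), "-r", str(fps),
         "-i", "-",                       # raw frames arrive on stdin
         "-pix_fmt", "yuv420p", out_path],
        stdin=subprocess.PIPE)
    done = False
    while not done:
        action = policy(obs)              # model eval: the possibly GPU-bound part
        obs, _, done, _ = env.step(action)
        frame = env.render(mode="rgb_array")
        ffmpeg.stdin.write(frame.astype(np.uint8).tobytes())
    ffmpeg.stdin.close()
    ffmpeg.wait()
```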

cwbeitel commented 6 years ago

It's better to have separate train and render jobs because the resources needed for the two may not be the same. Also it may be interesting to repeat renders with different parameters e.g. resolutions.

jlewi commented 6 years ago

Does the render job run at the end of the training job? Is there any way to run it in parallel so we don't need to wait for training to finish?

cwbeitel commented 6 years ago

Ah interesting, that's a great idea. All the render job needs is checkpoints, so renders could be performed as often as each time a checkpoint is written (and, as you're suggesting, in parallel).

One way to do this is to have a single long-running render job that stores the global_step of the last checkpoint it rendered and polls the checkpoint dir for more recent ones. The trade-off is that it occupies the resources needed to render even while it's only polling (unless there's always a new checkpoint ready by the time the previous render finishes, which would probably mean you're writing checkpoints too often).
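
A minimal sketch of that polling loop, assuming TF-style checkpoints and a hypothetical render_checkpoint() helper that does the actual rollout and encode:

```python
# Long-running render job: poll the checkpoint dir and render only checkpoints
# newer than the last one rendered. render_checkpoint is a hypothetical helper.
import re
import time

import tensorflow as tf

def poll_and_render(checkpoint_dir, render_checkpoint, poll_secs=60):
    last_step = -1
    while True:
        ckpt = tf.train.latest_checkpoint(checkpoint_dir)
        if ckpt:
            # Checkpoint paths typically end in the global step, e.g. model.ckpt-12345
            match = re.search(r"-(\d+)$", ckpt)
            step = int(match.group(1)) if match else -1
            if step > last_step:
                render_checkpoint(ckpt)   # resources are actually used only here
                last_step = step
        time.sleep(poll_secs)
```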

Another would be to have training jobs enqueue render requests in a message queue, or directly create render jobs via the Kubernetes API (the latter seems not to separate concerns as cleanly).
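
For the direct-creation variant, something like the following with the official Kubernetes Python client (the image, command, and namespace are placeholders, not the actual kubeflow-rl manifests):

```python
# Sketch: a training job creating a render Job directly via the Kubernetes API.
from kubernetes import client, config

def submit_render_job(logdir, namespace="default"):
    config.load_incluster_config()  # assumes we're running inside the cluster
    container = client.V1Container(
        name="render",
        image="gcr.io/my-project/agents-ppo:latest",  # placeholder image
        command=["python", "-m", "trainer.task",      # placeholder command
                 "--run_mode", "render", "--logdir", logdir])
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(generate_name="render-"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container],
                                      restart_policy="Never"))))
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```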

And another would be to have the writing of checkpoints trigger renders via a third-party storage object-change notifier and file-name matcher. For now it looks like Minio does not support the GCS notifications API.

Triggers aside, renders could be run using lightweight Fission functions, one episode at a time (allowing many in parallel). One issue with this would be models that require a lot of memory: loading the model many times in parallel, or doing so with a serverless framework that doesn't support memory requests. But it looks like this is in progress for Fission: https://github.com/fission/fission/issues/193.

Introducing an event- or message-based serverless framework to kubeflow just for this purpose might be overboard, but less so if you're already using these in your production system. When a device streams events, how are those communicated to models deployed with serving? If we can assume that production systems will already have serverless and messaging deployed, then adding these is in a sense free, and having training jobs trigger renders by emitting a "checkpoint written" message might be an efficient, easy, decoupled way to go initially.
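
For instance, with a Pub/Sub-style topic (the transport, topic name, and payload shape are all assumptions, not part of kubeflow-rl), the training side could be as small as:

```python
# Sketch: emit a "checkpoint written" event that render (or other) consumers
# can subscribe to. Google Cloud Pub/Sub is just one possible transport.
import json

from google.cloud import pubsub_v1

def publish_checkpoint_written(project, topic, logdir, global_step):
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project, topic)
    payload = json.dumps({"event": "checkpoint_written",
                          "logdir": logdir,
                          "global_step": global_step}).encode("utf-8")
    publisher.publish(topic_path, data=payload).result()
```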

But if that's not a fair assumption, or it's out of scope, then it would be most expedient to start with a render job that checks for new checkpoints on a cron schedule, or perhaps one that simply produces renders on that schedule without checking that they're based on new checkpoints.

Thoughts or simpler approaches?

cwbeitel commented 6 years ago

Of course there's also manually submitting render jobs (which you can do with the logdir of a training job that is still running), as I have in the notebook, though that isn't quite working yet:

```
ks param set agents-ppo logdir [log dir for running job]
ks param set agents-ppo num_cpu 1
ks param set agents-ppo run_mode render
ks apply gke -c agents-ppo
```

cwbeitel commented 6 years ago

See https://github.com/cwbeitel/kubeflow-rl/commit/6cdfcac621f066cf88748d07b30079054fcbbcf1

cwbeitel commented 6 years ago

So triggering renders over HTTP is working (render jobs run, renders are generated and uploaded to GCS); see the notebook.

The cleanest way to trigger renders seems to be to use a hook into a MonitoredTrainingSession.
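
Concretely, something like a CheckpointSaverListener that POSTs to the render trigger (e.g. the Fission router) after each save; the URL and payload here are placeholders, not the actual setup:

```python
# Sketch: trigger a render over HTTP whenever MonitoredTrainingSession writes
# a checkpoint. The trigger URL and payload shape are placeholders.
import requests
import tensorflow as tf

class RenderTriggerListener(tf.train.CheckpointSaverListener):

    def __init__(self, trigger_url, logdir):
        self._trigger_url = trigger_url
        self._logdir = logdir

    def after_save(self, session, global_step_value):
        requests.post(self._trigger_url,
                      json={"logdir": self._logdir,
                            "global_step": int(global_step_value)},
                      timeout=10)

# The listener rides along with the CheckpointSaverHook, e.g.:
# hooks = [tf.train.CheckpointSaverHook(
#     checkpoint_dir=logdir, save_steps=1000,
#     listeners=[RenderTriggerListener(trigger_url, logdir)])]
# with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
#     ...
```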

Also, each render job should only produce one render instead of the 7 that are being produced now (despite num_agents=1 and num_episodes=1).

jlewi commented 6 years ago

This is really cool. The idea of using Fission to hook into MonitoredTrainingSession is pretty neat.

This might be useful for TB as a service because we could trigger a job for each event dir to load the data into a DB like MySQL.

cwbeitel commented 6 years ago

Yeah, thanks man. On the positive side it seems like a way to give the broader infrastructure flexible subscriptions to training events. I was thinking messages over topics would allow flexibility in which other services consume particular events, e.g. email/Slack notification when performance exceeds X, notification when a job goes down, updating hparams and restarting a job if its performance is below the nth percentile of the job pool, streaming performance metrics to non-TensorBoard dashboards, etc.

TB as a service sounds cool, assuming you're referring to syncing training event and checkpoint data. It looks like TensorBoard is moving towards SQL (https://github.com/tensorflow/tensorboard/issues/92), which would be valuable for many reasons. It would speed things up to log locally, or at least to a local cache, and sync the result as appropriate. It would probably be a better design to have training jobs emit event streams that are cached and consumed by a TensorBoard service than to log to local disk and trigger separate filesystem syncs of that log directory.

Relatedly, it would be nice to be able to see renders within TensorBoard (and perhaps to trigger them there, and perhaps to trigger and monitor jobs from TensorBoard as well).