cwbeitel / kubeflow-rl

Demonstrations of RL on kubeflow
Apache License 2.0

Op not fetchable when wrapping optimizer with SyncReplicasOptimizer #1

Open cwbeitel opened 6 years ago

cwbeitel commented 6 years ago

See https://github.com/tensorflow/k8s/pull/159 and error.

The error can be reproduced by setting --sync_replicas to True (in task.py or by passing the param via the job YAML) and using run-remote on a cluster deployed with deploy-gke.

@jlewi @danijar

danijar commented 6 years ago

It seems like some ops that are defined inside the tf.cond are being fetched for synchronizing between workers. I'm surprised that tf.train.SyncReplicasOptimizer actually fetches anything though, as the distributed session should span all machines. I could imagine that it will be difficult to restructure the code so that all synchronization points are outside of any tf.cond statements. @mrry Do you have an idea how to solve this?

cwbeitel commented 6 years ago

I guess one approach would be to just use MPI, as OpenAI Baselines does in MpiAdam.
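To make the MPI idea concrete, here is a rough sketch of the pattern as I understand it from MpiAdam: each worker computes a gradient on its own data, the gradients are summed across workers and divided by the worker count, and every worker then applies the same Adam update locally. The sketch simulates the allreduce in a single process with plain Python lists, so it does not use mpi4py; `allreduce_mean`, `SimulatedWorker`, and the quadratic toy loss are hypothetical names chosen for illustration, not code from Baselines or from this repo.

```python
import math


def allreduce_mean(vectors):
    """Simulate an MPI allreduce (sum) followed by division by world size.

    Hypothetical helper: with real MPI each worker would call an allreduce
    on its own buffer instead of passing a list of all gradients.
    """
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


class SimulatedWorker:
    """One worker holding its own copy of the parameters and Adam state."""

    def __init__(self, params, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = list(params)
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * len(params)  # first-moment estimate
        self.v = [0.0] * len(params)  # second-moment estimate
        self.t = 0                    # number of updates applied

    def local_grad(self, target):
        # Gradient of the toy loss 0.5 * (p - target)^2 on this worker's data.
        return [p - target for p in self.params]

    def apply(self, grad):
        # Standard Adam update, fed with the already-averaged gradient.
        self.t += 1
        for i, g in enumerate(grad):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            self.params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


# Three simulated workers, each seeing a different target value.
targets = [1.0, 2.0, 3.0]
workers = [SimulatedWorker([0.0, 0.0]) for _ in targets]

for _ in range(200):
    grads = [w.local_grad(t) for w, t in zip(workers, targets)]
    mean_grad = allreduce_mean(grads)  # identical on every worker
    for w in workers:
        w.apply(mean_grad)
```

Because every worker starts from the same parameters and applies the same averaged gradient, the copies stay in lockstep without any synchronization op living inside the TensorFlow graph, which is why this might sidestep the tf.cond problem.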

mrry commented 6 years ago

Since the error message is:

ValueError: Operation u'end_episode/cond/cond/training/scan_1/while/Assign' has been marked as not fetchable.

...can you show the code that creates that op? This doesn't look like something the SyncReplicasOptimizer would do, since it doesn't generally create or consume the result of Assign ops, so for now I suspect the problem is in the user code.
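For anyone reading along who has not hit this message before: in TF 1.x graph mode, ops created inside a tf.cond branch or a while loop body get marked as not fetchable, and Session.run refuses to fetch them directly. The toy model below only illustrates that rule in plain Python. It is not TensorFlow code; `ToyGraph`, `cond_branch`, `add_op`, and `run` are hypothetical names made up for this sketch.

```python
from contextlib import contextmanager


class ToyGraph:
    """Toy stand-in for a TF 1.x graph; NOT TensorFlow's implementation."""

    def __init__(self):
        self._unfetchable = set()
        self._scope = []   # stack of active name scopes
        self._in_cond = 0  # depth of nested conditional contexts

    @contextmanager
    def cond_branch(self, name):
        """Hypothetical helper: ops created inside are marked not fetchable."""
        self._scope.append(name)
        self._in_cond += 1
        try:
            yield
        finally:
            self._in_cond -= 1
            self._scope.pop()

    def add_op(self, name):
        full_name = "/".join(self._scope + [name])
        if self._in_cond:
            # The op only runs when its branch is taken, so forbid fetching it.
            self._unfetchable.add(full_name)
        return full_name

    def run(self, op_name):
        if op_name in self._unfetchable:
            raise ValueError(
                "Operation %r has been marked as not fetchable." % op_name)
        return "ran " + op_name


graph = ToyGraph()
with graph.cond_branch("cond"):
    inner = graph.add_op("Assign")   # created inside the conditional branch
outer = graph.add_op("train_step")   # created outside any branch

try:
    graph.run(inner)
    error = None
except ValueError as exc:
    error = str(exc)

result = graph.run(outer)
```

In this model, fetching `inner` fails while fetching `outer` succeeds, so the question for the real code is which caller ends up passing an op from inside the nested cond/scan to a fetch.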

cwbeitel commented 6 years ago

Sure, the code is here: https://github.com/cwbeitel/agents/blob/master/agents/ppo/algorithm.py. See also https://github.com/tensorflow/k8s/pull/159#issuecomment-352057950

danijar commented 6 years ago

@mrry Would you mind taking a look at the code linked above, please?

cwbeitel commented 6 years ago

Shoot, sorry, the links in the first comment above should have been tied to a specific commit, i.e. the one that had run-remote, which could be used to reproduce the error.

Anyway, the most recent commit also produces the error with sync_replicas=True; see the notebook for params and logs, as well as the ksonnet params for the job that produces the not-fetchable error.

Also, this is running with TF v1.4.1:

INFO:tensorflow:Tensorflow version: 1.4.1
INFO:tensorflow:Tensorflow git version: v1.4.0-19-ga52c8d9