cwbeitel opened this issue 6 years ago
It seems like some ops defined inside the tf.cond are being fetched to synchronize between workers. I'm surprised that tf.train.SyncReplicasOptimizer actually fetches anything, though, since the distributed session should span all machines. I could imagine that it will be difficult to restructure the code so that all synchronization points are outside of any tf.cond statements. @mrry, do you have an idea how to solve this?
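For context, a minimal sketch of how SyncReplicasOptimizer is typically wired in TF 1.x (num_workers, is_chief, and the toy loss are placeholders, not code from this repo). In this pattern the cross-worker aggregation runs through the wrapped train_op and the session-run hook, so in principle nothing defined inside a tf.cond branch needs to be fetched directly:

    import tensorflow as tf  # TF 1.x API, matching the 1.4.1 version reported below

    num_workers = 4   # placeholder; a real job reads this from the cluster spec
    is_chief = True   # placeholder

    # Toy model so the snippet is self-contained.
    w = tf.Variable(0.5)
    loss = tf.square(w - 1.0)
    global_step = tf.train.get_or_create_global_step()

    opt = tf.train.SyncReplicasOptimizer(
        tf.train.AdamOptimizer(1e-4),
        replicas_to_aggregate=num_workers,
        total_num_replicas=num_workers)
    train_op = opt.minimize(loss, global_step=global_step)

    # The synchronization logic is driven by this hook plus the wrapped train_op,
    # e.g. via tf.train.MonitoredTrainingSession(..., hooks=[sync_hook]).
    sync_hook = opt.make_session_run_hook(is_chief)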
I guess one approach would be to just use MPI, as is done in OpenAI Baselines' MpiAdam.
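A minimal sketch of that MPI route (not Baselines' MpiAdam itself, just the core idea: average flat numpy gradients across ranks with an allreduce, so no cross-worker fetches go through the TensorFlow graph):

    import numpy as np
    from mpi4py import MPI

    def mpi_average(local_grad):
        """Element-wise mean of a flat numpy gradient across all MPI ranks."""
        comm = MPI.COMM_WORLD
        summed = np.zeros_like(local_grad)
        comm.Allreduce(local_grad, summed, op=MPI.SUM)
        return summed / comm.Get_size()

    # Each worker computes its local gradient, averages it here, and then applies
    # the averaged gradient to its own copy of the parameters (as MpiAdam does).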
Since the error message is:
ValueError: Operation u'end_episode/cond/cond/training/scan_1/while/Assign' has been marked as not fetchable.
...can you show the code that creates that op? This doesn't look like something the SyncReplicasOptimizer would do, since it doesn't generally create or consume the result of Assign ops, so for now I suspect the problem is in the user code.
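For illustration only, a minimal sketch (not the code from the linked repo) of how an Assign op can end up with a name like the one above and get marked as not fetchable, assuming a variable assignment inside a tf.scan that is called under a tf.cond:

    import tensorflow as tf  # TF 1.x graph mode

    v = tf.Variable(0.0, name="stat")
    xs = tf.constant([1.0, 2.0, 3.0])
    training = tf.placeholder(tf.bool, name="training")
    created = []  # keep a handle on the Assign op created inside the loop body

    def scan_body(acc, x):
        # tf.scan builds a while loop; ops created in its body, like this Assign,
        # are marked as not fetchable by the while-loop control-flow context.
        update = tf.assign(v, x)
        created.append(update.op)
        with tf.control_dependencies([update]):
            return acc + x

    def do_train():
        return tf.reduce_sum(tf.scan(scan_body, xs, initializer=0.0))

    out = tf.cond(training, do_train, lambda: tf.constant(0.0))

    g = tf.get_default_graph()
    print(g.is_fetchable(out))         # True: the cond output itself is fine to fetch
    print(g.is_fetchable(created[0]))  # False: passing it to session.run would raise
                                       # "Operation '...' has been marked as not fetchable."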
@mrry Would you mind taking a look at the code linked above, please?
Shoot, sorry, the links in the first comment above should have been tied to a specific commit, i.e. the one that had run-remote, which could be used to reproduce the error.
Anyway, the most recent commit also produces the error with sync_replicas=True; see the notebook for params and logs, as well as the ksonnet params for the job that produces the fetchable error.
Also, this is running with TF v1.4.1:
INFO:tensorflow:Tensorflow version: 1.4.1
INFO:tensorflow:Tensorflow git version: v1.4.0-19-ga52c8d9
See https://github.com/tensorflow/k8s/pull/159 and the error.
The error can be reproduced with --sync_replicas set to True (via task.py or by passing the param in the job YAML), using run-remote on a cluster deployed with deploy-gke.
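For reference, a hypothetical sketch of how such a flag is typically defined in a task.py (the actual flag wiring in the repo may differ):

    import tensorflow as tf

    # Hypothetical flag definition; the repo's task.py may name or wire this differently.
    tf.app.flags.DEFINE_boolean(
        "sync_replicas", False,
        "If True, aggregate gradients across workers with SyncReplicasOptimizer.")
    FLAGS = tf.app.flags.FLAGS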
@jlewi @danijar