```shell
# RUNTIMEID=$(kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.RuntimeId}')
# kubectl get po -lruntime_id=$RUNTIMEID -a
NAME                                        READY     STATUS      RESTARTS   AGE
mnist-simple-gpu-dist-master-0rzp-0-v0kk6   1/1       Running     0          2h
mnist-simple-gpu-dist-ps-0rzp-0-dtuin       0/1       Completed   0          2h
mnist-simple-gpu-dist-worker-0rzp-0-cz3f5   1/1       Running     0          2h
```
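A `Completed` pod means its container terminated with exit code 0, i.e. the PS process returned normally rather than crashing. That can be confirmed from the pod's `containerStatuses`. A minimal sketch of pulling the exit code out of `kubectl get pod <ps-pod> -o json` output (the embedded JSON here is an illustrative stand-in for the real kubectl output):

```python
import json

# Simulated (abridged) output of `kubectl get pod <ps-pod> -o json`;
# the field names match the Kubernetes pod status schema.
pod_json = '''
{
  "status": {
    "containerStatuses": [
      {"name": "tensorflow",
       "state": {"terminated": {"exitCode": 0, "reason": "Completed"}}}
    ]
  }
}
'''

pod = json.loads(pod_json)
state = pod["status"]["containerStatuses"][0]["state"]
exit_code = state["terminated"]["exitCode"]
# exitCode 0 => the PS script ran to its end; a non-zero code would mean a crash.
print(exit_code)
```

A non-zero `exitCode` (or a `waiting`/`running` state instead of `terminated`) would point to a different failure mode than the one described below.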
And the PS logs are:

```shell
kubectl logs mnist-simple-gpu-dist-ps-0rzp-0-dtuin
```

```
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-06-09 14:19:12.971461: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-06-09 14:19:12.972787: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> mnist-simple-gpu-dist-master-0rzp-0:2222}
2018-06-09 14:19:12.972811: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-06-09 14:19:12.972818: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> mnist-simple-gpu-dist-worker-0rzp-0:2222}
2018-06-09 14:19:12.974524: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
WARNING:tensorflow:From /app/main.py:151: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See tf.nn.softmax_cross_entropy_with_logits_v2.
WARNING:tensorflow:From /app/main.py:188: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-09 14:19:33.448292: I tensorflow/core/distributed_runtime/master_session.cc:1017] Start master session 8cede9eb21bff1b6 with config:
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1088
Accuracy at step 10: 0.7341
Accuracy at step 20: 0.8266
Accuracy at step 30: 0.8784
Accuracy at step 40: 0.8966
Accuracy at step 50: 0.9095
Accuracy at step 60: 0.9149
Accuracy at step 70: 0.9176
Accuracy at step 80: 0.92
Accuracy at step 90: 0.9217
Adding run metadata for 99
Accuracy at step 100: 0.9283
Accuracy at step 110: 0.9244
Accuracy at step 120: 0.9369
Accuracy at step 130: 0.9415
Accuracy at step 140: 0.9421
Accuracy at step 150: 0.945
Accuracy at step 160: 0.9484
Accuracy at step 170: 0.9511
```
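Notably, these PS logs show training output (accuracy per step), which is what the master or worker normally prints. In the classic between-graph replication pattern that this lab follows, the PS process is expected to block forever on `server.join()` and therefore never reach `Completed`. A minimal sketch of the role dispatch, using the `TF_CONFIG` environment variable that the TFJob operator injects into each replica (the cluster addresses here are illustrative, and the `tf.train.Server` calls are shown as comments so the sketch runs without TensorFlow):

```python
import json
import os

# The TFJob operator injects TF_CONFIG into each replica; illustrative value here.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["mnist-simple-gpu-dist-master-0rzp-0:2222"],
        "ps":     ["mnist-simple-gpu-dist-ps-0rzp-0:2222"],
        "worker": ["mnist-simple-gpu-dist-worker-0rzp-0:2222"],
    },
    "task": {"type": "ps", "index": 0},
})

tf_config = json.loads(os.environ["TF_CONFIG"])
job_name = tf_config["task"]["type"]
task_index = tf_config["task"]["index"]

# In the real training script this becomes:
#   cluster = tf.train.ClusterSpec(tf_config["cluster"])
#   server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
if job_name == "ps":
    # server.join() blocks forever on a parameter server. If this branch is
    # missing or skipped, the PS falls through to the training/exit path,
    # the process returns, and the pod shows Completed.
    role = "ps blocks on server.join()"
else:
    role = "%s %d builds the graph and trains" % (job_name, task_index)
print(role)
```

If the script ignores `TF_CONFIG` (or its own `--job_name` flag) when deciding the role, every replica runs the training branch, which would explain a PS that both prints accuracy and exits.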
I'm trying to follow https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow to test distributed training. But the result I get is that the PS pod is `Completed` while the master is not, as shown in the pod listing and PS logs above. And here is the tfjob definition: