Azure / kubeflow-labs

👩‍🔬 Train and Serve TensorFlow Models at Scale with Kubernetes and Kubeflow on Azure
Creative Commons Attribution 4.0 International

Why is the PS taken as the master in distributed training? #43

Closed cheyang closed 6 years ago

cheyang commented 6 years ago

I'm following https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow to test distributed training. However, the pod that ends up Completed is the PS, not the master.

# RUNTIMEID=$(kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.RuntimeId}')
# kubectl get po -lruntime_id=$RUNTIMEID -a
NAME                                        READY     STATUS      RESTARTS   AGE
mnist-simple-gpu-dist-master-0rzp-0-v0kk6   1/1       Running     0          2h
mnist-simple-gpu-dist-ps-0rzp-0-dtuin       0/1       Completed   0          2h
mnist-simple-gpu-dist-worker-0rzp-0-cz3f5   1/1       Running     0          2h 

And the PS logs are:

kubectl logs mnist-simple-gpu-dist-ps-0rzp-0-dtuin
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-06-09 14:19:12.971461: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-06-09 14:19:12.972787: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> mnist-simple-gpu-dist-master-0rzp-0:2222}
2018-06-09 14:19:12.972811: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-06-09 14:19:12.972818: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> mnist-simple-gpu-dist-worker-0rzp-0:2222}
2018-06-09 14:19:12.974524: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
WARNING:tensorflow:From /app/main.py:151: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

WARNING:tensorflow:From /app/main.py:188: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-09 14:19:33.448292: I tensorflow/core/distributed_runtime/master_session.cc:1017] Start master session 8cede9eb21bff1b6 with config:
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1088
Accuracy at step 10: 0.7341
Accuracy at step 20: 0.8266
Accuracy at step 30: 0.8784
Accuracy at step 40: 0.8966
Accuracy at step 50: 0.9095
Accuracy at step 60: 0.9149
Accuracy at step 70: 0.9176
Accuracy at step 80: 0.92
Accuracy at step 90: 0.9217
Adding run metadata for 99
Accuracy at step 100: 0.9283
Accuracy at step 110: 0.9244
Accuracy at step 120: 0.9369
Accuracy at step 130: 0.9415
Accuracy at step 140: 0.9421
Accuracy at step 150: 0.945
Accuracy at step 160: 0.9484
Accuracy at step 170: 0.9511
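
For context on the log above: a distributed TensorFlow 1.x script usually decides whether to train or only serve variables from the `TF_CONFIG` environment variable that the TFJob operator injects into each pod. The following is a minimal sketch of that role check, assuming the usual `{"cluster": ..., "task": {"type": ..., "index": ...}}` layout. It is not the code in /app/main.py; `resolve_role` and `should_train` are hypothetical helper names.

```python
import json
import os


def resolve_role(env=None):
    """Parse TF_CONFIG and return (job_name, task_index, cluster).

    Assumes the layout {"cluster": {...}, "task": {"type": ..., "index": ...}}.
    Returns ("", 0, {}) when TF_CONFIG is not set.
    """
    env = os.environ if env is None else env
    tf_config = json.loads(env.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    cluster = tf_config.get("cluster", {})
    job_name = task.get("type", "")
    task_index = int(task.get("index", 0))
    return job_name, task_index, cluster


def should_train(job_name):
    """Assumption: only master and worker replicas run the training loop.

    A PS replica would normally block instead, e.g. by calling join() on its
    tf.train.Server, and would never print accuracy lines.
    """
    return job_name in ("master", "worker")


if __name__ == "__main__":
    job_name, task_index, cluster = resolve_role()
    print("role=%s index=%d trains=%s"
          % (job_name, task_index, should_train(job_name)))
```

If the PS pod reported `trains=True` here, or the script never branched on the role at all, that would be consistent with the PS running the training loop shown in its log.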

And here is the TFJob definition:

apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-06-09T14:19:10Z
  generation: 0
  name: mnist-simple-gpu-dist
  namespace: default
  resourceVersion: "5259591"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/tfjobs/mnist-simple-gpu-dist
  uid: 14169ad0-6bf0-11e8-9b09-00163e085552
spec:
  RuntimeId: 0rzp
  replicaSpecs:
  - replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /app/main.py
          env:
          - name: TEST_TMPDIR
            value: /training
          image: ritazh/tf-mnist:distributedgpu 
          name: tensorflow
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
          - mountPath: /training
            name: kubeflow-dist-nas-mnist
        restartPolicy: OnFailure
        volumes:
        - name: kubeflow-dist-nas-mnist
          persistentVolumeClaim:
            claimName: kubeflow-dist-nas-mnist
    tfPort: 2222
    tfReplicaType: MASTER
  - replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /app/main.py
          image: ritazh/tf-mnist:distributedgpu 
          imagePullPolicy: Always
          name: tensorflow
          resources:
            limits:
              nvidia.com/gpu: "1"
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: WORKER
  - replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /app/main.py
          image: ritazh/tf-mnist:distributed
          imagePullPolicy: Always
          name: tensorflow
          resources: {}
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: PS
  terminationPolicy:
    chief:
      replicaIndex: 0
      replicaName: MASTER
  tfImage: tensorflow/tensorflow:1.3.0
status:
  phase: Running
  reason: ""
  replicaStatuses:
  - ReplicasStates:
      Running: 1
    state: Running
    tf_replica_type: MASTER
  - ReplicasStates:
      Running: 1
    state: Running
    tf_replica_type: WORKER
  - ReplicasStates:
      Succeeded: 1
    state: Succeeded
    tf_replica_type: PS
  state: Running
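
The mismatch in the status above (chief is MASTER, yet only PS has succeeded) can also be checked mechanically. This is a small sketch under the assumption that the `spec` and `status` sections have been loaded into plain dicts with the same keys as the YAML above; `finished_non_chief` is a hypothetical helper, not part of any Kubeflow API.

```python
def finished_non_chief(spec, status):
    """Return replica types that Succeeded although they are not the chief."""
    chief = spec["terminationPolicy"]["chief"]["replicaName"]
    return [
        replica["tf_replica_type"]
        for replica in status["replicaStatuses"]
        if replica["state"] == "Succeeded"
        and replica["tf_replica_type"] != chief
    ]


# Values copied from the TFJob definition above (only the relevant keys).
spec = {"terminationPolicy": {"chief": {"replicaIndex": 0,
                                        "replicaName": "MASTER"}}}
status = {
    "phase": "Running",
    "replicaStatuses": [
        {"tf_replica_type": "MASTER", "state": "Running"},
        {"tf_replica_type": "WORKER", "state": "Running"},
        {"tf_replica_type": "PS", "state": "Succeeded"},
    ],
}

print(finished_non_chief(spec, status))  # prints ['PS']
```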