Azure / kubeflow-labs

šŸ‘©ā€šŸ”¬ Train and Serve TensorFlow Models at Scale with Kubernetes and Kubeflow on Azure
Creative Commons Attribution 4.0 International

2-kubernetes: Error with initial job on AKS GPU #44

Closed chzbrgr71 closed 6 years ago

chzbrgr71 commented 6 years ago

I'm walking through the labs and got an error on my first job using the wbuchwalter/tf-mnist:gpu image. My YAML is included below (copied from the labs). I created an AKS cluster with the Standard_NC6 VM size, and it looks like the GPU is in place.

When I create the job, the pod shows the following error:

2018-06-18 09:05:14.835740: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

My YAML for the job:

apiVersion: batch/v1
kind: Job # Our training should be a Job since it is supposed to terminate at some point
metadata:
  name: 2-mnist-training # Name of our job
spec:
  template: # Template of the Pod that is going to be run by the Job
    metadata:
      name: 2-mnist-training # Name of the pod
    spec:
      containers: # List of containers that should run inside the pod, in our case there is only one.
      - name: tensorflow
        image: wbuchwalter/tf-mnist:gpu # The image to run; you can replace it with your own.
        args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
        volumeMounts:
        - name: nvidia
          mountPath: /usr/local/nvidia
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/local/nvidia
      restartPolicy: OnFailure # restart the pod if it fails

chzbrgr71 commented 6 years ago

This was just a failed GPU cluster.
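
For anyone hitting something similar, one way to confirm that the GPU really is visible to the scheduler is to look at what each node advertises as allocatable. The snippet below is only a minimal sketch, not part of the labs: it assumes you have the parsed output of `kubectl get nodes -o json`, and it checks both the alpha resource name requested in the job spec above and `nvidia.com/gpu`, the name used by the NVIDIA device plugin on newer clusters. The function name `gpu_counts` and the sample node names are hypothetical.

```python
# Resource names under which a node may advertise NVIDIA GPUs.
# The first is the alpha name from the job spec in this thread; the second is
# the device-plugin name (an assumption about what a newer cluster would use).
GPU_RESOURCES = ("alpha.kubernetes.io/nvidia-gpu", "nvidia.com/gpu")


def gpu_counts(node_list):
    """Map node name -> allocatable GPU count.

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    """
    counts = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities arrive as strings (e.g. "1"); missing keys count as 0.
        count = max(int(allocatable.get(name, 0)) for name in GPU_RESOURCES)
        counts[node["metadata"]["name"]] = count
    return counts


# Made-up sample data shaped like the kubectl output, for illustration only.
sample = {
    "items": [
        {
            "metadata": {"name": "example-gpu-node"},
            "status": {"allocatable": {"alpha.kubernetes.io/nvidia-gpu": "1"}},
        },
        {
            "metadata": {"name": "example-cpu-node"},
            "status": {"allocatable": {"cpu": "6"}},
        },
    ]
}
counts = gpu_counts(sample)
```

In practice you would replace `sample` with `json.loads(...)` applied to the real `kubectl get nodes -o json` output. A node that reports 0 will never be assigned a pod that requests a GPU, which can help tell a broken cluster apart from a problem in the job spec.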