I'm walking through the labs and hit an error on my first job, which uses the wbuchwalter/tf-mnist:gpu image. My YAML is below (copied from the labs). I created an AKS cluster with the Standard_NC6 VM size, and it looks like the GPU is in place.
When I create the job, the pod shows the following error:
2018-06-18 09:05:14.835740: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
My YAML for the job:
apiVersion: batch/v1
kind: Job # Our training should be a Job since it is supposed to terminate at some point
metadata:
  name: 2-mnist-training # Name of our job
spec:
  template: # Template of the Pod that is going to be run by the Job
    metadata:
      name: 2-mnist-training # Name of the pod
    spec:
      containers: # List of containers that should run inside the pod, in our case there is only one.
      - name: tensorflow
        image: wbuchwalter/tf-mnist:gpu # The image to run, you can replace by your own.
        args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
        volumeMounts:
        - name: nvidia
          mountPath: /usr/local/nvidia
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/local/nvidia
      restartPolicy: OnFailure # restart the pod if it fails