NTHU-LSALAB / KubeShare

Share GPU between Pods in Kubernetes
Apache License 2.0
193 stars 42 forks

Pod CrashLoopBackOff when running sample #4

Closed: Huifu1018 closed this issue 4 years ago

Huifu1018 commented 4 years ago

I followed the instructions for running the KubeShare sample.

When I apply the sharepod1.yaml and sharepod2.yaml files, I see Init: CrashLoopBackOff errors, but no errors in the logs.

kubectl create -f .

sharepod.kubeshare.nthu/sharepod1 created
sharepod.kubeshare.nthu/sharepod2 created

When I check the pod logs, there are no errors in the output:

kubectl logs sharepod1
GPU 0: Tesla T4 (UUID: GPU-984b0041-8fa4-82e9-6111-5c8b7c351158)

kubectl logs sharepod2
GPU 0: Tesla T4 (UUID: GPU-984b0041-8fa4-82e9-6111-5c8b7c351158)

Below is the description of the pod (kubectl describe pod sharepod1):

Name:           sharepod1
Namespace:      default
Priority:       0
Node:           k8s-gpu/10.166.15.26
Start Time:     Mon, 13 Apr 2020 15:45:44 +0800
Labels:         <none>
Annotations:    cni.projectcalico.org/podIP: 192.168.134.208/32
                kubeshare/GPUID: abcde
                kubeshare/gpu_limit: 1.0
                kubeshare/gpu_mem: 1073741824
                kubeshare/gpu_request: 0.5
Status:         Running
IP:             192.168.134.208
Controlled By:  SharePod/sharepod1
Containers:
  cuda:
    Container ID:  docker://f96ace2735ad6e3b0adb87a207b540d8faf10de7a8f31b60ac62dea188f391f8
    Image:         nvidia/cuda:9.0-base
    Image ID:      docker-pullable://10.166.15.29:5000/nvidia/cuda@sha256:56bfa4e0b6d923bf47a71c91b4e00b62ea251a04425598d371a5807d6ac471cb
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-smi
      -L
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 13 Apr 2020 15:48:31 +0800
      Finished:     Mon, 13 Apr 2020 15:48:31 +0800
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:     1
      memory:  500Mi
    Requests:
      cpu:     1
      memory:  500Mi
    Environment:
      NVIDIA_VISIBLE_DEVICES:      GPU-984b0041-8fa4-82e9-6111-5c8b7c351158
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
      LD_PRELOAD:                  /kubeshare/library/libgemhook.so.1
      POD_MANAGER_IP:              192.168.134.192
      POD_MANAGER_PORT:            50059
      POD_NAME:                    default/sharepod1
    Mounts:
      /kubeshare/library from kubeshare-lib (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l54xv (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kubeshare-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /kubeshare/library
    HostPathType:
  default-token-l54xv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l54xv
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                    From              Message
  ----     ------   ----                   ----              -------
  Normal   Pulled   3m41s (x5 over 5m5s)   kubelet, k8s-gpu  Container image "nvidia/cuda:9.0-base" already present on machine
  Normal   Created  3m41s (x5 over 5m5s)   kubelet, k8s-gpu  Created container cuda
  Normal   Started  3m40s (x5 over 5m5s)   kubelet, k8s-gpu  Started container cuda
  Warning  BackOff  3m14s (x10 over 5m3s)  kubelet, k8s-gpu  Back-off restarting failed container

How can I solve this problem? Thanks ~ @ncy9371

ncy9371 commented 4 years ago

Hi

I think the containers were actually running fine. A few clues: the last state is Terminated with reason "Completed" and exit code 0, and the restart count is 5, so your containers had already executed "nvidia-smi -L" successfully five times and were simply restarted each time they finished.

Please try adding restartPolicy: Never to the PodSpec in the SharePod yaml and create it again. This is also why there is a "sleep infinity" at the end of the container command in the sample yaml file.
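
For example, a rough sketch (not the exact sample file) of a one-shot SharePod with restartPolicy: Never added; the gpu annotations, image, and command are copied from your describe output above:

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: sharepod1
  annotations:
    "kubeshare/gpu_request": "0.5"
    "kubeshare/gpu_limit": "1.0"
    "kubeshare/gpu_mem": "1073741824"
spec:
  restartPolicy: Never              # the one-shot "nvidia-smi -L" exits with code 0, so don't restart it
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base
    command: ["nvidia-smi", "-L"]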

thx!

Huifu1018 commented 4 years ago

I have added restartPolicy: Never, and the status now shows "Completed".

sharepod1                       0/1     Completed   0          8m34s
sharepod2                       0/1     Completed   0          8m33s

The description of sharepod1:

Name:           sharepod1
Namespace:      default
Priority:       0
Node:           k8s-gpu/10.166.15.26
Start Time:     Mon, 13 Apr 2020 19:09:01 +0800
Labels:         <none>
Annotations:    cni.projectcalico.org/podIP: 192.168.134.217/32
                kubectl.kubernetes.io/last-applied-configuration:
                  {"apiVersion":"kubeshare.nthu/v1","kind":"SharePod","metadata":{"annotations":{"kubeshare/GPUID":"abcde","kubeshare/gpu_limit":"1.0","kube...
                kubeshare/GPUID: abcde
                kubeshare/gpu_limit: 1.0
                kubeshare/gpu_mem: 1073741824
                kubeshare/gpu_request: 0.5
Status:         Succeeded
IP:             192.168.134.217
Controlled By:  SharePod/sharepod1
Containers:
  cuda:
    Container ID:  docker://b8576cc60801fc54a10ac6a6c98c58f3e782dc3239c22c3adf13c63e19f0ffda
    Image:         nvidia/cuda:9.0-base
    Image ID:      docker-pullable://10.166.15.29:5000/nvidia/cuda@sha256:56bfa4e0b6d923bf47a71c91b4e00b62ea251a04425598d371a5807d6ac471cb
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-smi
      -L
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 13 Apr 2020 19:09:03 +0800
      Finished:     Mon, 13 Apr 2020 19:09:03 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  500Mi
    Requests:
      cpu:     1
      memory:  500Mi
    Environment:
      NVIDIA_VISIBLE_DEVICES:      GPU-984b0041-8fa4-82e9-6111-5c8b7c351158
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
      LD_PRELOAD:                  /kubeshare/library/libgemhook.so.1
      POD_MANAGER_IP:              192.168.134.192
      POD_MANAGER_PORT:            50065
      POD_NAME:                    default/sharepod1
    Mounts:
      /kubeshare/library from kubeshare-lib (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l54xv (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kubeshare-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /kubeshare/library
    HostPathType:
  default-token-l54xv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l54xv
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason   Age    From              Message
  ----    ------   ----   ----              -------
  Normal  Pulled   3m44s  kubelet, k8s-gpu  Container image "nvidia/cuda:9.0-base" already present on machine
  Normal  Created  3m44s  kubelet, k8s-gpu  Created container cuda
  Normal  Started  3m43s  kubelet, k8s-gpu  Started container cuda

I don't know whether "Completed" is the correct status here, or whether it should be "Running". Thanks~ @ncy9371

Huifu1018 commented 4 years ago

I tried those examples a few times to create two MNIST training jobs, but each time I cannot find sharepod.kubeshare.nthu/pod1 in the pod list.

Could this be a KubeShare version issue?

ncy9371 commented 4 years ago

I have added restartPolicy: Never, and the status now shows "Completed". I don't know whether "Completed" is the correct status here, or whether it should be "Running".

The status semantics are no different from a regular Pod. A long-running service stays "Running" until you terminate it. For a batch job such as model training, the status changes to "Completed" once the job finishes and exits without error (exit code 0), and to "Failed" otherwise.
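
For example, the sample keeps its SharePod in the "Running" state by appending "sleep infinity" to the container command; a rough sketch of that pattern, reusing the values from your describe output:

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: sharepod1
  annotations:
    "kubeshare/gpu_request": "0.5"
    "kubeshare/gpu_limit": "1.0"
    "kubeshare/gpu_mem": "1073741824"
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base
    # print the visible GPU once, then block forever so the Pod stays "Running"
    command: ["sh", "-c", "nvidia-smi -L && sleep infinity"]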

ncy9371 commented 4 years ago

I tried those examples a few times to create two MNIST training jobs, but each time I cannot find sharepod.kubeshare.nthu/pod1 in the pod list. Could this be a KubeShare version issue?

Can you provide more information about your SharePod yaml, the SharePod list, the Pod list, and how you create your job?
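
For example, the output of roughly these commands (the sharepod resource name is the one used elsewhere in this thread; adjust names to yours):

kubectl get sharepod -o yaml          # the SharePod objects as you created them
kubectl get sharepod                  # the SharePod list
kubectl get pods --all-namespaces     # the Pod list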

Huifu1018 commented 4 years ago

The status semantics are no different from a regular Pod: a long-running service stays "Running" until you terminate it, while a batch job changes to "Completed" once it finishes with exit code 0.

Thanks a lot!~ I will try running a long-running service to test the sample.

Huifu1018 commented 4 years ago

I tried those examples a few times to create two MNIST training jobs, but each time I cannot find sharepod.kubeshare.nthu/pod1 in the pod list. Could this be a KubeShare version issue?

Can you provide more information about your SharePod yaml, the SharePod list, the Pod list, and how you create your job?

Below is the yaml for pod1, followed by the SharePod list from kubectl get sharepod:

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: pod1
  annotations:
    "kubeshare/gpu_request": "0.4"
    "kubeshare/gpu_limit": "1.0"
    "kubeshare/gpu_mem": "3145728000"
spec:
  terminationGracePeriodSeconds: 0
  containers:
  - name: tf
    image: 10.166.15.29:5000/tensorflow/tensorflow:1.15.2-gpu-py3
    command: ["sh", "-c", "curl -s https://lsalab.cs.nthu.edu.tw/~ericyeh/KubeShare/demo/mnist.py | python3 -"]
  nodeName: k8s-gpu
  restartPolicy: OnFailure

NAME   AGE
pod1   13h
pod2   14h

After I created pod1 and pod2, I can only find them in the SharePod list; no corresponding Pod shows up with kubectl get pods --all-namespaces.

Did I do something wrong when running the KubeShare sample? Thanks ~ @ncy9371

ncy9371 commented 4 years ago

After I created pod1 and pod2, I can only find them in the SharePod list; no corresponding Pod shows up with kubectl get pods --all-namespaces.

The problem is that you specified nodeName in the PodSpec. If you want to request a portion of a GPU through a SharePod, you must either leave both ".metadata.annotations["kubeshare/GPUID"]" and ".spec.nodeName" empty, or assign both values yourself (see "SharePod with NodeName and GPUID (advanced)"). The former means the SharePod is scheduled by kubeshare-scheduler; the latter means you schedule the SharePod yourself.

In your case, a quick solution is removing ".spec.nodeName" from your yaml file.
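
For example, your yaml with only the ".spec.nodeName" line removed, so kubeshare-scheduler picks the node and GPU (everything else unchanged):

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: pod1
  annotations:
    "kubeshare/gpu_request": "0.4"
    "kubeshare/gpu_limit": "1.0"
    "kubeshare/gpu_mem": "3145728000"
spec:
  terminationGracePeriodSeconds: 0
  containers:
  - name: tf
    image: 10.166.15.29:5000/tensorflow/tensorflow:1.15.2-gpu-py3
    command: ["sh", "-c", "curl -s https://lsalab.cs.nthu.edu.tw/~ericyeh/KubeShare/demo/mnist.py | python3 -"]
  # nodeName removed: with both nodeName and the kubeshare/GPUID annotation empty,
  # kubeshare-scheduler places the SharePod on a node and GPU for you
  restartPolicy: OnFailure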

thx!

Huifu1018 commented 4 years ago

In your case, a quick solution is removing ".spec.nodeName" from your yaml file.

This worked for me. Thanks for looking into this @ncy9371