NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

container restarts when short task completes over a DRA slice #34

Closed · asm582 closed 9 months ago

asm582 commented 9 months ago

Hello,

I am trying to run the YAML below over a GPU slice, and we see that the container restarts after the sleep 2 command completes:

gpu-test1            pod1                                                           0/1     CrashLoopBackOff   5 (81s ago)   4m34s
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m51s                  default-scheduler  0/2 nodes are available: waiting for dynamic resource controller to create the resourceclaim "pod1-mig1g". no new claims to deallocate, preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
  Warning  FailedScheduling  3m50s                  default-scheduler  running Reserve plugin "DynamicResources": waiting for resource driver to allocate resource
  Normal   Scheduled         3m47s                  default-scheduler  Successfully assigned gpu-test1/pod1 to k8s-dra-driver-cluster-worker
  Normal   Pulled            2m10s (x5 over 3m46s)  kubelet            Container image "ubuntu:22.04" already present on machine
  Normal   Created           2m10s (x5 over 3m46s)  kubelet            Created container ctr
  Normal   Started           2m10s (x5 over 3m46s)  kubelet            Started container ctr
  Warning  BackOff           100s (x10 over 3m41s)  kubelet            Back-off restarting failed container ctr in pod pod1_gpu-test1(e595b536-224d-4a71-85a2-80ac40183386)

Below is the YAML:


---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1

---

apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: MigDeviceClaimParameters
metadata:
  namespace: gpu-test1
  name: mig-1g.10gb
spec:
  profile: "1g.10gb"
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: mig-1g.10gb
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    parametersRef:
      apiGroup: gpu.resource.nvidia.com
      kind: MigDeviceClaimParameters
      name: mig-1g.10gb

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod1
  labels:
    app: pod
spec:
  resourceClaims:
  - name: mig1g
    source:
      resourceClaimTemplateName: mig-1g.10gb
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; sleep 2"]
    resources:
      claims:
      - name: mig1g

Can you please suggest how we can run short tasks over a slice? Thanks.

asm582 commented 9 months ago

OK, I changed the above example to a batch/v1 Job and the pod now stays Completed. Could it be that when a standalone pod completes, the DRA controller auto-restarts it?
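
For reference, a minimal sketch of what that Job conversion might look like. The Job name, backoffLimit, and restartPolicy shown here are assumptions, not taken from the original report; the claim and container are reused from the manifests above:

apiVersion: batch/v1
kind: Job
metadata:
  namespace: gpu-test1
  name: job1            # hypothetical name, not from the original report
spec:
  backoffLimit: 0       # assumption: do not retry the short task on failure
  template:
    spec:
      # Jobs require restartPolicy Never or OnFailure, so a pod that
      # exits successfully stays Completed instead of being restarted.
      restartPolicy: Never
      resourceClaims:
      - name: mig1g
        source:
          resourceClaimTemplateName: mig-1g.10gb
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; sleep 2"]
        resources:
          claims:
          - name: mig1g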

klueska commented 9 months ago

> OK, I changed the above example to a batch/v1 Job and the pod now stays Completed. Could it be that when a standalone pod completes, the DRA controller auto-restarts it?

That has nothing to do with DRA. That is how standalone pods work in K8s: their restartPolicy defaults to Always, so when the container exits, the kubelet restarts it.
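
For completeness, the same effect can be achieved without a Job by setting restartPolicy explicitly on the standalone pod. A minimal sketch, reusing the claim from the manifests above; only the restartPolicy line differs from the original pod:

apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod1
  labels:
    app: pod
spec:
  # The default for standalone pods is Always, which is why the short
  # task was restarted (and hit CrashLoopBackOff backoff) after sleep 2.
  restartPolicy: Never
  resourceClaims:
  - name: mig1g
    source:
      resourceClaimTemplateName: mig-1g.10gb
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; sleep 2"]
    resources:
      claims:
      - name: mig1g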