NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

Questions about CDR claim parameters #127

Closed JasonHe-WQ closed 1 month ago

JasonHe-WQ commented 4 months ago

I successfully ran the quick start demos gputest1, 2, and 3 these days on Ubuntu 22.04, and all behaved as expected. With great appreciation to the maintainers and committers, I still have some questions about ResourceClaim and ResourceClaimTemplate.

OS: Ubuntu 22.04 LTS
kernel: 6.5.0-35-generic
GPU: RTX 4080 Laptop
CUDA: 12.3
driver: 545
cluster: 1.29 by kind in the demo, 2 nodes, one master, one worker

Question 1:

What is the difference in the ResourceClaimTemplate field between gputest1 and gputest2? Is it true that one ResourceClaimTemplate referenced in two Pods, rather than in two containers, leads to "Each container asking for 1 distinct GPU"?

Question 2:

Is there a way for two containers of one Pod to each claim a distinct single GPU?

Question 3:

Comparing gputest1 and gputest3, the main differences are:

A. using ResourceClaim instead of ResourceClaimTemplate
B. the name of the ResourceClaim or ResourceClaimTemplate
C. an extra spec field in the ResourceClaimTemplate

Could anyone tell me how these differences result in a shared GPU versus distinct GPUs?

Thanks again for your dedication, and I hope you can answer soon.

JasonHe-WQ commented 4 months ago

Question 5:

Does a ResourceClaim mean binding to exactly one specific GPU, so that all the containers using that ResourceClaim can share one GPU? And does a ResourceClaimTemplate mean binding to any single GPU?

klueska commented 1 month ago

What is the difference in the ResourceClaimTemplate field between gputest1 and gputest2? Is it true that one ResourceClaimTemplate referenced in two Pods, rather than in two containers, leads to "Each container asking for 1 distinct GPU"?

A ResourceClaim has resources directly bound to it, and any pod that references it will have shared access to those resources.

A ResourceClaimTemplate provides a template for a ResourceClaim that will be generated on the fly for each pod that references it. In this way, each pod gets its own unique ResourceClaim with its own unique resources bound to it.

Is there a way for two containers of one Pod to each claim a distinct single GPU?

Note: Below is the API that was valid from Kubernetes 1.26-1.30. It has changed slightly for 1.31 but the mechanism is similar.

---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test
  name: unique-gpu
spec:
  spec:
    resourceClassName: gpu.nvidia.com

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test
  name: pod
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    resources:
      claims:
      - name: gpu0
  - name: ctr1
    resources:
      claims:
      - name: gpu1
  resourceClaims:
  - name: gpu0
    source:
      resourceClaimTemplateName: unique-gpu
  - name: gpu1
    source:
      resourceClaimTemplateName: unique-gpu

Could anyone tell me how these differences result in a shared GPU versus distinct GPUs?

My answer to question (1) hopefully clarifies this already. Each reference to a ResourceClaimTemplate triggers the creation of a unique ResourceClaim (to which unique resources will eventually be bound). Each reference to a ResourceClaim gives shared access to the resources bound to it.
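For contrast with the ResourceClaimTemplate example above, here is a sketch of the shared case using the same pre-1.31 v1alpha2 API (the namespace and names are illustrative, not from the demos): a single standalone ResourceClaim referenced by both containers of a pod, which gives them shared access to the one GPU bound to that claim.

```yaml
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  namespace: gpu-test
  name: shared-gpu
spec:
  resourceClassName: gpu.nvidia.com

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test
  name: pod
spec:
  containers:
  - name: ctr0
    resources:
      claims:
      - name: gpu     # both containers reference the same claim...
  - name: ctr1
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: shared-gpu   # ...which is bound to exactly one GPU
```

Note the use of resourceClaimName here instead of resourceClaimTemplateName: the pod references the pre-existing claim directly, so no per-pod claim is generated and both containers (and any other pod referencing shared-gpu) see the same GPU.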