NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

Support skipping time-slice settings on GPUs with older architectures #58

Open wawa0210 opened 7 months ago

wawa0210 commented 7 months ago

/kind feature

fix https://github.com/NVIDIA/k8s-dra-driver/issues/41

Decide based on the GPU architecture family: if the model does not support time slicing, ignore the setting so that the rest of the feature keeps working normally.
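
For illustration, a minimal sketch in Go of the skip behavior described above; the helper name, the architecture strings, and the silent fall-through are assumptions for this example, not necessarily the code added by this PR.

// Hypothetical sketch of "ignore time slicing on unsupported GPUs".
// The helper name and architecture strings are assumptions, not the PR's actual code.
package main

import "fmt"

// supportsTimeSlice reports whether an architecture family is assumed to
// support time slicing; Pascal and anything unlisted is treated as unsupported.
func supportsTimeSlice(arch string) bool {
	switch arch {
	case "Turing", "Ampere", "Ada", "Hopper":
		return true
	default:
		return false
	}
}

func main() {
	for _, arch := range []string{"Pascal", "Turing"} {
		if !supportsTimeSlice(arch) {
			// Per the PR description: ignore the setting instead of failing,
			// so allocation on older GPUs keeps working.
			fmt.Printf("%s: time-slice setting ignored\n", arch)
			continue
		}
		fmt.Printf("%s: time-slice setting applied\n", arch)
	}
}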

Test Results

A Tesla P4 can run the gpu-test1 demo normally: https://github.com/NVIDIA/k8s-dra-driver/blob/main/demo/specs/quickstart/gpu-test1.yaml

➜  k8s-dra-driver git:(main) ✗ kubectl get all -n gpu-test1
NAME       READY   STATUS    RESTARTS   AGE
pod/pod2   1/1     Running   0          14m
➜  k8s-dra-driver git:(main) ✗ kubectl -n gpu-test1 exec -it pod2 nvidia-smi
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Wed Jan 24 15:58:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       On  | 00000000:13:00.0 Off |                    0 |
| N/A   27C    P8               7W /  75W |      0MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

➜  k8s-dra-driver git:(main) ✗ kubectl get ResourceClaim -n gpu-test1 pod2-gpu -o yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  creationTimestamp: "2024-01-24T15:43:31Z"
  finalizers:
  - gpu.resource.nvidia.com/deletion-protection
  name: pod2-gpu
  namespace: gpu-test1
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: pod2
    uid: 09c9eae5-44b2-44ed-ba92-a4212c8edfff
  resourceVersion: "53101"
  uid: 0e2c7b7c-7357-4742-8eb1-101b0c673fd1
spec:
  allocationMode: WaitForFirstConsumer
  resourceClassName: gpu.nvidia.com
status:
  allocation:
    availableOnNodes:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - 172-30-40-100
    shareable: true
  driverName: gpu.resource.nvidia.com
  reservedFor:
  - name: pod2
    resource: pods
    uid: 09c9eae5-44b2-44ed-ba92-a4212c8edfff
wawa0210 commented 7 months ago

@klueska @elezar

Friendly ping, looking forward to a review. In particular, detactSupportTimeSliceByArch needs to cover more architectures and needs a complete support matrix.

Can you help me update the table below?

arch      supports timeslice
Fermi     unknown
Kepler    unknown
Maxwell   unknown
Pascal    false
Volta     unknown
Turing    true
Ampere    true
Hopper    true
Ada       true
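
For reference, one way that matrix could be encoded in Go; the tri-state type and variable names are hypothetical, and the unknown entries mirror the gaps in the table above.

// Hypothetical encoding of the support matrix above; names are illustrative only.
package main

import "fmt"

type timeSliceSupport int

const (
	supportUnknown timeSliceSupport = iota
	supportNo
	supportYes
)

// timeSliceByArch mirrors the table in this comment; "unknown" rows still
// need confirmation from the maintainers.
var timeSliceByArch = map[string]timeSliceSupport{
	"Fermi":   supportUnknown,
	"Kepler":  supportUnknown,
	"Maxwell": supportUnknown,
	"Pascal":  supportNo,
	"Volta":   supportUnknown,
	"Turing":  supportYes,
	"Ampere":  supportYes,
	"Hopper":  supportYes,
	"Ada":     supportYes,
}

func main() {
	fmt.Println(timeSliceByArch["Pascal"] == supportNo) // prints: true
}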
klueska commented 6 months ago

Another question I have is: what do we expect the behavior to be when timeslicing is configured on a GPU that doesn't support it? Do we expect an error to be raised, or do we want to continue assuming blocking operation if the GPU is shared?

I think we should error out and not start the plugin.

wawa0210 commented 6 months ago

Another question I have is: what do we expect the behavior to be when timeslicing is configured on a GPU that doesn't support it? Do we expect an error to be raised, or do we want to continue assuming blocking operation if the GPU is shared?

I think we should error out and not start the plugin.

Do you mean that a GPU that does not support setting a time slice will not be able to use this feature and will error out?

wawa0210 commented 5 months ago

@elezar @klueska friendly ping

elezar commented 5 months ago

@elezar @klueska friendly ping

@wawa0210 as already requested, it would be better to return an error if timeslicing is configured on a GPU that doesn't support it. This aligns with what is done for MIG.

I understand that crashing the plugin at this stage may not be desirable, but that is a separate issue that we would need to address.

wawa0210 commented 5 months ago

@elezar @klueska friendly ping

@wawa0210 as already requested, it would be better to return an error if timeslicing is configured on a GPU that doesn't support it. This aligns with what is done for MIG.

I understand that crashing the plugin at this stage may not be desirable, but that is a separate issue that we would need to address.

Agreed, updated. Maybe we should discuss this scenario in a separate issue --> https://github.com/NVIDIA/k8s-dra-driver/issues/81

wawa0210 commented 4 months ago

@elezar PTAL