NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

Add a simple TimeSlicing sharing example to quickstart (WIP) #113

Closed yuanchen8911 closed 1 month ago

yuanchen8911 commented 2 months ago

This PR adds an example for TimeSlicing sharing and updates the README description.

yuanchen8911 commented 2 months ago

/cc @klueska

yuanchen8911 commented 2 months ago

I'll close this and may add the example to a separate folder instead.

yuanchen8911 commented 2 months ago

Since there's already an MPS example in the quickstart folder, an additional example showing a different sharing strategy, TimeSlicing, would be helpful. WDYT, @klueska? We can put them in a separate folder if that works better.

klueska commented 2 months ago

Yes, now that I realize this is not the top-level README, I agree this makes sense. Will review in more detail tomorrow.

klueska commented 2 months ago

Would this now be superseded by https://github.com/NVIDIA/k8s-dra-driver/pull/118 if we moved those to this level?

yuanchen8911 commented 2 months ago

Would this now be superseded by #118 if we moved those to this level?

Let's hold off on the TimeSlicing example. It's not working as expected on my Linux workstation. The test deployed two pods configured to share a GPU via TimeSlicing. However, they ran sequentially rather than in parallel. The pending pod was unschedulable due to insufficient resources and didn't start until the first one completed. Did I misconfigure something, or does GeForce not support TimeSlicing?

$ k get pods -n mpsc-timeslicing-gpu-test
NAME        READY   STATUS    RESTARTS   AGE
gpu-pod-1   1/1     Running   0          5s
gpu-pod-2   0/1     Pending   0          5s

$ k get pods -n timeslicing-gpu-test
NAME        READY   STATUS      RESTARTS   AGE
gpu-pod-1   0/1     Completed   0          92s
gpu-pod-2   1/1     Running     0          92s

$ k get pods -n timeslicing-gpu-test
NAME        READY   STATUS      RESTARTS   AGE
gpu-pod-1   0/1     Completed   0          7m14s
gpu-pod-2   0/1     Completed   0          7m14s

yuanchen8911 commented 2 months ago

Would this now be superseded by #118 if we moved those to this level?

Yes, we won't need this if that PR is merged. That folder contains two examples for TimeSlicing.

yuanchen8911 commented 2 months ago

As @klueska suggested, we should use a ResourceClaim (not a ResourceClaimTemplate). That resolved the problem.
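
For anyone hitting the same behaviour: a ResourceClaimTemplate produces a separate ResourceClaim for each pod, so two pods end up asking for two GPUs, and on a single-GPU machine the second pod stays Pending. Pointing both pods at one pre-created ResourceClaim lets them share the same allocation. Below is a minimal sketch of that layout, not the manifest from this PR. The claim name `shared-gpu`, the container image and command, and the `resource.k8s.io/v1beta1` API version are assumptions; field names differ between DRA API versions, so check them against the version your cluster serves. The driver-specific TimeSlicing sharing configuration is left out because its schema depends on the driver release.

```yaml
# Sketch only: names, image, and API version are illustrative assumptions.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  namespace: timeslicing-gpu-test
  name: shared-gpu              # hypothetical name, referenced by both pods
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  namespace: timeslicing-gpu-test
  name: gpu-pod-1
spec:
  restartPolicy: Never
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c", "nvidia-smi -L; sleep 60"]
    resources:
      claims:
      - name: shared-gpu        # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-gpu   # existing claim, not a template
---
apiVersion: v1
kind: Pod
metadata:
  namespace: timeslicing-gpu-test
  name: gpu-pod-2
spec:
  restartPolicy: Never
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c", "nvidia-smi -L; sleep 60"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-gpu   # same claim as gpu-pod-1
```

With a layout like this, both pods should reach Running at the same time instead of one waiting for the other to complete. Had the pods used `resourceClaimTemplateName` instead, each would again receive its own claim.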