Closed: yuanchen8911 closed this 1 month ago
/cc @klueska
Closing this; we may add it to a separate folder later.
Since there's an MPS example in the quickstart folder, an additional example showing a different sharing strategy, TimeSlicing, would be helpful. WDYT, @klueska? We can put them in a separate folder if that works better.
Yes, now that I realize this is not the top-level README, I agree this makes sense. Will review in more detail tomorrow.
Would this now be superseded by https://github.com/NVIDIA/k8s-dra-driver/pull/118 if we moved those to this level?
Let's hold off on the TimeSlicing example. It's not working as expected on my Linux workstation. The test deployed two pods configured to share a GPU via TimeSlicing. However, they ran sequentially rather than in parallel. The pending pod was unschedulable due to insufficient resources and didn't start until the first one completed. Did I misconfigure something, or does GeForce not support TimeSlicing?
$ k get pods -n mpsc-timeslicing-gpu-test
NAME READY STATUS RESTARTS AGE
gpu-pod-1 1/1 Running 0 5s
gpu-pod-2 0/1 Pending 0 5s
$ k get pods -n timeslicing-gpu-test
NAME READY STATUS RESTARTS AGE
gpu-pod-1 0/1 Completed 0 92s
gpu-pod-2 1/1 Running 0 92s
$ k get pods -n timeslicing-gpu-test
NAME READY STATUS RESTARTS AGE
gpu-pod-1 0/1 Completed 0 7m14s
gpu-pod-2 0/1 Completed 0 7m14s
Would this now be superseded by #118 if we moved those to this level?
Yes, we won't need this if that PR is merged. That folder contains two examples for TimeSlicing.
As @klueska suggested, we should use ResourceClaim (not ResourceClaimTemplate). That resolved the problem.
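For anyone who hits the same behavior, here is a minimal sketch of the working setup, assuming the resource.k8s.io/v1alpha2 DRA API and the driver's gpu.nvidia.com resource class (names, namespace, and the test image are illustrative). Both pods reference the same pre-created ResourceClaim, so they are allocated the same GPU and run concurrently; with a ResourceClaimTemplate each pod gets its own claim, and on a single-GPU workstation the second pod stays Pending:

```yaml
# Shared ResourceClaim referenced by both pods.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  namespace: timeslicing-gpu-test
  name: shared-gpu
spec:
  resourceClassName: gpu.nvidia.com
  parametersRef:
    apiGroup: gpu.nvidia.com
    kind: GpuClaimParameters
    name: time-slicing-params
---
# gpu-pod-2 is identical apart from the name; both point at the same claim.
apiVersion: v1
kind: Pod
metadata:
  namespace: timeslicing-gpu-test
  name: gpu-pod-1
spec:
  restartPolicy: Never
  containers:
  - name: ctr
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1  # assumed test image
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    source:
      resourceClaimName: shared-gpu   # shared claim, not resourceClaimTemplateName
```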
This PR adds an example for TimeSlicing sharing and updates the README description.