NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Questions about GPU time-sharing on Kubernetes #404

Open jxl4650152 opened 1 year ago

jxl4650152 commented 1 year ago

1. Issue or feature description

Is it possible to enable time-sharing on selected GPUs within a single node, rather than on all of them? If some GPUs on a node do not support time-slicing (for example, a Kepler K80), what behavior can be expected from the plugin?

elezar commented 1 year ago

The device plugin does not currently have a mechanism to expose different devices as different resource types. It is also not possible to apply sharing settings on a per-GPU basis.

Note that the plugin will still allow a sharing setting to be applied to GPUs that may not support this feature; it effectively reports the same device multiple times to the Kubelet. The behaviour of containers that are started on the same device where time-slicing is not supported will depend on the application and the device, but should mirror what happens when two applications that access the same device are started on the host.
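To make the point above concrete, here is a minimal sketch (not the plugin's actual code) of what applying one replica count to every GPU looks like. The `Gpu` class, the `advertise` function, the example UUIDs and the `::<index>` suffix are all hypothetical names chosen for illustration. The only behaviour taken from the discussion is that the same setting applies to all GPUs and that each device is reported several times, whether or not it supports time-slicing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    uuid: str          # hypothetical identifier, for illustration only
    architecture: str  # informational; deliberately not consulted below


def advertise(gpus, replicas):
    """Return the device IDs a plugin would report to the kubelet.

    One replica count is applied to every GPU: there is no per-GPU
    switch and no check that the GPU supports time-slicing.
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    # Assumed naming scheme: "<uuid>::<replica index>".
    return [f"{gpu.uuid}::{i}" for gpu in gpus for i in range(replicas)]


gpus = [Gpu("GPU-aaaa", "Ampere"), Gpu("GPU-bbbb", "Kepler")]
advertised = advertise(gpus, replicas=2)
print(advertised)
```

In this sketch the Kepler device is advertised twice just like the Ampere one, so two containers can be scheduled onto it and will contend for it as two host processes would.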

jxl4650152 commented 1 year ago

Thank you for your response. We now have a clear understanding of the limitations of this plugin. We are using this repo to manage GPUs on Kubernetes, and we have also noticed another repo related to DRA (Dynamic Resource Allocation), which offers greater flexibility and richer functionality. This raises another question: will there be strong support for using GPUs on Kubernetes through the DRA approach in the future? I'm not sure whether this is the right place to ask. If not, could you please let me know where I should ask it?

klueska commented 1 year ago

Yes, DRA will be well supported. We see it as the future of GPU support in Kubernetes.

Please see: https://m.youtube.com/watch?v=_fi9asserLE&ab_channel=CNCF%5BCloudNativeComputingFoundation%5D

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

frittentheke commented 2 months ago

As @elezar said, this is not (yet) possible; see https://github.com/NVIDIA/k8s-device-plugin/blob/35ad18080eded1889dc1eaee1132debddfd6757c/api/config/v1/replicas.go#L61

I would also like to enable time-slicing on only a subset of GPUs. Is there any chance we could turn this issue into a feature request, @elezar? Or is this capability never coming to the device plugin, in favor of DRA? That seems quite a while out though - https://github.com/NVIDIA/k8s-dra-driver/issues/131