Open Suckzoo opened 1 year ago
We are aware of a similar, yet slightly different issue with GFD support on DGX-Station machines. Our plan for the next release is to completely filter out all DISPLAY devices and only support COMPUTE devices in our enumeration of GPUs for both the device plugin and GFD. In the future, we may decide to support DISPLAY devices, but at that point they would show up as a different type of allocatable device (e.g. `nvidia.com/display` instead of `nvidia.com/gpu`), and the labels applied by GFD would reflect this similarly (i.e. `nvidia.com/display.product`, `nvidia.com/display.replicas`, etc.).
@klueska Thanks for your quick response. One quick question: suppose a node has 2 RTX 2080s and 2 RTX 3090s (or any two different GPU models; I don't know whether that's a common setup). How would GFD behave in such a situation?
At present it only reports one of them: whichever one happens to show up as index 0 when calling into NVIDIA's NVML library.
I meant, GFD in the future. Sorry for the confusion.
We had added support about 6 months ago to allow such setups to be detected and to let users assign a different resource name to each of them (i.e. `nvidia.com/rtx-2080` vs. `nvidia.com/rtx-3090`), but it got reverted because our product team wasn't happy putting arbitrary resource naming in the hands of users.
This is how it would have worked: https://docs.google.com/document/d/1dL67t9IqKC2-xqonMi6DV7W2YNZdkmfX7ibB6Jb-qmk/edit
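For a sense of what that (since-reverted) design looked like, the configuration was roughly along these lines; the exact schema here is a sketch based on the linked proposal, so the field names may not match any shipped release:

```yaml
# Illustrative sketch of the reverted per-model resource-naming config.
# Field names follow the linked proposal and are not guaranteed to match
# any released version of the device plugin.
version: v1
resources:
  gpus:
  - pattern: "*RTX 2080*"   # glob matched against the product name
    name: rtx-2080          # advertised as nvidia.com/rtx-2080
  - pattern: "*RTX 3090*"
    name: rtx-3090          # advertised as nvidia.com/rtx-3090
```

Each pattern is matched against the GPU's product name, and matching devices are advertised under the corresponding resource name instead of the single `nvidia.com/gpu` pool.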
There is a KEP for dynamic resource allocation. That architecture allows a Pod to find a node where some suitable GPU exists, even where the node has multiple GPUs. Those GPUs can be fixed (even soldered in!), it doesn't have to be a hotplug scenario.
To me, that'd be the way forward for clusters where nodes have a mix of GPUs.
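Under the DRA model, a Pod no longer requests a counted `nvidia.com/gpu` resource; it references a named claim, and the driver resolves which device on which node satisfies it. A rough sketch of the consumption side, using the alpha-era API (the template name, image, and claim name below are illustrative, not from the KEP):

```yaml
# Hedged sketch of DRA consumption (alpha-era API); names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-template   # illustrative template name
  containers:
  - name: app
    image: nvidia/cuda:12.0.0-base-ubuntu22.04  # illustrative image
    resources:
      claims:
      - name: gpu   # bind this container to the claim above
```

Because scheduling is driven by the claim rather than a per-node count, this works naturally on nodes that carry a mix of GPU models.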
Yes, that is the plan going forward. The POC of our DRA resource driver for GPUs can be found here: https://gitlab.com/nvidia/cloud-native/k8s-dra-driver

It will soon include the notion of a `deviceSelector` in the `GpuClaimParameters` object so you can do things like:
```yaml
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: gpu-test
  name: a100
spec:
  count: 1
  selector:
    andExpression:
    - productName: "*A100*"
    - driverVersion:
        value: "460"
        operator: GreaterThan
```
or
```yaml
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: gpu-test
  name: t4
spec:
  count: 1
  selector:
    andExpression:
    - productName: "*T4*"
    - driverVersion:
        value: "460"
        operator: GreaterThan
```
etc.
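To actually consume such parameters, the POC pairs them with a claim via a `parametersRef`; a rough sketch of the wiring, where the resource class name and template name are assumptions (check the repo for the current API):

```yaml
# Hedged sketch: binding GpuClaimParameters to a DRA claim template.
# The class name and template name below are assumptions, not from the thread.
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test
  name: a100-template          # illustrative name
spec:
  spec:
    resourceClassName: gpu.nvidia.com   # assumed class registered by the driver
    parametersRef:
      apiVersion: gpu.resource.nvidia.com/v1alpha1
      kind: GpuClaimParameters
      name: a100               # the selector object defined above
```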
Hello,
We're testing gpu-feature-discovery on our DGX machine.
The DGX machine has two types of GPU: one is "NVIDIA-DGX-Display", and the other is "NVIDIA A100-SXM4-80GB". Currently, the `gpu.product` and `gpu.replicas` node labels can hold information for only one GPU model. We're seeing the values of those two labels flip periodically between the two models, i.e.:

`nvidia.com/gpu.product: NVIDIA-DGX-Display, nvidia.com/gpu.replicas: 1`
<->
`nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB, nvidia.com/gpu.replicas: 4`

It looks like we need to introduce another label that is capable of holding information for multiple GPU devices.