Hey,

I have a customer ... using the NVIDIA GPU Operator, alongside a custom controller (fabric8) that reads the `nvidia.com/gpu.memory` label added by GPU Feature Discovery, then patches Node objects, adding an `nvidia.com/gpu.memory` entry to the node's capacity/allocatable resources. I was surprised to see this is not managed by the NVIDIA operator out of the box.

With this set, our clusters' end users are able to schedule pods without requesting GPU cores explicitly; as a result, a single GPU core may be used by more than one container.

Any plan to implement something similar?

I don't think I can share my customer's code, and I'm not sure Java code would help here ... For the record, while the following adds a label to nodes (https://github.com/NVIDIA/gpu-feature-discovery/blob/main/internal/lm/resource.go#L36-L73), we might be able to patch the corresponding Node's `status.capacity`, adding or patching an entry for a resource named `nvidia.com/gpu.memory`.
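To make the idea concrete, here is a minimal sketch in Go with client-go (the customer's actual controller is Java/fabric8, which I can't share). The `NODE_NAME` environment variable and the assumption that the label value is the GPU memory in MiB are mine for illustration, not taken from their code:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// NODE_NAME is assumed to be injected via the downward API.
	nodeName := os.Getenv("NODE_NAME")

	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// GPU Feature Discovery publishes GPU memory as a node label;
	// the value is assumed here to be an integer number of MiB.
	memMiB, ok := node.Labels["nvidia.com/gpu.memory"]
	if !ok {
		fmt.Println("nvidia.com/gpu.memory label not found; is GFD running?")
		return
	}

	// Extended resources are advertised by patching the node's "status"
	// subresource; the '/' in the resource name is escaped as '~1' in the
	// JSON-patch path.
	patch := fmt.Sprintf(
		`[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu.memory", "value": "%sMi"}]`,
		memMiB)
	_, err = client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.JSONPatchType, []byte(patch), metav1.PatchOptions{}, "status")
	if err != nil {
		panic(err)
	}
	fmt.Printf("advertised nvidia.com/gpu.memory=%sMi on node %s\n", memMiB, nodeName)
}
```

If I understand the extended-resources mechanism correctly, `status.allocatable` is then reconciled from the patched capacity, so pods can request `nvidia.com/gpu.memory` like any other resource without requesting `nvidia.com/gpu` itself.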
Thanks!