NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0
2.54k stars 582 forks

`nvidia.com/gpu.memory` capacity #742

Open faust64 opened 1 year ago

faust64 commented 1 year ago

Hey,

I have a customer ... using the NVIDIA GPU Operator alongside a custom controller (built with fabric8) that reads the `nvidia.com/gpu.memory` label added by GPU Feature Discovery, then patches Node objects to add an `nvidia.com/gpu.memory` entry to the node's capacity/allocatable resources.

I was surprised to see this is not managed by the NVIDIA operator out of the box.

With this resource set, our clusters' end users can schedule pods without requesting GPUs explicitly; thus, a single GPU may be used by more than one container.
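Once such an extended resource exists on a node, a pod can request it like any other resource. A sketch of what that could look like (the pod name, image, and memory amount are placeholders, not from the original report; extended resources must be specified under `limits`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem-consumer
spec:
  containers:
  - name: app
    image: my-cuda-app:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu.memory: "4096"   # illustrative amount; no nvidia.com/gpu request
```

The scheduler then bin-packs pods onto nodes by available `nvidia.com/gpu.memory` rather than by whole-GPU count, which is what allows several containers to land on one GPU.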

Any plan to implement something similar? I don't think I can share my customer's code, and I'm not sure Java code would help here ... For the record, while the following adds a label to nodes ( https://github.com/NVIDIA/gpu-feature-discovery/blob/main/internal/lm/resource.go#L36-L73 ), we might be able to patch the corresponding Node's `status.capacity`, adding or patching an entry for a resource named `nvidia.com/gpu.memory`.
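For reference, the patch described above can be sketched without the fabric8 code. A minimal Go sketch (stdlib only, assuming the memory value in MiB comes from the `nvidia.com/gpu.memory` label) that builds the merge-patch body for the Node's status:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildGPUMemoryPatch builds a JSON merge-patch for a Node's status,
// adding an extended resource named nvidia.com/gpu.memory to both
// capacity and allocatable. The MiB unit mirrors what GPU Feature
// Discovery reports in its label; this is a sketch, not the operator's API.
func buildGPUMemoryPatch(memoryMiB int64) (string, error) {
	patch := map[string]any{
		"status": map[string]any{
			"capacity": map[string]string{
				"nvidia.com/gpu.memory": fmt.Sprintf("%d", memoryMiB),
			},
			"allocatable": map[string]string{
				"nvidia.com/gpu.memory": fmt.Sprintf("%d", memoryMiB),
			},
		},
	}
	b, err := json.Marshal(patch)
	return string(b), err
}

func main() {
	// Illustrative value only (e.g. a 40 GiB GPU).
	p, err := buildGPUMemoryPatch(40960)
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
}
```

The resulting body would be applied against the Node's `status` subresource, e.g. `kubectl patch node <node> --subresource=status --type=merge -p '<patch>'`, or the equivalent PATCH call from a controller; node status is not writable through a plain object update.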

Thanks!

elezar commented 1 year ago

/cc @klueska