NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

[Feature] Support to set power limit through the gpu-operator #942

Open Krast76 opened 2 months ago

Krast76 commented 2 months ago

1. Issue or feature description

Because I have "power supply" limits, I must set power limits on my GPUs. Before Kubernetes, I used to do this with nvidia-smi:

nvidia-smi -pl ${MY_POWER_LIMITS}

In a Kubernetes environment where nodes are created and destroyed many times per day, I would like this to be managed by the gpu-operator.

I took a look at the kernel documentation but found nothing that would let me manage this through kernel parameters.

Thanks

lengrongfu commented 1 week ago

@cdesiniotis Do we need this feature? If so, I can contribute.

cdesiniotis commented 1 week ago

@Krast76 when using the gpu-operator, the driver is installed at /run/nvidia/driver on the host. So you can change the "power supply" limit by running sudo chroot /run/nvidia/driver nvidia-smi -pl ${POWER_LIMIT} on the host or exec'ing into the driver daemonset pod and running nvidia-smi -pl ${POWER_LIMIT}. Does that help?
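A minimal sketch of the host-side approach described above, assuming the driver root is /run/nvidia/driver as stated and that the script runs as root. The names `DRIVER_ROOT`, `POWER_LIMIT`, `DRY_RUN`, and `set_power_limit` are illustrative and not part of the gpu-operator. The whole-number check is a choice made for this sketch only.

```shell
#!/bin/sh
# Sketch: apply a GPU power limit through the driver root installed by the gpu-operator.
# Assumptions: run as root on the host; POWER_LIMIT is given in watts.

DRIVER_ROOT="${DRIVER_ROOT:-/run/nvidia/driver}"

# Hypothetical helper: prints the command when DRY_RUN=1, otherwise runs it.
set_power_limit() {
    limit="$1"
    # This sketch only accepts a whole number of watts.
    case "$limit" in
        ''|*[!0-9]*)
            echo "power limit must be a whole number of watts" >&2
            return 2
            ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "chroot $DRIVER_ROOT nvidia-smi -pl $limit"
        return 0
    fi
    chroot "$DRIVER_ROOT" nvidia-smi -pl "$limit"
}

# Only act when POWER_LIMIT is provided in the environment.
if [ -n "${POWER_LIMIT:-}" ]; then
    set_power_limit "$POWER_LIMIT"
fi
```

Saved under a hypothetical name such as set-power-limit.sh, it could be run as `sudo POWER_LIMIT=250 sh set-power-limit.sh`. Adding `DRY_RUN=1` prints the command instead of executing it.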

lengrongfu commented 1 week ago

Could we let the user provide a command that is executed in the driver daemonset after the driver install succeeds? That would make this more universal.

Krast76 commented 6 days ago

> @Krast76 when using the gpu-operator, the driver is installed at /run/nvidia/driver on the host. So you can change the "power supply" limit by running sudo chroot /run/nvidia/driver nvidia-smi -pl ${POWER_LIMIT} on the host or exec'ing into the driver daemonset pod and running nvidia-smi -pl ${POWER_LIMIT}. Does that help?

This is what I did with static nodes. Since I now have autoscaling nodes, I can't set the power limit by hand. To handle that case, I wrote a quick and "dirty" DaemonSet: https://github.com/Krast76/k8s-nvidia-power-limiter. I have been running it since September and it works like a charm.

As I said, it's currently quick and dirty code. If I find the time, I'll add documentation and an example of how to run it, and perhaps refactor the code (better logging, etc.).
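For readers who want the general shape of such a DaemonSet, here is a rough sketch of a container entrypoint. It is not the code from the repository linked above. It assumes a privileged container with the host's /run/nvidia/driver mounted at the same path, and it assumes nvidia-smi sits at usr/bin/nvidia-smi under that root. `wait_for_driver`, `main`, `RETRY_SECONDS`, and `MAX_ATTEMPTS` are hypothetical names.

```shell
#!/bin/sh
# Sketch of an entrypoint for a power-limit DaemonSet container.
# Assumptions: privileged container, host driver root mounted at DRIVER_ROOT,
# POWER_LIMIT given in watts through the pod environment.

DRIVER_ROOT="${DRIVER_ROOT:-/run/nvidia/driver}"
RETRY_SECONDS="${RETRY_SECONDS:-10}"
MAX_ATTEMPTS="${MAX_ATTEMPTS:-60}"

# Returns 0 once nvidia-smi is executable under the driver root,
# or 1 when the attempts run out.
wait_for_driver() {
    attempt=0
    while [ "$attempt" -lt "$MAX_ATTEMPTS" ]; do
        if [ -x "$DRIVER_ROOT/usr/bin/nvidia-smi" ]; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$RETRY_SECONDS"
    done
    return 1
}

main() {
    if ! wait_for_driver; then
        echo "driver not found under $DRIVER_ROOT" >&2
        return 1
    fi
    chroot "$DRIVER_ROOT" nvidia-smi -pl "$POWER_LIMIT" || return 1
    # DaemonSet pods are always restarted on exit, so idle instead of exiting.
    while true; do
        sleep 3600
    done
}

# Only act when POWER_LIMIT is provided in the environment.
if [ -n "${POWER_LIMIT:-}" ]; then
    main
fi
```

Waiting for the driver first matters on autoscaled nodes, where this pod may start before the driver daemonset has finished installing.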