Polling Nvidia temperature keeps GPU awake

ClementTsang / bottom

Yet another cross-platform graphical process/system monitor.

https://clementtsang.github.io/bottom

MIT License

10.34k stars, 248 forks

Polling Nvidia temperature keeps GPU awake #1291

Open flukejones opened 1 year ago

flukejones commented 1 year ago

Checklist

[X] I've looked through the documentation and existing open issues for this feature/feature request.

Describe the feature request

I noticed that in a recent update the sensors tab (on Linux) gained the dGPU temperature. On hybrid systems this is an issue, as polling it causes the dGPU to stay awake and drain the battery.

I can't see any easy option to disable this one sensor.

ClementTsang commented 1 year ago

This seems more like a bug - could you fill in the bug report form?

ClementTsang commented 1 year ago

That said, I could also look into adding GPU filtering, yes. Curious what that might look like, though. Would filtering by PCI info seem too confusing?

ClementTsang commented 1 year ago

Alternatively, I could filter by name + add options to disable any GPU activities for certain GPU names, in addition to more granular filtering for other widgets. Does the current dGPU show up by name in the temperatures tab? If you have a screenshot, that would be helpful.

jamartin9 commented 1 year ago

> filter by name + add options to disable any GPU activities for certain GPU names

I like the idea. It should probably be done by index, though, to avoid the device initialization that nvml's device_by_index performs when it is called just to get the name.

Alternatively, a whitelist-based approach could support UUID/PCIe names pretty easily via device_by_pci_bus_id and device_by_uuid.
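To illustrate the index-based idea, here is a minimal sketch in Python rather than bottom's own code. It assumes the device count can be obtained without touching the devices. `open_device` and `temperature()` are hypothetical stand-ins for an NVML lookup such as device_by_index and a temperature read; neither is a real API. The point is only that an ignored index is filtered out before any per-device call is made.

```python
def select_gpu_indices(device_count, ignored_indices):
    """Return the GPU indices that may be opened, skipping ignored ones."""
    ignored = set(ignored_indices)
    return [i for i in range(device_count) if i not in ignored]


def collect_temperatures(device_count, ignored_indices, open_device):
    """Read temperatures only from GPUs that are not filtered out.

    open_device is a hypothetical callable standing in for an NVML lookup
    such as device_by_index; it is never called for an ignored index, so
    an ignored GPU is never initialized by this function.
    """
    readings = {}
    for index in select_gpu_indices(device_count, ignored_indices):
        device = open_device(index)  # only reached for allowed indices
        readings[index] = device.temperature()  # hypothetical method
    return readings
```

A whitelist variant would invert the check and open only the listed devices, for example by UUID or PCI bus ID instead of by index.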

Edit: In the short term, build without the gpu feature flag. PR 1276 should allow disabling the GPU via config until filtering is done. This was probably introduced around 0.7.0.

yump commented 11 months ago

Some/all AMD GPUs are also affected. I have an RX580 that doesn't drive any monitors, and reading the hwmons wakes it up and keeps it awake. Unfortunately, it seems the device/power_state file is the only thing I can read without waking the GPU, so in my fan control script I had to work around this by modeling the GPU's idle poweroff logic.

The model is an ON/WARM/OFF state machine. ON reads sensors and utilization, and transitions to WARM if utilization is 0 for some time. WARM reads no sensors or utilization; it transitions to OFF if the power_state file changes to D3hot, or back to ON if the GPU is still in D0 after the elapsed time exceeds a value greater than the GPU's idle power-off timeout. OFF transitions to ON if power_state shows D0.
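The state machine described above could be sketched roughly as follows. This is not the actual fan control script, just an illustration of the transitions as described. The class name, the two thresholds, and their default values are assumptions, and the caller is expected to supply the path to the power_state file (something like /sys/class/drm/card0/device/power_state, which is also an assumed example).

```python
from enum import Enum


class State(Enum):
    ON = "on"
    WARM = "warm"
    OFF = "off"


def read_power_state(path):
    """Return the contents of a power_state file, e.g. "D0" or "D3hot"."""
    with open(path) as f:
        return f.read().strip()


class GpuPollModel:
    """Tracks when sensors may be read without keeping the GPU awake."""

    def __init__(self, idle_before_warm=10.0, warm_timeout=15.0):
        # Both thresholds are assumed values in seconds. warm_timeout must
        # be longer than the GPU's own idle power-off timeout.
        self.idle_before_warm = idle_before_warm
        self.warm_timeout = warm_timeout
        self.state = State.ON
        self.idle_since = None  # time utilization was first seen at 0
        self.warm_since = None  # time WARM was entered

    def should_read_sensors(self):
        # Sensors and utilization are only read in ON.
        return self.state is State.ON

    def step(self, now, power_state, utilization=None):
        """Advance the model; utilization is only meaningful in ON."""
        if self.state is State.ON:
            if utilization == 0:
                if self.idle_since is None:
                    self.idle_since = now
                if now - self.idle_since >= self.idle_before_warm:
                    self.state = State.WARM
                    self.warm_since = now
                    self.idle_since = None
            else:
                self.idle_since = None  # busy again, restart idle timer
        elif self.state is State.WARM:
            if power_state == "D3hot":
                self.state = State.OFF
            elif power_state == "D0" and now - self.warm_since > self.warm_timeout:
                self.state = State.ON  # GPU never powered down
        elif self.state is State.OFF:
            if power_state == "D0":
                self.state = State.ON
        return self.state
```

A polling loop would call `read_power_state` on every tick, read utilization only when `should_read_sensors()` is true, and pass both to `step` along with a monotonic timestamp.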

Theoretically you could also see D3cold, which saves even more power, but the motherboard has to support it somehow and mine seemingly doesn't.

Hmm... It seems that this should perhaps be fixed in the kernel. I have written a note to myself to report this to the hwmon mailing list/bug tracker.