ClementTsang / bottom

Yet another cross-platform graphical process/system monitor.
https://clementtsang.github.io/bottom
MIT License

Polling Nvidia temperature keeps GPU awake #1291

Open flukejones opened 1 year ago

flukejones commented 1 year ago

Describe the feature request

I noticed that in a recent update the sensors tab (on Linux) gained the dGPU temperature. On hybrid systems this is an issue, as it causes the dGPU to stay awake and drain the battery.

I can't see any easy option to disable this one sensor.

ClementTsang commented 1 year ago

This seems more like a bug - could you fill in the bug report form?

ClementTsang commented 1 year ago

That said, I could also look into adding GPU filtering, yes. Curious what that might look like, though. Would filtering by PCI info seem too confusing?

ClementTsang commented 1 year ago

Alternatively, I could filter by name + add options to disable any GPU activities for certain GPU names, in addition to more granular filtering for other widgets. Does the current dGPU show up by name in the temperatures tab? If you have a screenshot, that would be helpful.

jamartin9 commented 1 year ago

> filter by name + add options to disable any GPU activities for certain GPU names

I like the idea. It should probably be done by index, to avoid the device being initialized by NVML's device_by_index while getting the name.

Alternatively, a whitelist-based approach could support UUID/PCIe names pretty easily via device_by_pci_bus_id and device_by_uuid.

Edit: Short term, build without the gpu feature flag. PR 1276 should allow disabling the GPU via config until filtering is done. This was probably introduced around 0.7.0.

yump commented 9 months ago

Some/all AMD GPUs are also affected. I have an RX580 that doesn't drive any monitors, and reading the hwmons wakes it up and keeps it awake. Unfortunately, it seems the device/power_state file is the only thing I can read without waking the GPU, so in my fan control script I had to work around this by modeling the GPU's idle power-off logic.

The model is an ON/WARM/OFF state machine. ON reads sensors and utilization, and transitions to WARM if utilization is 0 for some time. WARM reads no sensors or utilization; it transitions to OFF if the power_state file changes to D3hot, or back to ON if the file still shows D0 after the elapsed time exceeds a value greater than the GPU's idle power-off timeout. OFF transitions to ON if power_state shows D0.
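
As a rough illustration of the state machine described above, here is a minimal Python sketch. It is not the actual fan control script: the class name `GpuPowerModel`, the two timeout parameters, and the decision to treat D3cold like D3hot are all assumptions made for the example. The caller is expected to read `power_state` itself and to pass utilization only while the model is in ON.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# D3hot is the state named in the comment; D3cold is included as an assumption.
SLEEP_STATES = {"D3hot", "D3cold"}


class State(Enum):
    ON = "on"
    WARM = "warm"
    OFF = "off"


@dataclass
class GpuPowerModel:
    idle_to_warm_s: float  # hypothetical: seconds at 0 utilization before polling stops
    warm_timeout_s: float  # hypothetical: must exceed the GPU's idle power-off timeout
    state: State = State.ON
    idle_since: Optional[float] = None
    warm_since: Optional[float] = None

    def may_read_sensors(self) -> bool:
        """Only the ON state is allowed to touch the hwmon sensors."""
        return self.state is State.ON

    def step(self, now: float, power_state: str,
             utilization: Optional[float] = None) -> State:
        """Advance the model by one poll and return the new state."""
        if self.state is State.ON:
            if utilization == 0:
                if self.idle_since is None:
                    self.idle_since = now  # start of the idle streak
                elif now - self.idle_since >= self.idle_to_warm_s:
                    # Idle long enough: stop polling so the GPU can power down.
                    self.state, self.warm_since = State.WARM, now
                    self.idle_since = None
            else:
                self.idle_since = None  # busy (or unknown) resets the streak
        elif self.state is State.WARM:
            if power_state in SLEEP_STATES:
                self.state = State.OFF
            elif power_state == "D0" and now - self.warm_since > self.warm_timeout_s:
                # Still awake past the idle timeout: something else is using it.
                self.state = State.ON
        elif power_state == "D0":
            # OFF: resume polling only once the device reports D0 again.
            self.state = State.ON
        return self.state
```

A polling loop would call `step()` once per interval and read sensors only when `may_read_sensors()` returns true, so the only file touched in WARM and OFF is `power_state`.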

Theoretically you could also see D3cold, which saves even more power, but the motherboard has to support it somehow and mine seemingly doesn't.

Hmm... It seems that this should perhaps be fixed in the kernel. I have written a note to myself to report this to the hwmon mailing list/bug tracker.

ClementTsang commented 9 months ago

bottom actually already does a fairly simple check with device/power_state, and only grabs further sensor data if that file either does not exist or reads D0/unknown. So yeah, I might need to make the checks a bit more sophisticated... that, or my implementation is bugged. It's a bit frustrating too, since I don't think I have any way to debug this at the moment.
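
For readers following along, the check described above can be sketched roughly as follows. This is a Python illustration of the stated rule only, not bottom's actual code; the function name `should_read_sensors` and the example path are assumptions.

```python
from pathlib import Path

# States in which, per the comment above, further sensor reads are allowed.
AWAKE_STATES = {"D0", "unknown"}


def should_read_sensors(device_dir) -> bool:
    """Return True if sensors under this hwmon device may be read.

    Reads only `<device_dir>/power_state`, and allows further reads
    when that file is missing or reports D0/unknown.
    """
    try:
        state = (Path(device_dir) / "power_state").read_text().strip()
    except FileNotFoundError:
        return True  # no power_state file: treat the device as awake
    return state in AWAKE_STATES
```

It could be called with something like `should_read_sensors("/sys/class/hwmon/hwmon0/device")` (an example path; the hwmon index varies per system) before opening any `temp*_input` file.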

ClementTsang commented 9 months ago

If anyone can check, I would be interested to see whether a simple logic change in https://github.com/ClementTsang/bottom/pull/1355 helps with it.

flukejones commented 9 months ago

@ClementTsang I've tried that branch. Is it supposed to show Nvidia/GPU temps if the GPU is already active? Currently it does not.

ClementTsang commented 9 months ago

The change would hide any entry for any device that's asleep; if the device turns back on, though, in theory it should show up again...

ClementTsang commented 9 months ago

Mostly also just curious whether it stops the GPU from waking, or if there's more that I need to do in that part first.

flukejones commented 9 months ago

> Mostly also just curious whether it stops the GPU from waking, or if there's more that I need to do in that part first.

Seems like I don't.

ClementTsang commented 9 months ago

Hm, so the GPU is still waking up?

flukejones commented 9 months ago

Sorry mate. It looks like I had a brainfart. The dGPU appears to not be waking.

ClementTsang commented 8 months ago

Just merged #1355. Could you check on main whether the output looks reasonable for you and doesn't wake up the dGPU? Thanks!

flukejones commented 8 months ago

It doesn't wake it, but it also doesn't show details when it is awake? It may also be worth reading through this: https://gitlab.com/mission-center-devs/mission-center/-/issues/30#note_1697130114

ClementTsang commented 8 months ago

Hmm... that's weird, thanks for the link. Also just curious, could you provide screenshots of what the temp table looks like on stable and on main now? Thanks!

ClementTsang commented 8 months ago

:facepalm: just realized that I never changed the sleep checks for Nvidia GPUs... let me try looking at that too.