Open reibengu opened 2 years ago
Can you please attach the output of nvidia-smi -q, before and after running a graphics app, as well?
Specifically looking for info on the Performance/boost state flag, but all the power info can be useful.
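A minimal sketch for pulling just those fields out of the nvidia-smi -q output. The helper name smi_field is made up for illustration, and the field names are the ones printed by the 525 series log in this thread; other driver versions may label them differently.

```shell
#!/bin/sh
# Print the value of one "Key : Value" field from `nvidia-smi -q` text on stdin.
# Only the first match is printed; keys such as "Graphics" occur in several sections.
smi_field() {
    awk -v key="$1" '
        {
            i = index($0, " : ")
            if (i == 0) next
            k = substr($0, 1, i - 1)
            v = substr($0, i + 3)
            gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim indentation and padding
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; exit }
        }
    '
}

# Query the driver once, then pick the fields of interest from the saved text.
if command -v nvidia-smi >/dev/null 2>&1; then
    log=$(nvidia-smi -q)
    for key in "Performance State" "Power Draw" "GPU Current Temp"; do
        printf '%s: %s\n' "$key" "$(printf '%s\n' "$log" | smi_field "$key")"
    done
fi
```

Running it once while idle and once with a graphics app open gives the before/after pair.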
I have a similar problem with the 525.60.11 driver and the same video card (RTX 3050 Ti Laptop). I have to set this option to be able to use this driver at all:
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
I tried to follow the instructions in the documentation, but without success: https://download.nvidia.com/XFree86/Linux-x86_64/525.60.11/README/dynamicpowermanagement.html
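One way to check whether a module option actually reached the loaded driver is to read it back from /proc/driver/nvidia/params. The sketch below assumes that file lists each key without its NVreg_ prefix as "Key: value", and that both keys queried here appear in it; nvreg_param is a hypothetical helper name. NVreg_DynamicPowerManagement is the option described in the README linked above.

```shell
#!/bin/sh
# Print the value the loaded nvidia module reports for one registry key.
# $1: key without the NVreg_ prefix
# $2: params file (defaults to the path exposed by the driver)
nvreg_param() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }' \
        "${2:-/proc/driver/nvidia/params}"
}

# Only query the live file when the driver is loaded.
for key in DynamicPowerManagement OpenRmEnableUnsupportedGpus; do
    if [ -r /proc/driver/nvidia/params ]; then
        printf '%s=%s\n' "$key" "$(nvreg_param "$key")"
    fi
done
```

If a key prints an empty value, the option was probably not picked up, for example because the initramfs still carries an older modprobe configuration.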
Also, my card doesn't report its power status; every field shows "?". It also ignores the temperature limit set by the manufacturer and boosts to higher clocks than with the proprietary driver. The maximum temperature was 75-80 degrees with the proprietary driver, but it is 85-88 degrees with the open-source one.
$ cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power
Runtime D3 status: ?
Video Memory: ?
GPU Hardware Support:
Video Memory Self Refresh: ?
Video Memory Off: ?
$ nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Fri Dec 16 13:27:00 2022
Driver Version : 525.60.11
CUDA Version : 12.0
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 3050 Ti Laptop GPU
Product Brand : GeForce
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-ce39c7d8-66ee-e4ef-cf66-4400546963ed
Minor Number : 0
VBIOS Version : 94.07.3B.00.B4
MultiGPU Board : No
Board ID : 0x100
Board Part Number : N/A
GPU Part Number : 25A0-775-A1
Module ID : 0
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 525.60.11
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x25A010DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x0A611028
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : 4
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 4096 MiB
Reserved : 334 MiB
Used : 2 MiB
Free : 3759 MiB
BAR1 Memory Usage
Total : 4096 MiB
Used : 2 MiB
Free : 4094 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 44 C
GPU Shutdown Temp : 100 C
GPU Slowdown Temp : 97 C
GPU Max Operating Temp : 87 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : N/A
Power Draw : 6.91 W
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 5501 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 631.250 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1297
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 1 MiB
It also ignores the temperature limit set by the manufacturer and boosts to higher clocks than with the proprietary driver. The maximum temperature was 75-80 degrees with the proprietary driver, but it is 85-88 degrees with the open-source one.
What makes you believe that this is a problem? Your nvidia-smi output shows GPU Max Operating Temp : 87 C, which seems to be in line with the highest temperature you observe, making me suspect that the open source driver's behavior is what's correct given a thermally constrained setup.
Your observation is interesting though, because I've also mostly observed up to 80 °C on various GPUs even when that is not the maximum temperature, so it's more likely that the proprietary driver is doing something incorrect. It's hard to judge, though, because I often suspect that overheating VRAM is what becomes the limiting factor. That is not trivial to confirm, as we still don't have official support for monitoring VRAM temperature, more than 2 years after the hardware was released.
However, it's not clear how your problem is related to the issue posted here. You seem to have a 7 W power draw at idle which, while not great for a laptop, is what's expected according to the first post, so you don't seem to be affected by this specific issue. If you have issues with unnecessarily high power consumption, then please write more about that. More information could help in case Nvidia ever decides to focus on power efficiency again, which seems to have regressed significantly in recent years.
What makes you believe that this is a problem
The proprietary driver says GPU Max Operating Temp: 75 C, but the open-source one says 87 C, which is strange.
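To make that comparison reproducible, the thermal fields can be pulled from one saved nvidia-smi -q log per driver. This is only a sketch: compare_temps and the two log file names are hypothetical, and the field names are the ones from the 525 log above.

```shell
#!/bin/sh
# Print selected thermal limits from two saved `nvidia-smi -q` logs side by side.
# $1, $2: log files, e.g. created with `nvidia-smi -q -d TEMPERATURE > file`
compare_temps() {
    for key in "GPU Max Operating Temp" "GPU Slowdown Temp" "GPU Shutdown Temp"; do
        # Take the first line mentioning the key and keep what follows the colon.
        a=$(grep -m1 -F "$key" "$1" | sed 's/.*: *//')
        b=$(grep -m1 -F "$key" "$2" | sed 's/.*: *//')
        printf '%-24s %-8s %s\n' "$key" "${a:-?}" "${b:-?}"
    done
}

# Hypothetical file names; each log is captured while booted with that driver.
if [ -r proprietary.log ] && [ -r open.log ]; then
    compare_temps proprietary.log open.log
fi
```

A field that is missing from one of the logs is printed as "?".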
However it's not clear how is your problem related to the issue posted here
My GPU doesn't go to sleep when idle, but with the proprietary driver it works as expected. I should probably create a separate post, because OP says it happens for him with the proprietary driver too, whereas for me it only happens with the open-source kernel modules.
Proprietary driver says GPU Max Operating Temp: 75 C
That's surely a better description of the problem. But if the proprietary driver goes up to 80 °C anyway, then even though it shows the limit in the VBIOS, it's suspicious, as mentioned earlier, that it doesn't really obey it. The open source driver may have the wrong limit, but it successfully maxes out the laptop cooler, which is likely not a great one.
My GPU doesn't go to sleep when idle, but with the proprietary driver it works as expected.
What indicates the sleep status you are looking for? It surely sounds like you have a different issue, but I'm curious: aside from power status reporting apparently not being supported, what makes you believe it's not idling as well as it can? I'm not familiar with the laptop-exclusive features, as poor power management and poor support for standards made me simply disable Nvidia GPUs on work laptops. Still, your 7 W idle power consumption matches the OP's desired target, and your GPU seems to be in the P8 state, so I'd expect this to be as good as it gets; yet you imply that the proprietary driver gets better results.
What indicates the sleeping status you are looking for?
With the proprietary driver, the GPU transparently goes to sleep after roughly 20 seconds of inactivity. After that, when I try to access the GPU, the program (nvidia-smi, Blender, CUDA programs) freezes for a second and then starts working normally. I can't even track that, because every request to the GPU wakes it up with a noticeable 1-second freeze. Theoretically it should be possible to track it via this command, or with powertop, but I haven't tested that, since battery life was good with the proprietary driver.
cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
(sorry for the edits, I can't get used to this formatting style)
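For reference, the PCI power attributes can be read in one go. As far as I know, reading these sysfs files does not go through the NVIDIA driver and should not wake the GPU, but treat that as an assumption. pci_pm_state is a made-up helper name, and the power_state attribute may be missing on older kernels.

```shell
#!/bin/sh
# Print the PCI power-management attributes of one device directory.
# $1: sysfs device directory, e.g. /sys/bus/pci/devices/0000:01:00.0
pci_pm_state() {
    dev=$1
    for attr in power_state power/runtime_status power/control; do
        if [ -r "$dev/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dev/$attr")"
        else
            # Attribute not provided by this kernel or device.
            printf '%s=unavailable\n' "$attr"
        fi
    done
}

pci_pm_state /sys/bus/pci/devices/0000:01:00.0
```

A sleeping GPU would be expected to show runtime_status=suspended, while power/control=on means runtime power management is disabled for the device.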
I wouldn't expect the GPU to go into such a deep sleep with a desktop environment running. If you also have an iGPU, then I wouldn't be surprised if what you are missing is graphics duties being handed over to it so that the dGPU can be powered off, but that's a completely different can of worms that I'd only dream of opening once dGPU-only operation is painless.
While I recommend going through with the earlier plan of opening a new issue for your problem, it's unlikely to be resolved anytime soon. Aside from the obvious "OpenRmEnableUnsupportedGpus" part making your setup officially unsupported, power efficiency appears to be one of the lowest-priority issues lately.
It's sad, because the open kernel modules are the only driver that works with my external RTX 3090 Ti via Thunderbolt. =( The proprietary driver doesn't work with it at all because of an RmInitAdapter failed error, but the open-source one works fine. However, it lacks some features for my integrated 3050 Ti, like sleep support and power saving.
Is there an analogue of the OpenRmEnableUnsupportedGpus parameter for the proprietary driver? It might solve all my problems.
I believe you have a misunderstanding about the purpose of OpenRmEnableUnsupportedGpus. This driver is officially meant to be used only on data center GPUs, even though it has far more features than that requires, so you are forced to set the flag as an explicit acknowledgment that, even if your setup mostly works, it's officially not supported.
We've got really far from OP's problem though, so please really consider opening separate issues. You apparently have two: the lack of (deep) sleep support, which appears to be an issue with this driver, and the lack of support for your Thunderbolt setup, which doesn't really belong here. Then again, the GitHub issue tracker is so much better than the official forum, with its intentionally broken native browser search support, that you could try being cheeky and file it here; just provide a really detailed report.
To stay somewhat on topic though, if you do have an iGPU and you don't need the extra compute power on the go, then I'd disable the built-in Nvidia dGPU which should give you better power efficiency, and would use the Thunderbolt setup with this driver when stationary.
I used to have this issue on Fedora Linux 40 running kernel 6.10.10. After upgrading to kernel 6.11.4 the issue has gone away, although nvidia-smi still reports that the GPU is consuming power while powertop shows that it is not.
I tested it with driver version 560.35.03.
I have also tested the same driver version on Ubuntu 24.04 running kernel 6.8, and the issue doesn't seem to be present there either. My graphics card is an RTX 3060 Mobile.
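When nvidia-smi and powertop disagree like this, the battery discharge rate gives a third reading that does not involve the GPU driver. A rough sketch, assuming the laptop is running on battery and that the battery is exposed as BAT0 (the name varies between machines); battery_watts is a hypothetical helper name.

```shell
#!/bin/sh
# Print the battery discharge rate in watts from a power_supply directory.
# $1: battery directory (defaults to BAT0, which is an assumption)
battery_watts() {
    bat=${1:-/sys/class/power_supply/BAT0}
    if [ -r "$bat/power_now" ]; then
        # power_now is reported in microwatts
        awk '{ printf "%.2f\n", $1 / 1000000 }' "$bat/power_now"
    elif [ -r "$bat/current_now" ] && [ -r "$bat/voltage_now" ]; then
        # current_now is in microamps, voltage_now in microvolts
        awk -v ua="$(cat "$bat/current_now")" -v uv="$(cat "$bat/voltage_now")" \
            'BEGIN { printf "%.2f\n", ua * uv / 1e12 }'
    else
        return 1
    fi
}

battery_watts || echo "no battery readings found"
```

Comparing the value with the nvidia module loaded and with the GPU removed shows the whole-system difference, which is the ~14W versus ~7W comparison from the original report.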
NVIDIA Open GPU Kernel Modules Version
515.76
Does this happen with the proprietary driver (of the same version) as well?
Yes
Operating System and Version
Fedora 36
Kernel Release
5.19.12-200.fc36.x86_64
Hardware: GPU
NVIDIA GeForce RTX 3050 Ti Laptop GPU
Describe the bug
If Nvidia drivers are loaded, powertop reports ~14W usage on idle while nvidia-smi shows "No running processes found".
If I blacklist nouveau, nvidia and remove the device using udev rules, powertop reports ~7W usage on idle.
Note: From time to time it can boot with Nvidia drivers loaded and show ~7W usage on idle; this is rare but possible. If I wake up the GPU using nvidia-smi, it will not go back to low power consumption again.
To Reproduce
Install nvidia drivers using this official fedora guide: https://rpmfusion.org/Howto/NVIDIA
Bug Incidence
Always
nvidia-bug-report.log.gz
N/A
More Info
cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power
output: