NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

High power consumption in idle / unable to turn GPU off #382

Open reibengu opened 2 years ago

reibengu commented 2 years ago

NVIDIA Open GPU Kernel Modules Version

515.76

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

Fedora 36

Kernel Release

5.19.12-200.fc36.x86_64

Hardware: GPU

NVIDIA GeForce RTX 3050 Ti Laptop GPU

Describe the bug

If the Nvidia drivers are loaded, powertop reports ~14 W usage at idle while nvidia-smi shows "No running processes found".

If I blacklist nouveau and nvidia and remove the device using udev rules, powertop reports ~7 W usage at idle.

Note: Occasionally the system boots with the Nvidia drivers loaded and still shows ~7 W usage at idle; this is rare but possible. If I then wake up the GPU using nvidia-smi, it does not go back to low power consumption.

To Reproduce

Install the NVIDIA drivers by following the RPM Fusion guide for Fedora: https://rpmfusion.org/Howto/NVIDIA

Bug Incidence

Always

nvidia-bug-report.log.gz

N/A

More Info

cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power output:

Runtime D3 status:          Enabled (fine-grained)
Video Memory:               Active

GPU Hardware Support:
 Video Memory Self Refresh: Supported
 Video Memory Off:          Supported

Power Limits:
 Default:                   40000 milliwatts
 GPU Boost:                 40000 milliwatts
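
For anyone who wants to log this status over time instead of eyeballing it, here is a minimal Python sketch that parses the output above. The names `parse_power_file` and `gpu_looks_awake` are made up for illustration, and treating `Video Memory: Active` as "GPU awake" is an assumption based on this report, not a documented guarantee.

```python
# Sample text copied from the report above.
SAMPLE = """Runtime D3 status:          Enabled (fine-grained)
Video Memory:               Active

GPU Hardware Support:
 Video Memory Self Refresh: Supported
 Video Memory Off:          Supported

Power Limits:
 Default:                   40000 milliwatts
 GPU Boost:                 40000 milliwatts
"""


def parse_power_file(text):
    """Turn the 'key: value' lines of the power file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a colon
        key, value = key.strip(), value.strip()
        if value:  # skip section headers such as "GPU Hardware Support:"
            fields[key] = value
    return fields


def gpu_looks_awake(fields):
    """Assumption: 'Video Memory: Active' means the GPU is not suspended."""
    return fields.get("Video Memory") == "Active"


fields = parse_power_file(SAMPLE)
print(fields["Runtime D3 status"], "| awake:", gpu_looks_awake(fields))
```

On a real system the text would come from reading /proc/driver/nvidia/gpus/0000:01:00.0/power (the bus ID differs per machine) instead of `SAMPLE`.
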

vans163 commented 2 years ago

Same problem https://forums.developer.nvidia.com/t/idle-power-usage-stuck-at-10-20watts-after-running-an-app/217520

niv commented 2 years ago

Can you please attach the output of nvidia-smi -q, before/after running a graphics app, as well?

Specifically looking for info on the Performance/boost state flag, but all the power info can be useful.
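
A rough Python sketch of how the two `nvidia-smi -q` dumps could be compared once they are saved to files. The helper names `extract_fields` and `compare_snapshots` are hypothetical, and the field labels ("Performance State", "Power Draw") are taken from the 525.60.11 output posted in this thread; other driver versions may label them differently.

```python
WANTED = ("Performance State", "Power Draw")


def extract_fields(report, wanted=WANTED):
    """Pick the first occurrence of each wanted label from `nvidia-smi -q` text."""
    found = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted and key not in found:
            found[key] = value.strip()
    return found


def compare_snapshots(before, after, wanted=WANTED):
    """Return {label: (before_value, after_value)} for labels that changed."""
    b = extract_fields(before, wanted)
    a = extract_fields(after, wanted)
    return {k: (b.get(k), a.get(k)) for k in wanted if b.get(k) != a.get(k)}
```

Usage would be something like `nvidia-smi -q > before.txt`, run the graphics app, wait for the GPU to idle again, `nvidia-smi -q > after.txt`, then pass the contents of both files to `compare_snapshots`.
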

scaledteam commented 1 year ago

I have a similar problem with the 525.60.11 driver and the same video card (RTX 3050 Ti Laptop). I have to set this option to be able to use this driver at all:

options nvidia NVreg_OpenRmEnableUnsupportedGpus=1

I tried to follow the instructions in the documentation, but without success: https://download.nvidia.com/XFree86/Linux-x86_64/525.60.11/README/dynamicpowermanagement.html

Also, my card doesn't report its power status at all (every field below shows "?"). It also ignores the temperature limit set by the manufacturer and boosts to higher clocks than with the proprietary driver: the maximum temperature was 75-80 degrees with the proprietary driver, but it is 85-88 degrees with the open-source one.

$ cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power
Runtime D3 status:          ?
Video Memory:               ?

GPU Hardware Support:
 Video Memory Self Refresh: ?
 Video Memory Off:          ?

$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Fri Dec 16 13:27:00 2022
Driver Version                            : 525.60.11
CUDA Version                              : 12.0

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 3050 Ti Laptop GPU
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-ce39c7d8-66ee-e4ef-cf66-4400546963ed
    Minor Number                          : 0
    VBIOS Version                         : 94.07.3B.00.B4
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 25A0-775-A1
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 525.60.11
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x25A010DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x0A611028
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 4096 MiB
        Reserved                          : 334 MiB
        Used                              : 2 MiB
        Free                              : 3759 MiB
    BAR1 Memory Usage
        Total                             : 4096 MiB
        Used                              : 2 MiB
        Free                              : 4094 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 44 C
        GPU Shutdown Temp                 : 100 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 87 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : N/A
        Power Draw                        : 6.91 W
        Power Limit                       : N/A
        Default Power Limit               : N/A
        Enforced Power Limit              : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 5501 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 631.250 mV
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1297
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 1 MiB

voidpointertonull commented 1 year ago

And it also ignores temperature limit which was set by manufacturer, and boosts to higher clocks compared to proprietary driver. 75-80 degrees was max temperature in proprietary driver, in open-source one it's 85-88 degrees.

What makes you believe that this is a problem? Your nvidia-smi output shows GPU Max Operating Temp : 87 C, which is in line with the highest temperature you observe, so I suspect the open source driver's behavior is the correct one for a thermally constrained setup. Your observation is interesting though: I've also mostly seen up to 80 °C on various GPUs even though that is not the maximum temperature, so it's more likely that the proprietary driver is doing something incorrect. It's hard to judge, though, because I often suspect that overheating VRAM becomes the limiting factor. That isn't trivial to confirm, as we still don't have official support for monitoring it more than 2 years after the hardware was released.

However, it's not clear how your problem is related to the issue posted here. You seem to have 7 W power draw at idle, which, while not great for a laptop, is what the first post treats as expected, so you don't seem to be affected by this specific issue. If you have issues with unnecessarily high power consumption, please write more about that. More information could help in case Nvidia ever decides to focus on power efficiency again, which seems to have regressed significantly in recent years.

scaledteam commented 1 year ago

What makes you believe that this is a problem

The proprietary driver reports GPU Max Operating Temp: 75 C, but the open-source one reports 87 C, which is strange.

However it's not clear how is your problem related to the issue posted here

My GPU doesn't go to sleep when idle, but with the proprietary driver it works as expected. I should probably create a separate post, because OP says it happens for him with the proprietary driver too, whereas for me it only happens with the open-source kernel modules.

voidpointertonull commented 1 year ago

Proprietary driver says GPU Max Operating Temp: 75 C

That's surely a better description of the problem. But if the proprietary driver goes up to 80 °C anyway, then even if it shows the limit from the VBIOS, it's suspicious, as mentioned earlier, that it doesn't really obey it. The open source driver may have the wrong limit, but it successfully maxes out the likely not exactly great laptop cooler.

My GPU doesn't go to sleep in idle, but in proprietary driver it working as expected.

What indicates the sleeping status you are looking for? It surely sounds like you have a different issue, but I'm curious: aside from power status reporting apparently not being supported, what makes you believe it's not idling as well as it can? I'm not familiar with the laptop-exclusive features, as poor power management and poor support of standards made me disable Nvidia GPUs on work laptops. Your 7 W idle power consumption matches the OP's desired target, and your GPU seems to be in the P8 state, so I'd expect this to be as good as it gets, yet you imply that the proprietary driver gets better results.

scaledteam commented 1 year ago

What indicates the sleeping status you are looking for?

With the proprietary driver, the GPU transparently goes to sleep after roughly 20 seconds of inactivity. After that, when I try to access the GPU, the program (nvidia-smi, Blender, CUDA programs) freezes for a second and then starts working normally. I can't even observe the sleep state with those tools, because every request to the GPU wakes it up with a noticeable 1-second freeze. In theory it should be possible to track it with this command or with powertop, but I have not tested that; battery life was good with the proprietary driver.

cat /sys/bus/pci/devices/0000\:01\:00.0/power_state

(Sorry for the edits, I can't get used to this formatting style.)
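
The sysfs check mentioned above can be sketched in Python roughly like this. It is untested on this hardware; `read_pm_state` is a made-up helper name, and whether every attribute exists depends on the kernel version (older kernels may lack `power_state`), which is why missing files are reported as `None` rather than treated as errors.

```python
from pathlib import Path

# Standard PCI / runtime-PM sysfs attributes, relative to the device directory.
ATTRIBUTES = ("power_state", "power/control", "power/runtime_status")


def read_pm_state(device_dir):
    """Return the power-management attributes of one PCI device as a dict."""
    base = Path(device_dir)
    state = {}
    for name in ATTRIBUTES:
        try:
            state[name] = (base / name).read_text().strip()
        except OSError:
            state[name] = None  # attribute missing on this kernel, or unreadable
    return state
```

For the GPU in this thread the call would be `read_pm_state("/sys/bus/pci/devices/0000:01:00.0")`. These are plain sysfs reads rather than requests to the GPU through nvidia-smi or CUDA, so the idea is that polling them avoids the wake-up described above, but that is an assumption worth verifying with powertop.
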

voidpointertonull commented 1 year ago

I wouldn't expect the GPU to go to such a deep sleep with a desktop environment running. If you also have an iGPU, then I wouldn't be surprised if you were missing graphics duties being handed over to that so the dGPU could get powered off, but that's a completely different can of worms I'd only dream of opening after dGPU-only operation is already painless.

While I recommend going through with the earlier plan of opening a new issue for your problem, it's unlikely to be resolved anytime soon. Aside from the obvious "OpenRmEnableUnsupportedGpus" part making your setup officially unsupported, power efficiency appears to be one of the lowest-priority issues lately.

scaledteam commented 1 year ago

It's sad, because these open kernel modules are the only driver that works with my external RTX 3090 Ti via Thunderbolt. =( The proprietary driver doesn't work with it at all because of an RmInitAdapter failed error, but the open-source one works fine. However, it lacks some features for my built-in 3050 Ti, like sleep support and power saving.

Is there an analogue of the OpenRmEnableUnsupportedGpus parameter for the proprietary driver? It might solve all my problems.

voidpointertonull commented 1 year ago

I believe you have a misunderstanding about the purpose of OpenRmEnableUnsupportedGpus. This driver is officially meant to be used only on data center GPUs even if there are way more features than what's needed for that, so you are forced to use a flag as an explicit step of pretty much stating that you understand that even if your setup mostly works, it's officially not supported.

We got really far from OP's problem though, so please do consider opening separate issues. You apparently have two: the lack of (deep) sleep support, which appears to be an issue with this driver, and the lack of support for your Thunderbolt setup in the proprietary driver, which doesn't really belong here. Then again, the GitHub issue tracker is so much better than the official forum, with its intentionally broken native browser search support, that you could try being cheeky and report it here; just provide a really detailed report.

To stay somewhat on topic though, if you do have an iGPU and you don't need the extra compute power on the go, then I'd disable the built-in Nvidia dGPU which should give you better power efficiency, and would use the Thunderbolt setup with this driver when stationary.

ejrydhfs commented 1 month ago

I used to have this issue on Fedora Linux 40 running kernel version 6.10.10. After upgrading to kernel version 6.11.4 the issue has gone away, but nvidia-smi still reports that the GPU is consuming power, while powertop shows it is not consuming any.

ejrydhfs commented 1 month ago

I tested it using driver version 560.35.03.

ejrydhfs commented 1 month ago

I have also tested the same driver version on Ubuntu 24.04 running kernel version 6.8, and the issue doesn't seem to be present there either. My graphics card is an RTX 3060 Mobile.