NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

[555.42.02] D3cold on Turing Mobile not working with kernel 6.9.2. Works with closed driver. #640

Open dagbdagb opened 4 months ago

dagbdagb commented 4 months ago

NVIDIA Open GPU Kernel Modules Version

550.78

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Gentoo Linux x86_64 6.7.9-gentoo

Kernel Release

6.7.9-gentoo, own config

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

NVIDIA GeForce RTX 2070 with Max-Q Design

Describe the bug

I noticed my laptop was slightly warmer than expected. This was on 6.8.9-gentoo. A number of reboots later, I can state that:

dagb@gillette:~ (20:32) $ cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power 
Runtime D3 status:          Not supported
Video Memory:               Active

GPU Hardware Support:
 Video Memory Self Refresh: Not Supported
 Video Memory Off:          Supported

... is the result, if the nvidia-drivers package is built with the kernel-open flag in gentoo, running gentoo-sources-6.7.9.

If built with -kernel-open (leading '-' implies 'no') I have fine-grained control again.

HOWEVER, please also note: I also tried both variants (open/closed kernel driver) on 6.8.9, and there I get 'Not supported' in both cases.

I have not bisected the issue to a particular kernel version. I just happened to have 6.7.9 on disk.
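For comparing driver/kernel combinations across reboots, the power file shown above can be read programmatically. The following is a minimal sketch, not part of the driver or its tooling: the helper names `parse_power_status` and `rtd3_supported` are made up for illustration, and the only status value it relies on is the "Not supported" string seen in the output above.

```python
from pathlib import Path


def parse_power_status(text: str) -> dict:
    """Turn the 'Key: value' lines of the power file into a dict.

    Lines without a value (such as the 'GPU Hardware Support:' header)
    and blank lines are skipped.
    """
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            status[key.strip()] = value.strip()
    return status


def rtd3_supported(status: dict) -> bool:
    """Return False when the driver reports 'Not supported' or no status at all."""
    value = status.get("Runtime D3 status")
    return value is not None and value.lower() != "not supported"


if __name__ == "__main__":
    # Path layout taken from the 'cat' command above; one directory per GPU.
    for power_file in sorted(Path("/proc/driver/nvidia/gpus").glob("*/power")):
        status = parse_power_status(power_file.read_text())
        print(power_file.parent.name, status.get("Runtime D3 status"),
              "(RTD3 usable)" if rtd3_supported(status) else "(RTD3 unavailable)")
```

Running it once per boot and noting the kernel and driver variant gives a quick table of which combinations report runtime D3 as available.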

To Reproduce

Remove NVIDIA USB Type-C UCSI devices, if present

ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c8000", ATTR{remove}="1"

Remove NVIDIA Audio devices, if present

ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x040300", ATTR{remove}="1"

Enable runtime PM for NVIDIA VGA/3D controller devices on driver bind

ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="auto"
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="auto"

Disable runtime PM for NVIDIA VGA/3D controller devices on driver unbind

ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="on"
ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="on"
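To confirm that the udev rules above actually took effect, the power/control attribute of the matching PCI devices can be inspected through sysfs. This is a rough sketch under the assumption that PCI devices are exposed under /sys/bus/pci/devices with `vendor`, `class` and `power/control` files; the function name `nvidia_gpu_power_control` is hypothetical.

```python
from pathlib import Path

NVIDIA_VENDOR = "0x10de"
# VGA controller and 3D controller class codes, as matched by the rules above.
GPU_CLASSES = ("0x030000", "0x030200")


def nvidia_gpu_power_control(sysfs_root="/sys/bus/pci/devices"):
    """Map each NVIDIA VGA/3D PCI device to its power/control value.

    The value is 'auto' when runtime PM is allowed, 'on' when it is not,
    and None when the device has no power/control attribute.
    """
    result = {}
    for dev in sorted(Path(sysfs_root).glob("*")):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
        except OSError:
            continue  # not a readable PCI device directory
        if vendor != NVIDIA_VENDOR or pci_class not in GPU_CLASSES:
            continue
        control = dev / "power" / "control"
        result[dev.name] = control.read_text().strip() if control.exists() else None
    return result


if __name__ == "__main__":
    for address, control in nvidia_gpu_power_control().items():
        print(address, control)
```

With the rules active and the driver bound, the GPU would be expected to show `auto`; `on` suggests the bind rule did not fire.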



Bug Incidence

Always

nvidia-bug-report.log.gz

https://github.com/NVIDIA/open-gpu-kernel-modules/files/15287343/nvidia-bug-report.log.gz

More Info

I *think* 6.6.30 works with both open and closed kernel driver. Will give it another spin to verify, and update this ticket with the result.

dagbdagb commented 4 months ago

Right. So 6.6.30 also fails with the open driver. If this was always the case, then the bug appears to be with the closed driver, and whatever we are looking for happened between kernel 6.7.9 and 6.8.9. Sigh. And the entire ticket belongs somewhere else, I presume?

ttabi commented 4 months ago

> And the entire ticket belongs somewhere else, I presume?

Yes, here: https://forums.developer.nvidia.com/c/gpu-graphics/linux/148

dagbdagb commented 3 months ago

Seeing how this still is open, I might as well continue here.

In light of this driver being considered as the default for the Linux nvidia-drivers, I would like to point out that in order to get RTD3/D3cold working with my Turing 2070 mobile, I must:

Any other combination ends up with "Runtime D3 status: Not supported".

This applies to kernel version 6.9.2-gentoo and nvidia-drivers 555.42.02.

I will happily provide an updated nvidia-bug-report.log.gz if required. If so, let me know if you want it with a particular combo of driver and driver options enabled.

XutaxKamay commented 3 months ago

Since you're on Gentoo, can you try a 6.1.x kernel? (Especially 6.1.92, since that version works for me.)

I seem to have some issues with D3cold as well.

dagbdagb commented 3 months ago

I can, but is there any point to it? 6.1 is a longterm kernel, sure. But so is 6.6, which is way more recent. Also, try what exactly? Open driver with GPU firmware loading? Does this combo enable D3cold for you? And if so, does it still enter D3cold after a suspend cycle?

mtijanic commented 3 months ago

Hey there, sorry for the late reply! In the driver readme kernel_open section it says:

Known Issues

The following are some known limitations of the open kernel modules versus the proprietary kernel modules with GSP firmware mode disabled: ...

  • Run Time D3 (RTD3) is only supported on Ampere and above GPUs.

This isn't a "bug that needs fixing" kind of issue, it's more of a "feature is entirely missing and needs to be coded from scratch". Unlike Ampere+, the proprietary non-GSP implementation of Turing RTD3 doesn't map well to GSP and would require a large effort to enable. I can't give any ETA or anything, but considering that this was never a default-enabled feature even on proprietary, I imagine the priority is gonna be lower than other regressions.

In the meantime, you might want to stay with the proprietary driver with GSP disabled if this is a dealbreaker for you.

Thanks for understanding.

dagbdagb commented 3 months ago

I see.

The effort required is with the firmware, is that it? And yes, dropping the laptop's power consumption by 5-6 W is fairly essential, both for the heat and the fan noise.

Any chance of NVIDIA publishing a live list of items being worked on / prioritized for the next driver release?

mtijanic commented 3 months ago

> Any chance of nvidia publishing a live list of items being worked on / prioritized for the next driver release?

Honestly? No, no chance. It's hard enough to come by that information even internally, and aside from that, historically we've had a very bad time when publicly shared ETAs slip even by just a few days.

I'm afraid the only straight answer you're gonna get is roughly: "Known issue. Not easy fix. No ETA. Low priority. Here's a workaround (proprietary+disable GSP)". Anything else I could say would be so full of weasel words that it might as well be left unsaid.

Sorry, I know it's not what you want to hear, but it is what it is.

dagbdagb commented 3 months ago

> Sorry, I know it's not what you want to hear, but it is what it is.

You're right, @mtijanic . Hate the message, appreciate the messenger.

So, to sum it up:

Bah.

For anyone else finding this: Even with the proprietary driver and GSP disabled, RTD3 on Turing is finicky. A suspend/resume cycle may in some cases cause the card to not enter D3cold again.

izmyname commented 3 months ago

> For anyone else finding this: Even with the proprietary driver and GSP disabled, RTD3 on Turing is finicky. A suspend/resume cycle may in some cases cause the card to not enter D3cold again.

Novideo being novideo. Not a single release without xid errors, regressions, kernel panics during shutdown and not properly working functionality.

LRitzdorf commented 1 month ago

Edit: This seems to be a weird sysfs thing; I was looking at the wrong file (/sys/class/drm/card1/device/power/runtime_status [correct] vs /sys/class/drm/card1/power/runtime_status [reports something else, apparently]). Runtime PM is indeed enabled, but doesn't work for... reasons?

Moar Edit: If you value your battery life, do not set nvidia_drm.fbdev=1.

Original: Unless I'm missing something critical (which I may well be), this issue now seems to affect the proprietary kernel modules as well. I've been running an Nvidia-driven display on my hybrid-GPU laptop until very recently, so I can't say exactly when things changed, but here's what I'm currently seeing on the v555.58.02 proprietary modules:

$ modinfo nvidia | rg license
license:        NVIDIA

$ modprobe nvidia --showconfig | rg NVreg
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia "NVreg_EnableGpuFirmware=0"
options nvidia "NVreg_DynamicPowerManagement=0x02"

$ cat /sys/class/drm/card1/power/runtime_status
unsupported

And, indeed, the GPU stays in D0 even when it has been able to switch to D3Cold previously (unplugged from wall power, no external display connected, no programs using it).

Is this a known/expected regression?
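Since the two runtime_status files mentioned in the edit above are easy to mix up, reading both side by side makes the difference visible. This is a small sketch only: the function name `runtime_status` is hypothetical, and `card1` is simply the card index from this report, which may differ on other machines.

```python
from pathlib import Path


def runtime_status(card="card1", drm_root="/sys/class/drm"):
    """Read runtime_status from the DRM node and from the underlying PCI device.

    Returns a dict with one entry per path; the value is None when the
    file does not exist or cannot be read.
    """
    base = Path(drm_root) / card
    paths = {
        # Path used in the 'cat' command above (the misleading one).
        "drm node": base / "power" / "runtime_status",
        # Path of the underlying device, described as the correct one in the edit.
        "pci device": base / "device" / "power" / "runtime_status",
    }
    result = {}
    for label, path in paths.items():
        try:
            result[label] = path.read_text().strip()
        except OSError:
            result[label] = None
    return result


if __name__ == "__main__":
    for label, value in runtime_status().items():
        print(f"{label}: {value}")
```

If the "drm node" entry says `unsupported` while the "pci device" entry shows a real state, runtime PM is enabled and the first file is simply not the one to watch.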

mtijanic commented 1 month ago

@LRitzdorf

> Is this a known/expected regression?

I don't think so, not with options nvidia "NVreg_EnableGpuFirmware=0". Can you verify that it is actually disabled? Run:

nvidia-smi -q | grep GSP

If it gives you N/A it's disabled, and if it gives a version number then that param had no effect.
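The same check can be scripted. The sketch below mirrors the grep: it looks for the first line of `nvidia-smi -q` output whose label contains "GSP" and treats a value of N/A as "GSP disabled". The helper names are made up for illustration, and the exact label text in the output is an assumption, which is why the match is on "GSP" only.

```python
import subprocess


def gsp_firmware_version(smi_output: str):
    """Return the value of the first 'label : value' line whose label mentions GSP."""
    for line in smi_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and "GSP" in key:
            return value.strip()
    return None  # no GSP line found at all


def gsp_enabled(smi_output: str):
    """True if a firmware version is reported, False for N/A, None if unknown."""
    version = gsp_firmware_version(smi_output)
    if version is None:
        return None
    return version.upper() != "N/A"


def read_smi() -> str:
    """Run 'nvidia-smi -q' and return its text output (requires the driver tools)."""
    return subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    ).stdout
```

On a machine with the driver installed, `print(gsp_enabled(read_smi()))` printing `True` would mean the NVreg_EnableGpuFirmware=0 option had no effect.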

Anyway, if it is actually disabled, please shoot a bug report to linux-bugs@nvidia.com, since it has nothing to do with this repo here.