Open dagbdagb opened 6 months ago
Right. So 6.6.30 also fails with the open driver. If this always was the case, then the bug appears to be with the closed driver. And whatever we are looking for happened between kernel 6.7.9 and 6.8.9. sigh. And the entire ticket belongs somewhere else, I presume?
And the entire ticket belongs somewhere else, I presume?
Yes, here: https://forums.developer.nvidia.com/c/gpu-graphics/linux/148
Seeing how this still is open, I might as well continue here.
In the light of this driver being considered as the default in the linux nvidia-drivers, I would like to point out that in order to get RTD3/D3cold working with my Turing 2070 mobile, I must:
Any other combination ends up with "Runtime D3 status: Not supported".
This applies to kernel version 6.9.2-gentoo and nvidia-drivers 555.42.02.
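For anyone wanting to check the same status line on their own machine, here is a minimal sketch. It assumes the driver publishes a per-GPU `power` file under `/proc/driver/nvidia/gpus/` containing a `Runtime D3 status:` line; the function name `rtd3_status` and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Print the "Runtime D3 status" value for every GPU directory found.
# Assumption: per-GPU files live under /proc/driver/nvidia/gpus/<pci-address>/power.
# The optional first argument overrides that root (hypothetical, handy for testing).
rtd3_status() {
    gpu_root="${1:-/proc/driver/nvidia/gpus}"
    for power_file in "$gpu_root"/*/power; do
        # Skip the unexpanded glob when no GPU directory exists.
        [ -r "$power_file" ] || continue
        gpu="$(basename "$(dirname "$power_file")")"
        # Keep only the text after the "Runtime D3 status:" label.
        status="$(sed -n 's/^Runtime D3 status:[[:space:]]*//p' "$power_file")"
        printf '%s: %s\n' "$gpu" "${status:-unknown}"
    done
}
```

Calling `rtd3_status` with no arguments after each driver/option change makes it easier to compare combinations side by side.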
I will happily provide an updated nvidia-bug-report.log.gz if required. If so, let me know if you want it with a particular combo of driver and driver options enabled.
Since you're on Gentoo, can you try a 6.1.x kernel? (Especially 6.1.92, since it works for me with that version.)
I seem to have some issues with D3cold as well.
I can, but is there any point to it? 6.1 is a longterm kernel, sure. But so is 6.6, which is way more recent. Also, try what exactly? Open driver with GPU firmware loading? Does this combo enable D3cold for you? And if so, does it still enter D3cold after a suspend cycle?
Hey there, sorry for the late reply! In the driver readme kernel_open section it says:
Known Issues The following are some known limitations of the open kernel modules versus the proprietary kernel modules with GSP firmware mode disabled: ...
- Run Time D3 (RTD3) is only supported on Ampere and above GPUs.
This isn't a "bug that needs fixing" kind of issue, it's more of a "feature is entirely missing and needs to be coded from scratch". Unlike Ampere+, the proprietary non-GSP implementation of Turing RTD3 doesn't map well to GSP and would require a large effort to enable. I can't give any ETA or anything, but considering that this was never a default-enabled feature even on proprietary, I imagine the priority is gonna be lower than other regressions.
In the meantime, you might want to stay with the proprietary driver with GSP disabled if this is a dealbreaker for you.
Thanks for understanding.
I see.
The effort required is with the firmware, is that it? And yes, dropping the laptop's power consumption by 5-6 W is fairly essential, both for the heat and the fan noise.
Any chance of nvidia publishing a live list of items being worked on / prioritized for the next driver release?
Any chance of nvidia publishing a live list of items being worked on / prioritized for the next driver release?
Honestly? No, no chance. Hard enough to come by that information internally even, but also aside from that historically we've had a very bad time when these publicly shared ETAs slip even by just a few days.
I'm afraid the only straight answer you're gonna get is roughly: "Known issue. Not easy fix. No ETA. Low priority. Here's a workaround (proprietary+disable GSP)". Anything else I could say would be so full of weasel words that it might as well be left unsaid.
Sorry, I know it's not what you want to hear, but it is what it is.
Sorry, I know it's not what you want to hear, but it is what it is.
You're right, @mtijanic . Hate the message, appreciate the messenger.
So, to sum it up:
Bah.
For anyone else finding this: Even with the proprietary driver and GSP disabled, RTD3 on Turing is finicky. A suspend/resume cycle may in some cases cause the card to not enter D3cold again.
Edit: This seems to be a weird sysfs thing; I was looking at the wrong file (/sys/class/drm/card1/device/power/runtime_status [correct] vs /sys/class/drm/card1/power/runtime_status [reports something else, apparently]). Runtime PM is indeed enabled, but doesn't work for... reasons?
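To avoid mixing up the two files mentioned in the edit above, a small helper can print them next to each other. This is only a sketch: the function name `show_runtime_pm` is hypothetical, and the default `card1` path is an assumption since the card index differs between machines.

```shell
#!/bin/sh
# Print runtime PM attributes of a DRM card node and of its parent PCI device.
# Assumption: the card lives at /sys/class/drm/card1; pass another path if not.
show_runtime_pm() {
    card="${1:-/sys/class/drm/card1}"
    for f in device/power/runtime_status device/power/control power/runtime_status; do
        if [ -r "$card/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$card/$f")"
        else
            printf '%s: <not readable>\n' "$f"
        fi
    done
}
```

The `device/power/...` entries belong to the PCI device and are the ones to watch; the plain `power/runtime_status` entry belongs to the DRM node itself.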
Moar Edit: If you value your battery life, do not set nvidia_drm.fbdev=1.
Original: Unless I'm missing something critical (which I may well be), this issue now seems to affect the proprietary kernel modules as well. I've been running an Nvidia-driven display on my hybrid-GPU laptop until very recently, so I can't say exactly when things changed, but here's what I'm currently seeing on the v555.58.02 proprietary modules:
$ modinfo nvidia | rg license
license: NVIDIA
$ modprobe nvidia --showconfig | rg NVreg
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia "NVreg_EnableGpuFirmware=0"
options nvidia "NVreg_DynamicPowerManagement=0x02"
$ cat /sys/class/drm/card1/power/runtime_status
unsupported
And, indeed, the GPU stays in D0 even when it has been able to switch to D3Cold previously (unplugged from wall power, no external display connected, no programs using it).
Is this a known/expected regression?
@LRitzdorf
Is this a known/expected regression?
I don't think so, with options nvidia "NVreg_EnableGpuFirmware=0"; can you verify it is actually disabled? Run:
nvidia-smi -q | grep GSP
If it gives you N/A it's disabled, and if it gives a version number then that param had no effect.
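The check described above can be wrapped in a small filter, sketched here. It assumes the `nvidia-smi -q` output contains a line of the form `GSP Firmware Version : <value>`; the function name `gsp_state` is made up for illustration.

```shell
#!/bin/sh
# Classify GSP firmware state from `nvidia-smi -q` output supplied on stdin.
# Assumption: the relevant line looks like "GSP Firmware Version : <value>".
gsp_state() {
    version="$(sed -n 's/^[[:space:]]*GSP Firmware Version[[:space:]]*:[[:space:]]*//p' | head -n 1)"
    case "$version" in
        "")  echo "unknown" ;;            # line not present in the output
        N/A) echo "disabled" ;;           # no firmware version reported
        *)   echo "enabled ($version)" ;; # a version number means GSP is in use
    esac
}
```

Usage would be `nvidia-smi -q | gsp_state`; an "enabled" result means the NVreg_EnableGpuFirmware=0 option did not take effect.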
Anyway, if it is actually disabled, please shoot a bug report to linux-bugs@nvidia.com, since it has nothing to do with this repo here.
NVIDIA Open GPU Kernel Modules Version
550.78
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Gentoo Linux x86_64 6.7.9-gentoo
Kernel Release
6.7.9-gentoo, own config
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
NVIDIA GeForce RTX 2070 with Max-Q Design
Describe the bug
I noticed my laptop was slightly warmer than expected. This was on 6.8.9-gentoo. A number of reboots later, I can state that:
... is the result, if the nvidia-drivers package is built with the kernel-open flag in gentoo, running gentoo-sources-6.7.9.
If built with -kernel-open (leading '-' implies 'no') I have fine-grained control again. HOWEVER, please also note: I also tried both variants (open/closed kernel driver) on 6.8.9, and there I get 'Not supported' in both cases.
I have not bisected the issue to a particular kernel version. I just happened to have 6.7.9 on disk.
To Reproduce
run gentoo
install gentoo-sources-6.7.9
build/install kernel
build/install nvidia-drivers:
driver options:
udev rules:
# Remove NVIDIA USB Type-C UCSI devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c8000", ATTR{remove}="1"
# Remove NVIDIA Audio devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x040300", ATTR{remove}="1"
# Enable runtime PM for NVIDIA VGA/3D controller devices on driver bind
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="auto"
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="auto"
# Disable runtime PM for NVIDIA VGA/3D controller devices on driver unbind
ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="on"
ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="on"