Closed: Moitaeel closed this issue 4 years ago
It's still working for me on an XPS with an Nvidia GTX 1050 Ti and kernel 5.3.11.
Thanks for the feedback michelesr. If it's not the kernel, then something in my build other than the modules is waking the card. I'll try again or do a fresh install.
Could I still use this issue ticket to ask for guidance in debugging what is keeping my card on even after it was removed from the PCI bus? I know now the kernel isn't the issue, but I'm still struggling to figure out how to solve the power management problem: after updating, my temps are 10° hotter, but everything seems to be as it should.
The nvidia-xrun-pm service removed the card:
$ lspci
00:00.0 Host bridge: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 07)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1d.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #14 (rev f0)
00:1d.6 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #15 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device a30d (rev 10)
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
3b:00.0 Non-Volatile memory controller: Device 1cc1:8201 (rev 03)
3c:00.0 Ethernet controller: Qualcomm Atheros Killer E2400 Gigabit Ethernet Controller (rev 10)
3d:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
No active nvidia module: https://pastebin.com/VS2rs4U4
My blacklist.conf in modprobe.d is the same as the ArchWiki's:
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia-uvm
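(A quick way to double-check that the blacklist took effect is to look for these names in lsmod; just a sketch, nothing nvidia-specific beyond the module names above:)

```shell
#!/bin/sh
# Check that no nouveau/nvidia module is currently loaded.
# lsmod prints the module name in the first column.
if lsmod 2>/dev/null | awk '{print $1}' | grep -Eq '^(nouveau|nvidia)'; then
    echo "nvidia/nouveau module still loaded"
else
    echo "no nvidia modules loaded"
fi
```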
Everything works under nvidia-xrun, including Vulkan. I even removed conky and libxnvctrl just to try to figure out what's waking up the card at boot, but no success. Any ideas?
Have you tried checking the tunables in powertop? If the bus ID is set wrongly in the nvidia-xrun config file, then PM won't be enabled for the graphics card controller. By playing with the tunables there you should be able to toggle PM on/off for the PCI Express controller, and also see which bus ID to feed into the nvidia-xrun config so that it manages that for you.
Also, lshw should tell you the controller bus ID, which comes right before that of the graphics card itself.
I opened powertop and couldn't find the graphics card, so I disabled the nvidia-xrun-pm service and rebooted to check the right bus ID. Reopening powertop showed the card, and I changed it to Good, but I did not notice any temperature change. Running cat /sys/bus/pci/devices/0000\:01\:00.0/power/control returns auto, so I guess it did what it was supposed to.
lspci with the nvidia-xrun-pm service disabled shows the bus ID for the card is 01:00.0; from lshw (https://pastebin.com/BPjT5APU), the bus is pci@0000:01:00.0:
$ lspci
00:00.0 Host bridge: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 07)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1d.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #14 (rev f0)
00:1d.6 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #15 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device a30d (rev 10)
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
3b:00.0 Non-Volatile memory controller: Device 1cc1:8201 (rev 03)
3c:00.0 Ethernet controller: Qualcomm Atheros Killer E2400 Gigabit Ethernet Controller (rev 10)
3d:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
My /etc/default/nvidia-xrun:
# When enabled, nvidia-xrun will turn the card on before attempting to load the
# modules and running the command, and turn it off after the command exits and
# the modules get unloaded. In order for this to work, CONTROLLER_BUS_ID and
# DEVICE_BUS_ID must be set correctly. IDs can be found by inspecting the
# output of lshw.
ENABLE_PM=1
# When PM is enabled, remove the card from the system after the command exits
# and the modules unload: the card will be re-added in the next nvidia-xrun
# execution before loading the nvidia module again. This is recommended as Xorg
# and some other programs tend to load the nvidia module if they detect an
# nvidia card in the system, and when the module is loaded the card can't save
# power.
REMOVE_DEVICE=1
# Bus ID of the PCI express controller
CONTROLLER_BUS_ID=0000:00:01.0
# Bus ID of the graphic card
DEVICE_BUS_ID=0000:01:00.0
# Seconds to wait before turning on the card after PCI devices rescan
BUS_RESCAN_WAIT_SEC=1
# Ordered list of modules to load before running the command
MODULES_LOAD=(nvidia nvidia_uvm nvidia_modeset "nvidia_drm modeset=1")
# Ordered list of modules to unload after the command exits
MODULES_UNLOAD=(nvidia_drm nvidia_modeset nvidia_uvm nvidia)
From my /etc/X11/nvidia-xorg.conf (I installed nvidia-xrun-git from the AUR), I guess I have the correct bus ID.
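For what it's worth, the controller bus ID can also be derived from sysfs rather than lshw: the parent directory of the card's PCI device node is the root port it hangs off. A sketch using the IDs from this thread:

```shell
#!/bin/sh
# Derive CONTROLLER_BUS_ID from DEVICE_BUS_ID via the sysfs hierarchy.
# 0000:01:00.0 is the card's bus ID in this thread; adjust as needed.
CARD=0000:01:00.0
dev=/sys/bus/pci/devices/$CARD
if [ -e "$dev" ]; then
    # The parent directory name is the PCIe controller above the card,
    # e.g. 0000:00:01.0 on this machine.
    basename "$(dirname "$(readlink -f "$dev")")"
else
    echo "card $CARD not on the bus (removed, or a different ID)"
fi
```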
Section "Files"
ModulePath "/usr/lib/nvidia"
ModulePath "/usr/lib32/nvidia"
ModulePath "/usr/lib32/nvidia/xorg/modules"
ModulePath "/usr/lib32/xorg/modules"
ModulePath "/usr/lib64/nvidia/xorg/modules"
ModulePath "/usr/lib64/nvidia/xorg"
ModulePath "/usr/lib64/xorg/modules"
EndSection
Section "ServerLayout"
Identifier "layout"
Screen 1 "nvidia"
Inactive "intel"
EndSection
Section "Device"
Identifier "nvidia"
Driver "nvidia"
BusID "PCI:1:0:0"
EndSection
Section "Screen"
Identifier "nvidia"
Device "nvidia"
# Option "AllowEmptyInitialConfiguration" "Yes"
# Option "UseDisplayDevice" "none"
EndSection
Section "Device"
Identifier "intel"
Driver "modesetting"
Option "AccelMethod" "none"
EndSection
Section "Screen"
Identifier "intel"
Device "intel"
EndSection
AFAIK, the card should be turned off if the following requirements are met: the nvidia modules aren't loaded, and the power control for both the card and the PCI controller is set to auto. So if you blacklisted the modules and can verify they aren't loaded, and both power controls are set to auto, it should be okay.
Try setting the power control for the PCI controller 0000:00:01.0 to on and see if powertop reports increased power utilization.
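For reference, that toggle can be done directly from a root shell; power/control is the standard runtime-PM sysfs attribute, and the bus ID below is the one from this thread:

```shell
#!/bin/sh
# Force the PCIe controller's runtime PM to 'on', then watch powertop.
CTL=/sys/bus/pci/devices/0000:00:01.0/power/control
if [ -w "$CTL" ]; then
    echo on > "$CTL"      # keep the controller powered
    cat "$CTL"            # confirm the new setting
else
    echo "controller not present or not running as root"
fi
```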
Hello again michelesr,
I reverted to a snapshot to have some comparison numbers between both states (on and off) on the working kernel/driver.
With the normal snapshot, I did some tests on power control for both 0000:00:01.0 (controller) and 0000:01:00.0 (card). I disabled the nvidia-xrun service and then ran cat /sys/bus/pci/devices/0000\:00\:01.0/power/control, which was at auto, and cat /sys/bus/pci/devices/0000\:01\:00.0/power/control, which was at on.
The power consumption was about 19 W in powertop and idle temps were around 43°. I then set the card to auto with echo "auto" > /sys/bus/pci/devices/0000\:01\:00.0/power/control, but that did not turn the power down, so I set the controller to on with echo "on" > /sys/bus/pci/devices/0000\:00\:01.0/power/control and then back to auto, and the power went down to 10 W and temps to 38°.
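That sequence can be captured in a few lines (a sketch; run as root, bus IDs as in this thread, and only meaningful on a machine with this topology):

```shell
#!/bin/sh
# Controller to 'on', card to 'auto', controller back to 'auto':
# the order that actually dropped power from ~19 W to ~10 W here.
CTRL=/sys/bus/pci/devices/0000:00:01.0/power/control
CARD=/sys/bus/pci/devices/0000:01:00.0/power/control
if [ -w "$CTRL" ] && [ -w "$CARD" ]; then
    echo on   > "$CTRL"
    echo auto > "$CARD"
    echo auto > "$CTRL"
else
    echo "devices not present or not running as root"
fi
```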
I then updated only the kernel to 5.3.11-1 and the drivers to 440.31-1, uninstalled nvidia-xrun, rebooted, and tested again. The power consumption is at ~21 W, temps are at 50°, and with the same commands the power doesn't change at all, no matter the combination.
Again: the controller is at auto when I boot and the card is on. I changed the card to auto, nothing; I changed the controller to on, nothing; changed it back to auto, nothing. The power consumption doesn't change in either powertop or s-tui, and temps stay around 50°.
I have a question: if the card is in use, would changing /sys/bus/pci/devices/0000\:01\:00.0/power/control (or the controller's) turn the card off, or would the card ignore the kernel command?
Also, since power consumption and idle temps are a bit higher than with the working snapshot when testing with both card and controller on, I guess something is being processed by the card (maybe a bug in the driver) and stopping me from shutting it off. Is that a correct assumption?
Any ideas?
I tried nvidia-smi --query-compute-apps=name --format=csv,noheader and then nvidia-smi -q --display=POWER, but no processes are active and the card is still at full power.
If nvidia-smi is able to retrieve information, it probably means the nvidia driver is loaded. That will force the card to stay on.
When the systemd service runs properly at boot, or after quitting an nvidia-xrun X session, the nvidia module shouldn't be loaded in the system, the card shouldn't show up in the lspci output, and nvidia-smi should return this:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
My bad, I failed to mention last time that I had disabled and then uninstalled nvidia-xrun in order to test manually powering off the card via /sys/bus/pci/devices/0000\:01\:00.0/power/control. I checked again after updating from a snapshot, and nvidia-related modules are absent from lsmod.
However, I also noticed a difference between my blacklist.conf and the ArchWiki's. I don't know if it would cause a problem, but since I copied my blacklist.conf from an old build, I had not noticed that some modules changed their naming from underscore to hyphen, so I fixed that. Still, the problem persisted on 5.3.11-1, so that change had no effect on my problem.
But I found a topic on the Manjaro Forum from a user with a very similar setup to mine, mentioning that their optimus-switch stopped working on kernel 5.3. While the method used there is very different from nvidia-xrun, the close specs, and the mention that the problems were likely due to changes in ACPI calls in kernel 5.3, raised a red flag for me that some other changes in kernel 5.3 might have broken nvidia-xrun for me. https://forum.manjaro.org/t/call-for-testing-optimus-switch/75773/220 https://forum.manjaro.org/t/optimus-switch-on-kernel-5-3-lock-up-on-login/111357
That user ended up using kernel 5.4, so I installed it as well (5.4rc7.d1117.g1d4c79e-1) and it fixed my issue. Now everything is working again, at least for now, since it's still experimental.
I guess this might be an issue related to some very specific configurations: I have a Dell laptop with an Intel Core i7-8750H and a GTX 1060 Mobile Max-Q; the user there has a Gigabyte laptop with the same processor but a GTX 1070 Mobile Max-Q.
I guess there's not very much that can be done on the nvidia-xrun side if it's an upstream issue with either nvidia or the kernel PM. It may be worthwhile to see if these issues are reported upstream.
AFAIK, the way PM works for the nvidia card is that if everything is set to auto and there are no modules using the card, then the kernel should power the card off. On my system (Dell XPS 9570 with a GTX 1050 Ti) it's quite evident, because the fans are always kept running by the BIOS when the card is on, so I know that if the fans are off then the card must be off too; the power usage also drops in powertop.
I'm not a big nvidia user, so sometimes I uninstall nvidia-xrun and simply blacklist the modules, and that's enough to keep the card off the grid. In fact, the reason the card needs to be removed from the bus is simply to prevent the nvidia module from being loaded by third parties such as a desktop environment or the X server (when using the modesetting driver).
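With that blacklist-only setup, whether the card actually went to sleep can be read from its runtime_status sysfs attribute (a sketch; the path assumes the card is still on the bus and uses the bus ID from this thread):

```shell
#!/bin/sh
# 'suspended' means runtime PM put the card to sleep; 'active' means
# something is still holding it on.
ST=/sys/bus/pci/devices/0000:01:00.0/power/runtime_status
if [ -f "$ST" ]; then
    cat "$ST"
else
    echo "card not on the bus (removed by nvidia-xrun-pm)"
fi
```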
I tried searching on Bugzilla and the kernel 5.4 mailing list for something that would match the problem. There are some power management changes, but I lack the technical knowledge to understand what is relevant. http://lkml.iu.edu/hypermail/linux/kernel/1909.2/01259.html
I'll close the issue. I just tested kernel 5.3.11-1 and it's still problematic, but it's working on kernel 5.4, so I guess there's a solution incoming. If problems arise on kernel 5.4, it will also be easier to narrow down which change is causing the issue.
Hello again,
I reopened the issue because the problem returned on kernel 5.4.2, and so that it can be easily found by anyone having problems with nvidia-xrun due to this kernel bug. I'll report this to Bugzilla. Since this is an upstream bug, if this issue ticket is inadequate, please close it.
I'm closing this issue again since the problem is not happening on kernel 5.5rc3, and there are already bug reports being worked on that I think are related to the problem, since I share the same bus controller.
EDIT: Issue likely solved for good on kernel 5.7; these problems were likely related to the 0x1901 Intel bridge controller.
Is anyone on the latest kernel having problems with the card not powering off?
I tried updating to kernel 5.3 several times, most recently 5.3.11, and every time I try, my GTX 1060 Max-Q starts warming up despite being removed by the nvidia-xrun-pm service, not being listed in lspci, and no nvidia module being present. After several kernel updates, I suspect the kernel updates could be linked to this behavior. It also increases my suspicion that I started having these problems as soon as Nvidia announced their new Xorg render offload updates.
If I understood the Nvidia staff on their forums correctly, older cards (anything other than RTX) would not power off when using Xorg render offload because they did not have the correct ACPI support to power off. But did the card's behavior in recent kernels change so that it wakes up nonetheless, even when no module is calling upon it?