Open Xiaoming94 opened 7 years ago
I have been seeing the same issue for probably at least a year now, also on Arch. It would be nice to see this fixed, as I have been unable to use the nvidia card in my laptop for many months!
optirun verbose output
$ optirun -vvv glxgears -info
[323488.291664] [DEBUG]Reading file: /etc/bumblebee/bumblebee.conf
[323488.292166] [DEBUG]optirun version 3.2.1 starting...
[323488.292181] [DEBUG]Active configuration:
[323488.292195] [DEBUG] bumblebeed config file: /etc/bumblebee/bumblebee.conf
[323488.292211] [DEBUG] X display: :8
[323488.292224] [DEBUG] LD_LIBRARY_PATH: /usr/lib/nvidia:/usr/lib32/nvidia
[323488.292236] [DEBUG] Socket path: /var/run/bumblebee.socket
[323488.292249] [DEBUG] Accel/display bridge: virtualgl
[323488.292263] [DEBUG] VGL Compression: proxy
[323488.292276] [DEBUG] VGLrun extra options:
[323488.292289] [DEBUG] Primus LD Path: /usr/lib/primus:/usr/lib32/primus
[323488.346264] [INFO]Response: No - error: Could not load GPU driver
[323488.346282] [ERROR]Cannot access secondary GPU - error: Could not load GPU driver
[323488.346286] [DEBUG]Socket closed.
[323488.346305] [ERROR]Aborting because fallback start is disabled.
[323488.346313] [DEBUG]Killing all remaining processes.
dmesg output
[323497.868821] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[323497.869310] NVRM: The NVIDIA GPU 0000:02:00.0
NVRM: (PCI ID: 10de:0fe4) installed in this system has
NVRM: fallen off the bus and is not responding to commands.
[323497.869366] nvidia: probe of 0000:02:00.0 failed with error -1
[323497.869396] NVRM: The NVIDIA probe routine failed for 1 device(s).
[323497.869397] NVRM: None of the NVIDIA graphics adapters were initialized!
[323497.869502] nvidia-nvlink: Unregistered the Nvlink Core, major device number 242
Confirmed on kernel 4.13.5
@alexforencich The problem occurs as soon as the device is suspended (either with bbswitch or with runtime PM since kernel 4.8). The problem is not specific to Bumblebee :-/
So the current workaround is to use the GPU full-time, as suggested by Nvidia themselves?
If you don't care about your battery, then that is an option... personally I do care and use the nouveau driver instead of Bumblebee/bbswitch. That allows me to connect external monitors and suspend the GPU in other cases.
What you could try as an alternative is to remove the PCI device and rescan before you load nvidia (but after the power is restored).
How're you supposed to do that on a laptop?
And if this is a known regression since kernel version 4.8, why hasn't it been rolled back yet?
> And if this is a known regression since kernel version 4.8, why hasn't it been rolled back yet?
Because it fixes other problems. Without this new approach, some Lenovo laptops would suffer memory corruption, others consume more power than necessary, some laptops overheat while suspended. The new functionality introduced with Linux 4.8 matches behavior of Windows 8 and newer which improves compatibility. (Laptop vendors unfortunately still violate specifications and seem to be happy enough that it passes Windows validation tests.)
> How're you supposed to do that on a laptop?
See https://bugs.freedesktop.org/show_bug.cgi?id=75985 and https://devtalk.nvidia.com/default/topic/1024022/linux/gtx-1060-no-audio-over-hdmi-only-hda-intel-detected-azalia. The problem described there is different, but the remove/rescan commands are the same.
Isn't it possible to emulate these commands' behaviour in bbswitch somehow? Or does it break older kernel versions?
Well, I tried running
echo 1 > /sys/bus/pci/devices/0000:02:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:00:01.0/rescan
followed by restarting bumblebeed and then attempting to run glxgears with optirun, and I got a very nice general protection fault followed by a total lock-up. So it doesn't appear that that solution alone is a sufficient workaround for this regression.
> Isn't it possible to emulate these commands' behaviour in bbswitch somehow? Or does it break older kernel versions?
There is an experimental branch, but I never got around to finishing it because I ran into other problems, such as a system lockup (and an HDMI audio function that prevented suspend). That lockup problem was not limited to the new functionality though; it would also occur with older bbswitch or the current kernel (see #764).
@alexforencich Forgot to mention that you also have to rmmod bbswitch before doing it. It currently assumes that the PCI device is always fixed, but that is not the case when you remove it via sysfs. This should be solved with the experimental branch which changes bbswitch to a proper PCI driver, but that work is not finished.
OK, so this seems to work to get the card responsive again on my machine:
systemctl stop bumblebeed
rmmod nvidia
rmmod bbswitch
echo 1 > /sys/bus/pci/devices/0000:02:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:00:01.0/rescan
modprobe bbswitch
systemctl start bumblebeed
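For anyone who wants to repeat the sequence above, here is a minimal sketch that wraps it in a script. The PCI addresses are the ones from this thread and will differ on other machines. `SYSFS_PCI` and `DRY_RUN` are hypothetical knobs added only for illustration; by default the script just prints what it would do. This is not an official Bumblebee tool, only the same commands in order.

```shell
#!/bin/sh
# Sketch of the recovery sequence from this thread.
# GPU/BRIDGE are this thread's addresses; SYSFS_PCI and DRY_RUN are
# hypothetical variables introduced for this example.
GPU="${GPU:-0000:02:00.0}"
BRIDGE="${BRIDGE:-0000:00:01.0}"
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

write_one() {
    # Write "1" to a sysfs control file (needs root when not a dry run).
    echo "+ echo 1 > $1"
    if [ "$DRY_RUN" = "0" ]; then
        echo 1 > "$1"
    fi
}

reset_gpu() {
    run systemctl stop bumblebeed
    run rmmod nvidia
    run rmmod bbswitch          # bbswitch assumes the PCI device never goes away
    write_one "$SYSFS_PCI/$GPU/remove"
    write_one "$SYSFS_PCI/$BRIDGE/rescan"
    run modprobe bbswitch
    run systemctl start bumblebeed
}

reset_gpu
```

Run it as root with `DRY_RUN=0` to actually execute the steps. As reported elsewhere in this thread, the sequence does not work on every machine or kernel.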
I wanted to comment that on 4.18.1, when this happened to me, /sys/bus/pci/devices/*/remove and /sys/bus/pci/devices/*/rescan did not exist once the device had fallen off the bus. I was able to restart and things worked again (for now). Though I wish there was something else I could do to fix this without rebooting, since rescan and remove disappear for me when that happens.
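A quick way to check for this state before attempting the remove/rescan workaround is to test whether the control files are still there. The sketch below assumes the GPU address used earlier in this thread; `SYSFS_PCI` and `pci_control_report` are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# Report whether the per-device sysfs control files exist.
# SYSFS_PCI is a hypothetical override so the check can be pointed elsewhere.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

pci_control_report() {
    # $1 = PCI address, e.g. 0000:02:00.0
    for ctl in remove rescan; do
        if [ -e "$SYSFS_PCI/$1/$ctl" ]; then
            echo "$ctl: present"
        else
            echo "$ctl: missing"
        fi
    done
}

pci_control_report "${1:-0000:02:00.0}"
```

The kernel also exposes a bus-wide /sys/bus/pci/rescan file. Whether writing to it helps once the device is in this state has not been tested here.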
This isn't working anymore. After running the above fix and attempting to load the driver with optirun or primusrun, I am now getting the following error:
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:02:00.0)
At this point, bumblebee is essentially useless unless this issue can be fixed.
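The BAR0 message can be cross-checked from sysfs without loading the driver. Each PCI device directory has a `resource` file whose first line describes BAR0 as `start end flags` in hex. The sketch below reads that line; `bar0_status` is a hypothetical helper name, and the device path is the one from this thread.

```shell
#!/bin/sh
# Summarise BAR0 from a sysfs "resource" file.
# bar0_status is a hypothetical helper written for this example.
bar0_status() {
    # $1 = path to the resource file; first line is "start end flags" in hex.
    read -r start end flags < "$1" || return 2
    if [ $(($start)) -eq 0 ] && [ $(($end)) -eq 0 ]; then
        echo "BAR0 unassigned"
        return 1
    fi
    echo "BAR0 at $start, $(( ($end - $start + 1) / 1048576 )) MiB"
}

# Example call (path assumes the GPU address from this thread):
# bar0_status /sys/bus/pci/devices/0000:02:00.0/resource
```

An all-zero first line corresponds to the "BAR0 is 0M @ 0x0" state reported by NVRM, meaning the rescan did not assign the device a memory region.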
same bug here since linux 4.17
I am getting the same error as per #455. The fixes posted have no effect :(
Also, the /sys/module/rcutree/parameters/rcu_idle_gp_delay file seems to be missing, so the tee method will not work. I don't really have any idea where to set it. What's strange is that if I use NVIDIA as the main card with the proprietary drivers, it actually works fine.
I am on ArchLinux with kernel 4.12.13