Bumblebee-Project / Bumblebee

Bumblebee daemon and client rewritten in C
http://www.bumblebee-project.org/
GNU General Public License v3.0

Getting Fallen off the bus dmesg with 4.12.13 #917

Open Xiaoming94 opened 7 years ago

Xiaoming94 commented 7 years ago

I am getting the same error as in #455. The fixes posted there have no effect :(

Also, the /sys/module/rcutree/parameters/rcu_idle_gp_delay file seems to be missing, so the tee method will not work. I don't really have any idea where to set:

CONFIG_HZ_1000=y
CONFIG_HZ=1000

What's strange is that if I use the NVIDIA card as the main card with the proprietary drivers, it actually works fine.

I am on ArchLinux with kernel 4.12.13

alexforencich commented 7 years ago

I have been seeing the same issue for probably at least a year now, also on Arch. It would be nice to see this fixed, as I have been unable to use the nvidia card in my laptop for many months!

optirun verbose output

$ optirun -vvv glxgears -info
[323488.291664] [DEBUG]Reading file: /etc/bumblebee/bumblebee.conf
[323488.292166] [DEBUG]optirun version 3.2.1 starting...
[323488.292181] [DEBUG]Active configuration:
[323488.292195] [DEBUG] bumblebeed config file: /etc/bumblebee/bumblebee.conf
[323488.292211] [DEBUG] X display: :8
[323488.292224] [DEBUG] LD_LIBRARY_PATH: /usr/lib/nvidia:/usr/lib32/nvidia
[323488.292236] [DEBUG] Socket path: /var/run/bumblebee.socket
[323488.292249] [DEBUG] Accel/display bridge: virtualgl
[323488.292263] [DEBUG] VGL Compression: proxy
[323488.292276] [DEBUG] VGLrun extra options: 
[323488.292289] [DEBUG] Primus LD Path: /usr/lib/primus:/usr/lib32/primus
[323488.346264] [INFO]Response: No - error: Could not load GPU driver
[323488.346282] [ERROR]Cannot access secondary GPU - error: Could not load GPU driver
[323488.346286] [DEBUG]Socket closed.
[323488.346305] [ERROR]Aborting because fallback start is disabled.
[323488.346313] [DEBUG]Killing all remaining processes.

dmesg output

[323497.868821] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[323497.869310] NVRM: The NVIDIA GPU 0000:02:00.0
                NVRM: (PCI ID: 10de:0fe4) installed in this system has
                NVRM: fallen off the bus and is not responding to commands.
[323497.869366] nvidia: probe of 0000:02:00.0 failed with error -1
[323497.869396] NVRM: The NVIDIA probe routine failed for 1 device(s).
[323497.869397] NVRM: None of the NVIDIA graphics adapters were initialized!
[323497.869502] nvidia-nvlink: Unregistered the Nvlink Core, major device number 242
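
For anyone trying to confirm they are hitting the same failure, a minimal sketch of a check against the kernel log follows. The helper name gpu_fell_off_bus is hypothetical (it is not part of Bumblebee or the NVIDIA driver), and it only matches the message text quoted in the dmesg output above.

```shell
#!/bin/sh
# Sketch: scan kernel log text on stdin for the NVRM "fallen off the bus" message.
# gpu_fell_off_bus is a hypothetical helper name used only for illustration.
gpu_fell_off_bus() {
    # grep -q exits 0 when the phrase is found, non-zero otherwise
    grep -q 'fallen off the bus'
}

# Usage (reading the kernel log may require root on some systems):
#   dmesg | gpu_fell_off_bus && echo "GPU fell off the bus"
```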

y-usuzumi commented 7 years ago

Confirmed on kernel 4.13.5

Lekensteyn commented 6 years ago

@alexforencich The problem occurs as soon as the device is suspended (either with bbswitch or with runtime PM since kernel 4.8). The problem is not specific to Bumblebee :-/
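
Since the failure is tied to the device having been suspended, it can help to look at the power state before loading the driver. The sketch below assumes the GPU address 0000:02:00.0 from the dmesg output earlier in this thread, and reads /proc/acpi/bbswitch plus the standard power/control and power/runtime_status sysfs attributes. The function name gpu_power_report and the SYSFS_ROOT/PROC_ROOT overrides are hypothetical additions for illustration and testing.

```shell
#!/bin/sh
# Sketch: print the bbswitch state and the runtime PM attributes of a PCI device.
# gpu_power_report, SYSFS_ROOT and PROC_ROOT are hypothetical names for this example.
gpu_power_report() {
    dev="$1"
    sysfs="${SYSFS_ROOT:-/sys}"
    proc="${PROC_ROOT:-/proc}"
    pm="$sysfs/bus/pci/devices/$dev/power"

    # /proc/acpi/bbswitch only exists while the bbswitch module is loaded
    if [ -r "$proc/acpi/bbswitch" ]; then
        echo "bbswitch: $(cat "$proc/acpi/bbswitch")"
    else
        echo "bbswitch: unavailable"
    fi

    # control is "auto" or "on"; runtime_status is e.g. "active" or "suspended"
    for attr in control runtime_status; do
        if [ -r "$pm/$attr" ]; then
            echo "$attr: $(cat "$pm/$attr")"
        else
            echo "$attr: unavailable"
        fi
    done
}

# Usage:
#   gpu_power_report 0000:02:00.0
```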

Xiaoming94 commented 6 years ago

So the current workaround is to use the GPU full-time, as suggested by Nvidia themselves?

Lekensteyn commented 6 years ago

If you don't care about your battery, then that is an option... personally I do care and use the nouveau driver instead of Bumblebee/bbswitch. That allows me to connect external monitors and suspend the GPU in other cases.

What you could try as alternative is to remove the PCI device and rescan before you load nvidia (but after the power is restored).

alexforencich commented 6 years ago

How're you supposed to do that on a laptop?

alexforencich commented 6 years ago

And if this is a known regression since kernel version 4.8, why hasn't it been rolled back yet?

Lekensteyn commented 6 years ago

> And if this is a known regression since kernel version 4.8, why hasn't it been rolled back yet?

Because it fixes other problems. Without this new approach, some Lenovo laptops would suffer memory corruption, others consume more power than necessary, some laptops overheat while suspended. The new functionality introduced with Linux 4.8 matches behavior of Windows 8 and newer which improves compatibility. (Laptop vendors unfortunately still violate specifications and seem to be happy enough that it passes Windows validation tests.)

> How're you supposed to do that on a laptop?

See https://bugs.freedesktop.org/show_bug.cgi?id=75985 and https://devtalk.nvidia.com/default/topic/1024022/linux/gtx-1060-no-audio-over-hdmi-only-hda-intel-detected-azalia. The problem described there is different, but the remove/rescan commands are the same.

Xiaoming94 commented 6 years ago

Isn't it possible to emulate the behaviour of these commands in bbswitch somehow? Or would that break older kernel versions?

alexforencich commented 6 years ago

Well, I tried running

echo 1 > /sys/bus/pci/devices/0000:02:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:00:01.0/rescan

followed by restarting bumblebeed and then attempting to run glxgears with optirun, and I got a very nice general protection fault followed by a total lock-up. So it doesn't appear that that solution alone is a sufficient workaround for this regression.

Lekensteyn commented 6 years ago

> Isn't it possible to emulate the behaviour of these commands in bbswitch somehow? Or would that break older kernel versions?

There is an experimental branch, but I never got around to finishing it because I ran into other problems, such as a system lockup (and an HDMI audio function that prevented suspend). That lockup problem was not limited to the new functionality though; it would also occur with older bbswitch or the current kernel (see #764).

@alexforencich Forgot to mention that you also have to rmmod bbswitch before doing it. It currently assumes that the PCI device is always fixed, but that is not the case when you remove it via sysfs. This should be solved with the experimental branch which changes bbswitch to a proper PCI driver, but that work is not finished.

alexforencich commented 6 years ago

OK, so this seems to work to get the card responsive again on my machine:

systemctl stop bumblebeed
rmmod nvidia
rmmod bbswitch
echo 1 > /sys/bus/pci/devices/0000:02:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:00:01.0/rescan
modprobe bbswitch
systemctl start bumblebeed
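
The sequence above can be wrapped in a guarded script. This is only a sketch under stated assumptions: the GPU and bridge addresses are the ones from this comment and will differ on other machines, and the names run, write_one, reset_gpu, DRY_RUN and SYSFS_ROOT are hypothetical. It defaults to a dry run that prints the commands instead of executing them, checks that the sysfs remove and rescan files exist first, and tolerates rmmod nvidia failing (an assumption, since the module may not be loaded after a failed probe).

```shell
#!/bin/sh
# Sketch of the remove/rescan workaround with a dry-run default.
# All helper and variable names here are hypothetical; adjust GPU/BRIDGE per machine.
GPU="${GPU:-0000:02:00.0}"         # NVIDIA device from the comment above
BRIDGE="${BRIDGE:-0000:00:01.0}"   # parent PCI bridge from the comment above
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print what would be done

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

write_one() {
    # Write "1" to a sysfs attribute, or print the equivalent command in dry-run mode
    if [ "$DRY_RUN" = 1 ]; then echo "+ echo 1 > $1"; else echo 1 > "$1"; fi
}

reset_gpu() {
    remove="$SYSFS_ROOT/bus/pci/devices/$GPU/remove"
    rescan="$SYSFS_ROOT/bus/pci/devices/$BRIDGE/rescan"
    # Bail out early if either sysfs file is missing
    if [ ! -e "$remove" ] || [ ! -e "$rescan" ]; then
        echo "missing $remove or $rescan" >&2
        return 1
    fi
    run systemctl stop bumblebeed || return 1
    run rmmod nvidia || echo "warning: could not unload nvidia (may not be loaded)" >&2
    run rmmod bbswitch || return 1
    write_one "$remove" || return 1
    write_one "$rescan" || return 1
    run modprobe bbswitch || return 1
    run systemctl start bumblebeed
}

# Usage (as root): review the dry-run output of reset_gpu first,
# then set DRY_RUN=0 and call reset_gpu again to actually run the steps.
```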

samcv commented 6 years ago

I wanted to add that on 4.18.1, when this happened to me, /sys/bus/pci/devices/*/remove and /sys/bus/pci/devices/*/rescan did not exist once the card had fallen off the bus. I was able to reboot and things worked again (for now), but I wish there were something I could do to fix this without rebooting, since rescan and remove disappear for me when that happens.

alexforencich commented 6 years ago

This isn't working anymore. After running the above fix and attempting to load the driver with optirun or primusrun, I am now getting the following error:

           NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
           NVRM: BAR0 is 0M @ 0x0 (PCI:0000:02:00.0)

At this point, bumblebee is essentially useless unless this issue can be fixed.
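
The BAR0 complaint can also be checked without loading the driver. As a rough sketch, the first line of the device's sysfs resource file holds the start address, end address and flags of BAR0, and an unassigned region shows up as all zeros. The function name bar0_is_valid and the SYSFS_ROOT override are hypothetical, and treating "start and end both zero" as invalid is an assumption based on the "0M @ 0x0" wording of the NVRM message.

```shell
#!/bin/sh
# Sketch: report whether BAR0 of a PCI device has an address range assigned.
# bar0_is_valid and SYSFS_ROOT are hypothetical names for this example.
# Exit status: 0 = BAR0 assigned, 1 = BAR0 all zeros, 2 = resource file unreadable.
bar0_is_valid() {
    res="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1/resource"
    [ -r "$res" ] || return 2
    # First line of the resource file: "<start> <end> <flags>" for BAR0
    read -r start end _flags < "$res" || return 2
    # Drop the 0x prefix and every 0 digit; nothing left means the value was zero
    start_digits=$(printf '%s' "${start#0x}" | tr -d '0')
    end_digits=$(printf '%s' "${end#0x}" | tr -d '0')
    [ -n "$start_digits" ] || [ -n "$end_digits" ]
}

# Usage:
#   bar0_is_valid 0000:02:00.0 || echo "BAR0 is unassigned or unreadable"
```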

PalinuroSec commented 6 years ago

Same bug here since Linux 4.17.