EGLWayland F30/19.04 lockup on app launch

sos-michael commented 5 years ago

So I gave eglstreams/eglwayland a spin on GNOME 3.32 and got the strangest behaviour.

Setup: This occurs with both the 418 and 430 drivers on both F30 and Ubuntu 19.04. Both ship with some version of kernel 5.0. I am using a P4000 but also have 5 Radeon Pro Duo cards around for ROCm. This does not happen on X.

If I launch an app after the desktop loads, there is a hang of about 10-15 seconds during which the mouse does not move and the app does not load. The display itself still seems to be okay, since I can see the cursor pinwheeling, but I can't move the mouse and the UI does not respond to the keyboard. After the hang completes, everything returns to normal and I can happily use the app.

GNOME Shell itself is very smooth and does not seem to load more slowly than expected.

I'm currently on F30 with 418 and can provide whatever logs might be helpful, but I'm not sure where to start.

mvicomoya commented 5 years ago

I don't think I've ever seen this myself. Can you double-check whether you are actually getting NVIDIA hardware acceleration, or whether it is falling back to software rendering for some reason?

You can just run wflinfo -p wayland -a gl (from one of the waffle packages).
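For anyone following along, here is a minimal sketch of how that output could be checked automatically. It assumes wflinfo prints "OpenGL vendor string" and "OpenGL renderer string" lines; the helper name classify_renderer and the matched substrings (llvmpipe, softpipe, swrast, NVIDIA) are my own assumptions and may need adjusting for your driver.

```shell
# Hypothetical helper: classify wflinfo output read from stdin.
# Prints "software", "nvidia", or "other".
classify_renderer() {
    # Keep only the vendor and renderer lines of the report.
    info=$(grep -iE 'OpenGL (vendor|renderer) string')
    case "$info" in
        # Common Mesa software rasterizer names (assumed list).
        *llvmpipe*|*softpipe*|*swrast*) echo software ;;
        *NVIDIA*) echo nvidia ;;
        *) echo other ;;
    esac
}
```

Usage would be something like: wflinfo -p wayland -a gl | classify_renderer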

sos-michael commented 5 years ago

You're right! It's totally trying to use one of my AMD GPUs, but then how is the output making it to my P4000?

Is there perhaps a setting I have missed? I've set nvidia-drm.modeset=1 as a kernel parameter.

I've edited /usr/lib/udev/rules.d/61-gdm.rules to ensure it does not block Wayland when it detects the NVIDIA driver,

and edited /etc/gdm/custom.conf to allow Wayland.
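A quick way to confirm the kernel parameter actually reached the running kernel is to look for it in /proc/cmdline. The sketch below is only an illustration; the function name has_kernel_param is made up, and the optional second argument exists only so the check can be pointed at a test file.

```shell
# Hypothetical helper: has_kernel_param PARAM [CMDLINE_FILE]
# Succeeds (exit 0) if PARAM appears as a whole word on the kernel command line.
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # Split the command line into one token per line, then match exactly.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

For example: has_kernel_param nvidia-drm.modeset=1 && echo "modeset is on the command line"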

mvicomoya commented 5 years ago

I wonder if this is a GLVND vendor preference kind of issue. Can you check what configuration files you have under /usr/share/glvnd/egl_vendor.d/ and their order?

You can actually force a specific vendor with the following environment variable: export __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json
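To see the vendor files and their order, something like the sketch below should work. As far as I know GLVND picks up the JSON files in filename order, so a lower number is tried first; the function name list_egl_vendors is hypothetical, and the directory argument defaults to the path mentioned above.

```shell
# Hypothetical helper: list_egl_vendors [DIR]
# Prints the EGL vendor JSON files in DIR, sorted by filename.
list_egl_vendors() {
    dir=${1:-/usr/share/glvnd/egl_vendor.d}
    for f in "$dir"/*.json; do
        # Skip the unexpanded pattern when the directory has no JSON files.
        [ -e "$f" ] && basename "$f"
    done | LC_ALL=C sort
}
```

With both 10_nvidia.json and 50_mesa.json installed, the NVIDIA entry should be listed first.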

I hope that helps.

sos-michael commented 5 years ago

I removed 50_mesa.json. The system now attempts to start Wayland, then gives up and falls back to X. I get: Unrecoverable failure in required component org.gnome.Shell.desktop

mvicomoya commented 5 years ago

I checked with @erik-kz, who most recently tested these mechanisms, and he mentioned he never had much trouble getting this working out of the box with the Arch packages. I was wondering whether you might still need to manually build mutter with EGLStreams support, but presumably Fedora builds it that way by default.

Since you mentioned you have multiple GPUs in your system, perhaps you are hitting this error condition? https://gitlab.gnome.org/GNOME/mutter/blob/master/src/backends/native/meta-renderer-native.c#L3872

Can you check your systemd error messages?
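A rough sketch for narrowing the journal down to the relevant lines, assuming the interesting messages mention EGL, mutter, or gnome-shell. The helper name filter_egl_errors and the keyword list are my own choices, not anything the tools define.

```shell
# Hypothetical helper: keep only journal lines that mention the
# compositor or EGL (case-insensitive). Reads from stdin.
filter_egl_errors() {
    grep -iE 'egl|mutter|gnome-shell|org\.gnome\.Shell'
}
```

It could be fed from the current boot's error-level messages: journalctl -b -p err --no-pager | filter_egl_errors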

sos-michael commented 5 years ago

This is the only EGL related error: Failed to create backend: The GPU /dev/dri/card10 chosen as primary is not supported by EGL.

Perhaps there is a switch somewhere I am missing, because card10 is indeed my nvidia GPU as listed in the gdm-x-session fallback output: (**) OutputClass "nvidia" setting /dev/dri/card10 as PrimaryGPU
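With this many GPUs it helps to confirm which kernel driver sits behind each /dev/dri/cardN node. Below is a minimal sketch that reads the driver symlink under /sys/class/drm; the function name list_drm_drivers is hypothetical, and the optional argument is only there so it can be pointed at a different directory.

```shell
# Hypothetical helper: list_drm_drivers [SYSFS_DRM_DIR]
# Prints "/dev/dri/cardN <kernel driver>" for each DRM card.
list_drm_drivers() {
    root=${1:-/sys/class/drm}
    for card in "$root"/card*; do
        name=$(basename "$card")
        # Skip connector entries such as card0-DP-1.
        case "$name" in *-*) continue ;; esac
        # The driver symlink points at the bound kernel driver.
        [ -L "$card/device/driver" ] || continue
        driver=$(basename "$(readlink "$card/device/driver")")
        echo "/dev/dri/$name $driver"
    done
}
```

On my setup I would expect card10 to show up next to nvidia and the Radeon cards next to amdgpu.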

sos-michael commented 5 years ago

I think I got it, it's this: 18:54:24 systemd-udevd: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c 195 255'' failed with exit code 1.

But I don't know what to do about it. It seems it was added to the rpmfusion repo here: https://github.com/negativo17/nvidia-driver/issues/27

sos-michael commented 5 years ago

I patched the mknod errors by changing the module requirements in: /usr/lib/udev/rules.d/60-nvidia.rules from "nvidia" to "nvidia_drm". But I'm still getting this error:

Failed to create backend: The GPU /dev/dri/card10 chosen as primary is not supported by EGL.

I'm sort of out of ideas.

erik-kz commented 5 years ago

It looks like that error overwrites the one Miguel linked to (specifically G_IO_ERROR), but the latter still might be the root cause. Would you mind blacklisting the radeon driver temporarily to check if that's the case? Either add "modprobe.blacklist=radeon" to your kernel parameters or add "blacklist radeon" to a *.conf file in /etc/modprobe.d and reboot.
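The second option could be scripted roughly as follows. This is only a sketch: the function name write_blacklist and the file name blacklist-DRIVER.conf are arbitrary choices, and writing to /etc/modprobe.d needs root.

```shell
# Hypothetical helper: write_blacklist DRIVER [DIR]
# Writes a one-line modprobe blacklist file for DRIVER into DIR.
write_blacklist() {
    driver=$1
    dir=${2:-/etc/modprobe.d}
    # The file name is arbitrary; modprobe reads any *.conf in the directory.
    printf 'blacklist %s\n' "$driver" > "$dir/blacklist-$driver.conf"
}
```

For example, run write_blacklist radeon as root and reboot; deleting the file again undoes the change.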

sos-michael commented 5 years ago

I did a "modprobe.blacklist=nouveau,amdgpu" and you're right, everything loaded as expected.

Sadly for me, this isn't really a solution, but I can't imagine this is the fault of the NVIDIA driver. The amdgpu driver doesn't support a modeset=0 switch (even though it does support nomodeset, go figure). So I am stuck.

NVIDIA / egl-wayland, issue #18