NVIDIA / egl-wayland

The EGLStream-based Wayland external platform
MIT License

WL Vulkan apps are broken with PRIME #72

Closed TheComputerGuy96 closed 7 months ago

TheComputerGuy96 commented 1 year ago

Hello,

This is sort of a continuation of #41 but for Vulkan apps/games

So Vulkan apps (like PPSSPP or vkcube) fail to work with Wayland on my PRIME setup:

$ prime-run vkcube-wayland 
Selected GPU 0: NVIDIA GeForce GTX 1650 Ti, type: DiscreteGpu 
[destroyed object]: error 7: failed to import supplied dmabufs: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a

As you can see, it's identical to the OpenGL error (which has already been fixed). I also checked the Wayland logs, and the (probably) NVIDIA modifier is present, so the linear modifier needs to be used somehow.

Running both PPSSPP and vkcube through XWayland removes the problem (by setting the SDL_VIDEODRIVER=x11 variable or using the X11 vkcube executable).

And now time for the all important system info 🐸 (although it's kinda redundant here):
Distro: Arch Linux
egl-wayland version: 1.1.11 (Git version also fails)
Mesa version: 22.2.1
Driver version: 515.76
Kernel version: 6.0.6
Compositor: mutter 43.0 (through an unofficial repo)
CPU: Ryzen 5 4600H
GPU: Renoir iGPU + GTX 1650 Ti Mobile (as I said a PRIME setup)

DavidWoli commented 11 months ago

@erik-kz stable release of the 545 driver still has this bug. Any plans to fix it?

erik-kz commented 11 months ago

Any plans to fix it?

yes

erik-kz commented 11 months ago

Also eglgears_wayland has a similar bug, it stops animating when you unfocus. But there launching multiple apps does nothing, and animation works when you are focused by default, so may not be directly related.

This is an unrelated issue, and it's not an NVIDIA bug. The problem is that eglgears_wayland calls poll on the Wayland socket without using wl_display_prepare_read / wl_display_read_events. See src/egl/eglut/wsi/wayland.c in the mesa-demos repo. This causes problems if there are other threads also trying to read from the socket.

@flukejones Regarding the wgpu issue, I actually can reproduce it, but I'm not sure if it's related to the hangs other users have reported. Interestingly, if I capture a stack trace it's a bit different than the one you posted. On my system, it doesn't hang in vkWaitForFences but instead just spins in the winit event loop after presenting the first frame. Another thing is that setting WINIT_UNIX_BACKEND=x11 doesn't seem to do anything for me, it still uses Wayland.

Otherwise, I spent a fair amount of time today trying to reproduce the vkcube-wayland hang on multiple machines, with different compositors, etc. but it continues to elude me.

vasishath commented 11 months ago

Otherwise, I spent a fair amount of time today trying to reproduce the vkcube-wayland hang on multiple machines, with different compositors, etc. but it continues to elude me.

Can you tell us the hardware/software specs of the systems you tried to reproduce this issue on? Maybe that way we can pinpoint the difference that is causing the issue.

kanashimia commented 11 months ago

Another thing is that setting WINIT_UNIX_BACKEND=x11 doesn't seem to do anything for me, it still uses Wayland.

They removed that environment variable in the newest winit release, it now detects backend based on DISPLAY / WAYLAND_DISPLAY environment variables.
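As a rough illustration of the detection described above (this is not winit's actual code, and the precedence of WAYLAND_DISPLAY over DISPLAY is an assumption), the following Python sketch picks a backend from the environment and shows how dropping WAYLAND_DISPLAY would push an auto-detecting application toward X11. The function names are hypothetical.

```python
def pick_backend(env):
    """Return 'wayland', 'x11', or None based on display-related variables."""
    # Assumption: a Wayland socket takes precedence over an X11 display.
    if env.get("WAYLAND_DISPLAY"):
        return "wayland"
    if env.get("DISPLAY"):
        return "x11"
    return None


def env_forcing_x11(env):
    """Return a copy of env without WAYLAND_DISPLAY, leaving DISPLAY untouched."""
    return {key: value for key, value in env.items() if key != "WAYLAND_DISPLAY"}
```

The result of `env_forcing_x11(os.environ)` could be passed as the `env` argument of `subprocess.run` to launch an application under XWayland for comparison.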

erik-kz commented 11 months ago

Can you tell us the hardware/software specs of the systems you tried to reproduce this issue on? Maybe that way we can pinpoint the difference that is causing the issue.

System 1 - Arch, RTX A3000 Mobile, Intel Core i7 11850H (PRIME workstation laptop)
System 2 - Ubuntu 23.10, RTX 2080, Intel Core i3 4150 (desktop, but configured as a PRIME system with iGPU driving the display)
System 3 - Arch, GTX 1080 Mobile, Intel Core i7 6700 (non-PRIME gaming laptop)

Compositors I tested were mutter, kwin, and sway.

One thought I had was that the GPUs in all of the above systems had a display engine (even the one in the PRIME laptop). Some of our mobile chips do not, however. Is this different from y'all's systems? An easy way to check is to look at the lspci output. If the GPU is display-capable it will say "VGA compatible controller" and if not "3D controller".

I don't know why that would matter, but at this point I'm clutching at straws.
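The lspci check described above can be sketched in Python. This is only an illustration: the function name is hypothetical, and it assumes the usual one-line-per-device lspci format in which the PCI slot comes first and the device class follows.

```python
def classify_nvidia_gpus(lspci_output):
    """Map each NVIDIA GPU's PCI slot to 'display-capable' or 'render-only'."""
    result = {}
    for line in lspci_output.splitlines():
        if "NVIDIA" not in line:
            continue
        slot = line.split(" ", 1)[0]
        if "VGA compatible controller" in line:
            result[slot] = "display-capable"
        elif "3D controller" in line:
            result[slot] = "render-only"
        # Other NVIDIA functions (e.g. the HDMI audio device) are skipped.
    return result
```

To run it against a live system, pass it the text captured from `subprocess.run(["lspci"], capture_output=True, text=True, check=True).stdout`.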

vasishath commented 11 months ago

Can you tell us the hardware/software specs of the systems you tried to reproduce this issue on? Maybe that way we can pinpoint the difference that is causing the issue.

System 1 - Arch, RTX A3000 Mobile, Intel Core i7 11850H (PRIME workstation laptop)
System 2 - Ubuntu 23.10, RTX 2080, Intel Core i3 4150 (desktop, but configured as a PRIME system with iGPU driving the display)
System 3 - Arch, GTX 1080 Mobile, Intel Core i7 6700 (non-PRIME gaming laptop)

Compositors I tested were mutter, kwin, and sway.

One thought I had was that the GPUs in all of the above systems had a display engine (even the one in the PRIME laptop). Some of our mobile chips do not, however. Is this different from y'all's systems? An easy way to check is to look at the lspci output. If the GPU is display-capable it will say "VGA compatible controller" and if not "3D controller".

I don't know why that would matter, but at this point I'm clutching at straws.

I know mine: it's an MX150 and is a 3D controller.

My complete specs are:
Arch Linux
Core i5 8250u + Intel UHD 620 driving the display
Nvidia MX 150

kanashimia commented 11 months ago

Mine are both VGA compatible, with specs as was mentioned before:
Intel i5-8300H [UHD 630] - display driver
NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] (rev a1)
OS: NixOS
Laptop: HP Pavilion Gaming 15-cx0045ur

vasishath commented 11 months ago

System 1 - Arch, RTX A3000 Mobile, Intel Core i7 11850H (PRIME workstation laptop)

How are the drivers on these systems installed? I am using nvidia-dkms package from the arch official repositories. Just in case if that makes any difference..

erik-kz commented 11 months ago

How are the drivers on these systems installed? I am using nvidia-dkms package from the arch official repositories. Just in case if that makes any difference..

I was using the ".run" file from the NVIDIA website. It looks like the official nvidia-dkms package has not yet been updated to 545. Did you mean the nvidia-beta one in the AUR? I just tried again with that (and nvidia-utils-beta) but it did not change anything.

vasishath commented 11 months ago

How are the drivers on these systems installed? I am using nvidia-dkms package from the arch official repositories. Just in case if that makes any difference..

I was using the ".run" file from the NVIDIA website. It looks like the official nvidia-dkms package has not yet been updated to 545. Did you mean the nvidia-beta one in the AUR? I just tried again with that (and nvidia-utils-beta) but it did not change anything.

Oh yea exactly.. nvidia-beta-dkms from AUR.. my bad. I really thought maybe there was some packaging issue that caused this issue.

Edit: I am using 6.5.8-zen kernel instead of the official one.

erik-kz commented 11 months ago

Straying from the main topic, but @kanashimia

It seems that udev rules fail to create devices in /dev because grep nvidia-frontend /proc/devices doesn't find anything. After replacing $$(grep nvidia-frontend /proc/devices | cut -d \  -f 1) with 195, those problems were solved and the driver launches like before.

Sorry for overlooking this initially, but I've confirmed with the kernel module folks on our team that the /proc/devices name has indeed been changed from nvidia-frontend to nvidia in 545. We apologize for not anticipating that this might break some workflows. They suggested something like grep "\<nvidia\>" /proc/modules as an alternative solution.
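For scripts that need the major number on both older and 545+ drivers, one option is to accept either name. The Python sketch below is an assumption-laden illustration rather than the approach suggested above: it parses /proc/devices text (not /proc/modules), compares the device name exactly so that entries such as nvidia-uvm are not matched, and uses a hypothetical function name.

```python
def nvidia_major(proc_devices_text, names=("nvidia", "nvidia-frontend")):
    """Return the major number of the first entry whose name matches exactly, else None."""
    for line in proc_devices_text.splitlines():
        fields = line.split()
        # Device lines look like "<major> <name>"; section headers and blanks do not.
        if len(fields) == 2 and fields[0].isdigit() and fields[1] in names:
            return int(fields[0])
    return None
```

A caller would typically feed it the contents of /proc/devices, for example `nvidia_major(open("/proc/devices").read())`, and fall back to an error message if it returns None.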

leiserfg commented 11 months ago

Okay, so as of 545.23.06, at least on KDE Wayland, vkcube-wayland no longer crashes but freezes immediately on startup. The cube is visible but spins very slowly, like 1 frame every 5 seconds.

Also, When I tried to run yuzu emulator on wayland with vulkan backend selected, I get the following error:

error marshalling arguments for get_surface_feedback (signature 4no): null value passed for arg 1
Error marshalling request: Invalid argument
The Wayland connection experienced a fatal error: Invalid argument 

KDE Plasma 5.27.8
Distro: Arch Linux
Kernel: 6.5.7-zen

I'm having the same issue with NVIDIA (no prime) in sway.

Sorry for the "me too" comment, I just want to make it clear that it is not because of PRIME.

kanashimia commented 11 months ago

@leiserfg can you share your system info? CPU, GPU, OS, laptop model (if it is one)?

leiserfg commented 11 months ago

it's a PC
CPU: AMD Ryzen 5 3600 (Does not have iGPU)
GPU: 1660 SUPER
OS: Nixos 23.05
desktop: sway

erik-kz commented 11 months ago

I think we should keep this discussion focused on the PRIME problem. The yuzu crash is tracked here https://github.com/yuzu-emu/yuzu/issues/11941

I am able to reproduce it with a debug build of the driver and will dig deeper next week. Anything I find out will be posted to the other issue I linked.

erik-kz commented 11 months ago

Back to the PRIME problem, I probably should have requested this earlier, but another thing that might help is running the nvidia-bug-report.sh script that is installed with the driver and uploading the file it generates here. Ideally immediately after reproducing the bug in case there are any relevant messages in the system log.

vasishath commented 10 months ago

I think we should keep this discussion focused on the PRIME problem. The yuzu crash is tracked here yuzu-emu/yuzu#11941

I am able to reproduce it with a debug build of the driver and will dig deeper next week. Anything I find out will be posted to the other issue I linked.

So I tried to use yuzu-cmd (which uses SDL) to start a game. The game starts, but it only displays the first frame and then freezes. If I run prime-run vkcube (on XWayland) in the background, then the yuzu game frames don't freeze.

One thing I did notice is that when the frames don't render, even the corresponding sound doesn't play, which means it's not that the frames are invisible on the screen; they are not being received by yuzu.

Other than yuzu, I tried 0ad on Wayland and it runs with an invisible window.

@erik-kz can you please tell what all module options you are using?

erik-kz commented 10 months ago

The only non-default module option I am using is "modeset=1" for nvidia-drm. As I said in my previous comment, uploading the file generated by running nvidia-bug-report.sh after reproducing the bug would be helpful.

For what it's worth, I was able to make some progress on the wgpu hang. I have a small driver change that does fix it, although I'm still trying to understand why it only seems to be necessary for that particular application. Also, I still don't know if that's related to the issues with other applications (which I haven't been able to reproduce).

vasishath commented 10 months ago

The only non-default module option I am using is "modeset=1" for nvidia-drm. As I said in my previous comment, uploading the file generated by running nvidia-bug-report.sh after reproducing the bug would be helpful.

For what it's worth, I was able to make some progress on the wgpu hang. I have a small driver change that does fix it, although I'm still trying to understand why it only seems to be necessary for that particular application. Also, I still don't know if that's related to the issues with other applications (which I haven't been able to reproduce).

Oh yes I had captured the bug report but forgot to attach it.. here it is nvidia-bug-report.log.gz

vasishath commented 10 months ago

For what it's worth, I was able to make some progress on the wgpu hang. I have a small driver change that does fix it, although I'm still trying to understand why it only seems to be necessary for that particular application. Also, I still don't know if that's related to the issues with other applications (which I haven't been able to reproduce).

I'd like to add here what I noticed during my testing. During the time I ran yuzu-cmd (SDL) on wayland (with vkcube xwayland running in background), I noticed that on switching to a different KDE virtual desktop, the game's background audio continued playing but the other sounds (SFX?) stopped playing. This is very similar to what happens when the vkcube on xwayland is not running in the background.

I don't know if I am right but I have a hunch that the bug is related to some power saving feature. The driver somehow doesn't know that the game window is actively being used and hence decides to stop sending frames to save power.

erik-kz commented 10 months ago

Oh yes I had captured the bug report but forgot to attach it.. here it is nvidia-bug-report.log.gz

Thanks!

I have a hunch that the bug is related to some power saving feature.

Yeah, that does seem plausible. Does connecting the laptop to a power supply vs. running on battery make a difference? When you only run vkcube-wayland, what P-state does nvidia-smi report (see attached screenshot)? Does this change when you run normal vkcube? Similarly, what does nvidia-smi -q -d CLOCK report in those two scenarios?

(Attached image: Screenshot_20231111_131036)
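To compare the two scenarios without reading screenshots, the P-state and clocks can be sampled programmatically. The Python sketch below assumes nvidia-smi is on the PATH and uses its `--query-gpu` interface with the `pstate`, `clocks.gr`, and `clocks.mem` fields; the function names and the dictionary keys are hypothetical.

```python
import subprocess

QUERY_FIELDS = "pstate,clocks.gr,clocks.mem"


def parse_gpu_state(csv_line):
    """Split one 'P8, 210 MHz, 405 MHz' style line into named fields."""
    pstate, graphics, memory = [field.strip() for field in csv_line.split(",")]
    return {"pstate": pstate, "graphics_clock": graphics, "memory_clock": memory}


def query_gpu_state():
    """Run nvidia-smi once and return one parsed entry per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_state(line) for line in out.splitlines() if line.strip()]
```

Calling `query_gpu_state()` once per second while vkcube-wayland is running, and again while the X11 vkcube is running, would give two series that can be compared directly.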

gdp2000 commented 10 months ago

Looks like I might have the same issue. I'm not able to run any native Vulkan app in Wayland. It's working on the integrated Intel.

CPU: 12th Gen Intel(R) Core(TM) i7-1270P
GPU: NVIDIA T550 Laptop GPU
RAM : 16GB
Driver: NVIDIA 545.29.02
OS: Arch Linux 6.6 with Wayland

Attached logs after yuzu crashed. I can add more logs after other applications crash, if necessary.

nvidia-bug-report.log.gz

erik-kz commented 10 months ago

From Vasishath's nvidia-bug-report.log.gz, the following snippet is interesting...

    GPU Power Readings
        Power Draw                        : N/A
        Current Power Limit               : Unknown Error
        Requested Power Limit             : Unknown Error
        Default Power Limit               : 5001.00 W
        Min Power Limit                   : 0.00 W
        Max Power Limit                   : 5001.00 W

Even on my GTX1080 system, which should have the same power management features (I think), I'm not seeing weird values like that.

gdp2000 commented 10 months ago

Looks like modeset in Nvidia DRM module was not enabled. I enabled it via systemd-boot options by adding nvidia_drm.modeset=1 and now everything looks good. Tested Vulkan in Wayland with yuzu and Retroarch without any issues. Should I enable fbdev=1 as well?
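Whether the option actually took effect can be confirmed after a reboot. The Python sketch below assumes the nvidia_drm module exposes its parameter under /sys/module/nvidia_drm/parameters/modeset with a value of Y or N, which is the usual sysfs layout for boolean module parameters; the function name is hypothetical.

```python
from pathlib import Path

MODESET_PARAM = Path("/sys/module/nvidia_drm/parameters/modeset")


def modeset_enabled(param_path=MODESET_PARAM):
    """Return True or False for the modeset parameter, or None if it cannot be found."""
    try:
        value = Path(param_path).read_text().strip()
    except FileNotFoundError:
        # The module is not loaded, or the parameter is not exposed.
        return None
    return value in ("Y", "1")
```

A result of None usually means the nvidia_drm module is not loaded at all, which is worth ruling out before looking at the kernel command line.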

vasishath commented 10 months ago

Does connecting the laptop to a power supply vs. running on battery make a difference?

No. I just tried it and it made no difference.

When you only run vkcube-wayland, what P-state does nvidia-smi report, Does this change when you run normal vkcube? Similarly, what does nvidia-smi -q -d CLOCK report in those two scenarios?

No differences here either. The P state with vkcube and vkcube-wayland first jump to P0, then to P3, then P5 and then finally back to P8. Similar pattern in clock frequencies.

vasishath commented 10 months ago

Yeah, that does seem plausible. Does connecting the laptop to a power supply vs. running on battery make a difference?

Just noticed this today.. If I enable the "Force maximum clocks" option in yuzu, the game frames freeze much less often when running without vkcube. But this option results in poor performance on budget GPUs. This is increasingly looking like a power management issue somewhere.

vasishath commented 10 months ago

Looks like modeset in Nvidia DRM module was not enabled. I enabled it via systemd-boot options by adding nvidia_drm.modeset=1 and now everything looks good. Tested Vulkan in Wayland with yuzu and Retroarch without any issues. Should I enable fbdev=1 as well?

So you don't have to run vkcube in the background in order to run the game? The game doesn't get stuck at 0 fps on your system?

Also, are you able to run 0ad on wayland? SDL_VIDEODRIVER=wayland prime-run 0ad For me, the game window doesn't even show up.

erik-kz commented 10 months ago

Does removing the intel_idle.max_cstate= option in your kernel command line change anything? Or perhaps the intel_iommu= and i915.enable_gvt= options?

gdp2000 commented 10 months ago

Looks like modeset in Nvidia DRM module was not enabled. I enabled it via systemd-boot options by adding nvidia_drm.modeset=1 and now everything looks good. Tested Vulkan in Wayland with yuzu and Retroarch without any issues. Should I enable fbdev=1 as well?

So you don't have to run vkcube in the background in order to run the game? The game doesn't get stuck at 0 fps on your system?

No need to run vkcube. Since I set modeset=1 in nvidia_drm all crashes are gone.

Also, are you able to run 0ad on wayland? SDL_VIDEODRIVER=wayland prime-run 0ad For me, the game window doesn't even show up.

I can run 0ad with the command specified, but the game window does not show up. I can only hear sound. Running it directly with 0ad works fine.

vasishath commented 10 months ago

Does removing the intel_idle.max_cstate= option in your kernel command line change anything? Or perhaps the intel_iommu= and i915.enable_gvt= options?

I removed all iommu, gvt-g related flags and max_cstate from command line but this hasn't changed anything. Btw, whenever I enable fbdev=1 in nvidia_drm module, I get these logs in dmesg:

kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
kernel: nvidia 0000:01:00.0: [drm] No compatible format found
kernel: nvidia 0000:01:00.0: [drm] Cannot find any crtc or sizes

Also, can this issue be on Intel side of things? For example, I have GuC/HuC submission enabled.

vasishath commented 10 months ago

No need to run vkcube. Since I set modeset=1 in nvidia_drm all crashes are gone.

and if you run prime-run vkcube-wayland, does the cube spin?

As for yuzu, for me the game starts, but the fps stays mostly at 0 unless I have the xwayland vkcube running in background. I have the "force maximum clocks" option set to off in yuzu as it causes poor performance for me. Do you have that option turned on?

gdp2000 commented 10 months ago

Yes, the cube spins. This is the console output:

[user@laptop ~]$ prime-run vkcube-wayland
Selected GPU 0: NVIDIA T550 Laptop GPU, type: DiscreteGpu

vasishath commented 10 months ago

@erik-kz I ran some more trial and error. What I have observed is that we need at least 2 Vulkan apps running for each one of them to not freeze, i.e., running two vkcube-wayland instances makes both of them spin smoothly without any issues, but the moment I close either of them, the other one freezes as well. Running an XWayland app is not needed, contrary to what I assumed earlier. Also, both apps must be running in windowed mode. If I switch either of them to fullscreen, both freeze again.

Also, both apps must be Vulkan, meaning that running an OpenGL app doesn't help, even if it is a resource-heavy one.

Yuzu, for me runs on OpenGL just fine without any background app needed. But eglgears_wayland doesn't spin until I move my mouse pointer, 0ad doesn't show up at all.. and unlike vulkan, they don't work with a background app running. But these work just fine on intel and previous nvidia driver.. I'm still confused if the bug is vulkan only or includes OpenGL.

Dirleye commented 10 months ago

@erik-kz I ran some more trial and error. What I have observed is that we need at least 2 Vulkan apps running for each one of them to not freeze, i.e., running two vkcube-wayland instances makes both of them spin smoothly without any issues, but the moment I close either of them, the other one freezes as well. Running an XWayland app is not needed, contrary to what I assumed earlier. Also, both apps must be running in windowed mode. If I switch either of them to fullscreen, both freeze again.

Also, both apps must be Vulkan, meaning that running an OpenGL app doesn't help, even if it is a resource-heavy one.

Yuzu, for me runs on OpenGL just fine without any background app needed. But eglgears_wayland doesn't spin until I move my mouse pointer, 0ad doesn't show up at all.. and unlike vulkan, they don't work with a background app running. But these work just fine on intel and previous nvidia driver.. I'm still confused if the bug is vulkan only or includes OpenGL.

I can replicate this. Yuzu (Wayland) or vkcube-wayland will run at 0-1fps (GPU stuck in power state p8 and min clocks) unless another Vulkan app is also running (I use vkcube-wayland for this). Closing one freezes the other unless the other was using XWayland, which always works fine by itself.

Apologies if this is already known and I've missed it, but it's definitely not a Prime issue as my system is a desktop without an iGPU.

erik-kz commented 10 months ago

@Dirleye thanks for the information! If you could please upload the file generated by our nvidia-bug-report.sh script that would be helpful. We're still trying to figure out why only certain systems seem to be experiencing this bug, so the more data we have, the better.

Dirleye commented 10 months ago

@erik-kz of course, no problem. Attached is the log generated with nvidia-bug-report.sh after turning on the system, using startx as root in a different tty to apply an overclock (though the bug isn't affected by this either way), running vkcube-wayland for about two minutes and then generating the report.

I occasionally checked nvidia-smi to see the clock speeds and power state which were all glued to their lowest throughout. Vkcube-wayland's window was permanently marked as "not responding", though the cube was spinning at full speed. It would freeze for a few seconds each time focus was swapped between it and the terminal.

nvidia-bug-report.log.gz

kanashimia commented 10 months ago

Here's mine: nvidia-bug-report.log.gz

erik-kz commented 10 months ago

Omg, I finally managed to reproduce the vkcube-wayland hang with a different GPU (Quadro P620). Not exactly sure what the cause is yet, but at least now it's possible to debug. What does seem immediately clear is that it's not a power management issue; it actually looks like it's related to a new synchronization mechanism that was introduced in 545. I shall update with further progress. Thanks so much to everyone who provided logs, etc... that definitely helped narrow down the problem.

jrelvas-ipc commented 10 months ago

This appears to be fixed with the 545.29.06 driver release! [image]

Here's Half-Life 2, running on Wayland with Vulkan! [image]

Dirleye commented 10 months ago

Unfortunately still broken for me.

erik-kz commented 10 months ago

A quick update - we have figured out what is causing the issue. It did turn out to be a driver bug affecting pre-Turing GPUs. The fix is targeted for the next driver release, 550, early next year.

oscarbg commented 10 months ago

@erik-kz does it mean that post Turing GPUs are fixed now in 545.29.06 and don't need a future 550 driver?

erik-kz commented 10 months ago

does it mean that post Turing GPUs are fixed now in 545.29.06 and don't need a future 550 driver?

Vulkan Wayland applications should be working correctly with 545.29.06 on Turing-or-later GPUs. Including PRIME render-offload.

The issue I was referring to in my previous comment was the extremely low framerates (0.2FPS) that several users had reported. All of those users had Pascal GPUs.

vasishath commented 10 months ago

Can you share some technical details about what exactly the issue was and any workaround (other than running a background app) for the time being?

On Wed, 6 Dec, 2023, 04:54 Erik Kurzinger, @.***> wrote:

does it mean that post Turing GPUs are fixed now in 545.29.06 and don't need a future 550 driver?

Vulkan Wayland applications should be working correctly with 545.29.06 on Turing-or-later GPUs. Including PRIME render-offload.

The issue I was referring to in my previous comment was the extremely low framerates (0.2FPS) that several users had reported. All of those users had Pascal GPUs.


erik-kz commented 10 months ago

The 545 driver was the first version to include support for sync_files, https://www.kernel.org/doc/Documentation/sync_file.txt, a new synchronization mechanism. The bug was in our implementation of that feature. 545 also included a fairly extensive re-write of the Vulkan Wayland WSI code, and part of that made use of the new sync_file functionality. That's why Vulkan Wayland apps were affected by the bug.

A possible work-around would be to extract the driver installer and edit the file nvidia-drm-drv.c. In the nv_drm_get_dev_info_ioctl function delete the following block

#if defined(NV_SYNC_FILE_GET_FENCE_PRESENT)
            params->supports_sync_fd = true;
#endif /* defined(NV_SYNC_FILE_GET_FENCE_PRESENT) */

This will disable sync_file support

kanashimia commented 10 months ago

A possible work-around would be to extract the driver installer and edit the file nvidia-drm-drv.c. In the nv_drm_get_dev_info_ioctl function delete the following block

Actually, I can confirm that the workaround works, but why delete the whole block? It seems that deleting code inside the macro is enough.

Here is a patch for NixOS users:

hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.stable.overrideAttrs (old: {
  postPatch = ''
    substituteInPlace ./kernel/nvidia-drm/nvidia-drm-drv.c --replace \
      '#if defined(NV_SYNC_FILE_GET_FENCE_PRESENT)' \
      '#if 0'
  '';
});

erik-kz commented 10 months ago

It seems that deleting code inside the macro is enough.

Yeah, that's true.

Also, I must ask that anyone who uses this work-around please promise to revert it once 550 is released. In the future more things will depend on sync_file support and so having it disabled will almost certainly cause problems.
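For users not on NixOS, the same substitution as the Nix patch quoted earlier in the thread could be applied with a few lines of Python. This is a minimal sketch, not an official procedure: the function name is hypothetical, and it assumes the guard string appears in the extracted source exactly as quoted in this thread.

```python
GUARD = "#if defined(NV_SYNC_FILE_GET_FENCE_PRESENT)"


def disable_sync_fd(source_text):
    """Replace every occurrence of the sync_file guard with '#if 0'."""
    if GUARD not in source_text:
        # Refuse to continue silently if the driver source no longer matches.
        raise ValueError("guard not found; the driver source may have changed")
    return source_text.replace(GUARD, "#if 0")
```

The function works on text only; a caller would read kernel/nvidia-drm/nvidia-drm-drv.c from the extracted installer (the path used in the Nix patch), pass the contents through `disable_sync_fd`, and write the result back before building the module.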

Vincent392 commented 9 months ago

This appears to be fixed with the 545.29.06 driver release! [image]

Here's Half-Life 2, running on Wayland with Vulkan! [image]

Well, I'm going to have to test Portal. When I can.

edit 13:13 GMT: Just to be safe that other games work too, I'll check Skyrim SE via Proton and Half-Life: Blue Shift.

kanashimia commented 7 months ago

I tested the NVIDIA beta driver 550.40.07, and vkcube-wayland now works correctly without any patches. I think this issue can now be closed.