CuarzoSoftware / SRM

Simple Rendering Manager
MIT License

Segfault in SRM when launching an example Louvre compositor #12

Closed JaanDev closed 4 months ago

JaanDev commented 4 months ago

Launching any of louvre-default, louvre-views or the example compositor segfaults. I have a dedicated NVIDIA GPU with proprietary drivers and an integrated Intel GPU on EndeavourOS. When debugging an example project with gdb, it shows a segfault here.

Please let me know if I need to provide any additional info.

Thanks!

JaanDev commented 4 months ago

Here is a coredump:

(gdb) where
#0  dri_get_egl_image () at ../mesa-24.0.5/src/gallium/frontends/dri/dri_screen.c:765
#1  0x00007fa225cf4642 in st_get_egl_image () at ../mesa-24.0.5/src/mesa/state_tracker/st_cb_eglimage.c:243
#2  0x00007fa225cd4a30 in egl_image_target_texture () at ../mesa-24.0.5/src/mesa/main/teximage.c:3579
#3  0x00007fa23359c4d0 in srmBufferGetTextureID (device=0x5a5dc4f60d00, buffer=0x7fa1f002b0b0) at ../SRM-0.5.5-1/src/lib/SRMBuffer.c:455
#4  0x00007fa2335a35fa in createDRMFramebuffers (connector=0x5a5dc504ec20) at ../SRM-0.5.5-1/src/lib/private/modes/SRMRenderModeDumb.c:450
#5  initialize (connector=0x5a5dc504ec20) at ../SRM-0.5.5-1/src/lib/private/modes/SRMRenderModeDumb.c:720
#6  0x00007fa2335a3e0e in srmConnectorRenderThread (conn=0x5a5dc504ec20) at ../SRM-0.5.5-1/src/lib/private/SRMConnectorPrivate.c:492
#7  0x00007fa233ea955a in start_thread (arg=<optimized out>) at pthread_create.c:447
#8  0x00007fa233f26a3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

coredump.zip

ehopperdietzel commented 4 months ago

Hi, if it crashes when calling glEGLImageTargetTexture2DOES, it could be a driver issue, as discussed here.

Running it with SRM_FORCE_GL_ALLOCATION=1 might fix it.

If not, using nouveau is another option.

ehopperdietzel commented 4 months ago

Maybe you need to set this driver parameter as well.

By the way, do other Wayland compositors work with your Nvidia card?

ehopperdietzel commented 4 months ago

I've added some GL and EGL extension checks to the devel branch. Could you verify if that resolves the issue?

JaanDev commented 4 months ago

Hi! Thanks for the reply!

Running it with SRM_FORCE_GL_ALLOCATION=1 (on the main branch) didn't fix the issue. I have the nvidia-drm.modeset=1 kernel parameter set in GRUB.

> By the way, do other Wayland compositors work with your Nvidia card?

Well, not quite. I've already tried many different Wayland compositors (GNOME, KDE, Cinnamon, wlroots-based, Smithay-based) and they work great on my laptop's internal monitor, but my external monitor either flickers or is not smooth at all (compared to Windows). This is especially noticeable in games: they don't exceed ~30 FPS on my external monitor but work as expected on the internal one. On Xorg, however, the external monitor works as well as the internal one. This is one of the reasons I decided to try Louvre.

After trying the devel branch, it indeed doesn't segfault anymore! The other Louvre examples work as well.

Now the problem is that when I move the cursor, the compositor freezes entirely for a few seconds, but I guess that is not an SRM issue anymore =) Should I open an issue in Louvre about this?

ehopperdietzel commented 4 months ago

Great! Thanks to you! The hardware cursor plane issue seems to be specific to the nvidia-drm driver; it's the only one where I've seen it. I've just disabled the hardware cursor for Nvidia by default in the devel branch and also added the SRM_NVIDIA_CURSOR environment variable to enable it if desired. Could you see if that fixes it?

P.S. You can enable triple buffering for better performance with SRM_RENDER_MODE_ITSELF_FB_COUNT=3 (Intel) and SRM_RENDER_MODE_DUMB_FB_COUNT=3 (Nvidia).

ehopperdietzel commented 4 months ago

One last thing: with SRM_FORCE_LEGACY_API=0, I get 60 FPS with my Nvidia card displaying the TestUFO in fullscreen mode, and 40 FPS with SRM_FORCE_LEGACY_API=1. Perhaps you'll see similar results.

SRM_FORCE_LEGACY_API=1 is the default in Louvre.
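
The variables suggested in this thread can also be given defaults from a compositor's own startup code instead of the command line. The following is a minimal C sketch, assuming a POSIX system and assuming SRM reads these variables when it is initialised; `set_env_default` and `apply_srm_defaults` are hypothetical helper names, not part of SRM or Louvre, and the values are simply the ones proposed above.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of SRM or Louvre): give an environment
 * variable a default value without clobbering what the user exported. */
static int set_env_default(const char *name, const char *value)
{
    /* overwrite = 0: an already-set variable keeps its current value */
    return setenv(name, value, 0);
}

/* Assumption: call this before the compositor initialises SRM. */
static void apply_srm_defaults(void)
{
    set_env_default("SRM_FORCE_LEGACY_API", "0");
    set_env_default("SRM_RENDER_MODE_ITSELF_FB_COUNT", "3");
    set_env_default("SRM_RENDER_MODE_DUMB_FB_COUNT", "3");
}
```

Because `setenv` is called with `overwrite` set to 0, a value exported in the shell (for example SRM_FORCE_LEGACY_API=1) still takes precedence over these defaults.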

JaanDev commented 4 months ago

> Could you see if that fixes it?

Yes, it does. All Louvre examples are usable now!

When I don't specify SRM_ALLOCATOR_DEVICE, or set SRM_ALLOCATOR_DEVICE=/dev/dri/card0 (the Intel one), louvre-views has high CPU usage on the UFO test and shows at most 80 FPS on my external monitor but ~300 on the internal one (without V-Sync). Can I fix this somehow, or is it intended? When I set SRM_ALLOCATOR_DEVICE=/dev/dri/card1 (the Nvidia GPU), it doesn't launch at all: my system hangs and the only way to recover is a forced shutdown. SRM_RENDER_MODE_ITSELF_FB_COUNT=3, SRM_RENDER_MODE_DUMB_FB_COUNT=3 and SRM_FORCE_LEGACY_API=0 are also set, by the way.

ehopperdietzel commented 4 months ago

SRM always prefers the integrated GPU for buffer allocation because it's closer to the CPU, resulting in faster transfers.

> Louvre-views exhibits high CPU usage.

For reasons I don't fully understand, textures can almost never be shared across GPUs with DMA, even if both GPUs announce support for the same formats/modifiers. This means that you can't, for example, simply drag a window from a display connected to one GPU to a display connected to another GPU. In this case, the Intel GPU handles all the rendering for the displays connected to your Nvidia GPU, and the result is copied into Nvidia dumb buffers, a special type of buffer that can be scanned out directly. That copy is the reason for the high CPU usage.
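
To illustrate why copying every frame into a dumb buffer is costly, here is a minimal C sketch of a row-by-row CPU copy between two buffers with different strides. It is not SRM's code, and SRM's actual copy path may differ; `copy_frame` and its parameters are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration (not SRM code): copy a rendered frame into a
 * CPU-mapped scan-out buffer whose rows may be padded to a different stride. */
static void copy_frame(uint8_t *dst, size_t dst_stride,
                       const uint8_t *src, size_t src_stride,
                       size_t row_bytes, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        /* Every visible byte of every row passes through the CPU. */
        memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
    }
}
```

As a rough estimate, assuming a 1920x1080 output at 4 bytes per pixel and 60 frames per second, such a copy would move about 1920 * 1080 * 4 * 60 ≈ 498 MB every second.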

> But ~300 on the internal (without vsync).

Assuming your display has a refresh rate of 150 Hz (or close), when V-Sync is disabled, it is limited to twice the original refresh rate, but you can change that value using LOutput::setRefreshRateLimit().

> But when I add SRM_ALLOCATOR_DEVICE=/dev/dri/card1 (which is the Nvidia GPU), it doesn't launch at all. My system hangs, and the only way to fix it is to force shutdown.

Well, that must just be Nvidia... In my setup, it works if I use the Nvidia GPU as the allocator, but performance is terrible, so you're not missing much.

I will release SRM v0.5.6 shortly with all these fixes, and I'll also add an environment variable to blacklist devices, for example SRM_DEVICES_BLACKLIST=/dev/dri/card0:/dev/dri/card1. That way you could try enabling only the Nvidia GPU to see if it works alone.

My final suggestion, though, is to use nouveau; in my experience, it runs much more smoothly.
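
The colon-separated format of the proposed variable can be checked with a few lines of C. This is only a sketch of how such a list might be matched, not SRM's implementation; `device_in_list` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not SRM's implementation): return true if `path`
 * is one entry of a colon-separated list such as
 * "/dev/dri/card0:/dev/dri/card1". */
static bool device_in_list(const char *list, const char *path)
{
    if (!list || !path)
        return false;

    size_t path_len = strlen(path);
    if (path_len == 0)
        return false;

    const char *entry = list;
    while (*entry) {
        /* Length of the current entry, up to the next ':' or end of string. */
        size_t entry_len = strcspn(entry, ":");

        /* Compare whole entries so "card1" does not match "card10". */
        if (entry_len == path_len && strncmp(entry, path, path_len) == 0)
            return true;

        entry += entry_len;
        if (*entry == ':')
            entry++; /* skip the separator */
    }
    return false;
}
```

With this helper, `device_in_list(getenv("SRM_DEVICES_BLACKLIST"), "/dev/dri/card0")` would report whether the Intel device from this thread is listed; `getenv` requires `<stdlib.h>` and returns NULL when the variable is unset, which the helper treats as an empty list.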

JaanDev commented 4 months ago

Alright, thank you for such a detailed response!