Open Friz64 opened 5 years ago
maybe also related: https://github.com/cloudhead/rx/issues/1
I just locally ported the quad example from gfx master to glfw and reproduced the segfault issue. As expected, it only happens with PresentMode Fifo.
Could you post a stack trace from a debug build?
This backtrace is from my test repo
(gdb) backtrace
#0 unlink_chunk (p=p@entry=0x555555eac2f0, av=0x7ffff7e05c40 <main_arena>) at malloc.c:1469
#1 0x00007ffff7cb62c7 in _int_free (av=0x7ffff7e05c40 <main_arena>, p=0x555555eac2f0, have_lock=<optimized out>) at malloc.c:4341
#2 0x00007ffff7e856b0 in _XFreeDisplayStructure () from /lib/x86_64-linux-gnu/libX11.so.6
#3 0x00007ffff7e72c4f in XCloseDisplay () from /lib/x86_64-linux-gnu/libX11.so.6
#4 0x0000555555bf3b50 in _glfwPlatformTerminate ()
at /home/friz64/.cargo/registry/src/github.com-1ecc6299db9ec823/glfw-sys-3.3.0/src/x11_init.c:1028
#5 0x0000555555bebdce in terminate () at /home/friz64/.cargo/registry/src/github.com-1ecc6299db9ec823/glfw-sys-3.3.0/src/init.c:91
#6 0x0000555555bec596 in glfwTerminate ()
at /home/friz64/.cargo/registry/src/github.com-1ecc6299db9ec823/glfw-sys-3.3.0/src/init.c:272
#7 0x0000555555bc5a1d in glfw::init::glfw_terminate ()
at /home/friz64/.cargo/registry/src/github.com-1ecc6299db9ec823/glfw-0.32.0/src/lib.rs:706
#8 0x00007ffff7c682ac in __run_exit_handlers (status=0, listp=0x7ffff7e05718 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#9 0x00007ffff7c683da in __GI_exit (status=<optimized out>) at exit.c:139
#10 0x00007ffff7c47b72 in __libc_start_main (main=0x555555621550 <main>, argc=1, argv=0x7fffffffddd8, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffddc8) at ../csu/libc-start.c:342
#11 0x000055555561b95a in _start ()
I have some great news on this! First off, this issue is not specific to GLFW, as I have also been able to reproduce it with SDL2.
Here's the actual issue and how to resolve it: https://github.com/KhronosGroup/Vulkan-LoaderAndValidationLayers/issues/1894
However, I'm afraid I am not familiar enough with wgpu to implement this fix.
@Friz64 thank you! Sounds like it would go into gfx-backend-vulkan then, under the xlib feature.
Going to close this as gfx has changed a lot since this issue. If it's still the case on wgpu master, feel free to re-open.
I don't remember the full ins and outs of this issue, but I think it's perfectly fine to close (especially because there have been no other reports on this in this issue). I just tried both the hello-triangle example and (quickly updated) GLFW code and both seem to work flawlessly!
Thank you for this maintenance work btw :+1:
With a fresh build of wgpu-native's triangle example I can reproduce this segfault with the following call stack on Pop OS 21.04 (Ubuntu): https://gist.github.com/radgeRayden/b35e11873c5538ae0961206cad366634
I can reproduce it again (on wgpu-native's triangle example, about 20% of the time, Mesa/AMD on Arch Linux)! The call stack references line 267, which contains the call to glfwTerminate().
I'm pretty sure this same, or at least closely related, previously mentioned bug was also present in the Vulkan Cube Demos. The official fix looked like this: https://github.com/KhronosGroup/Vulkan-LoaderAndValidationLayers/commit/0017308648b6bf8eef10ef0ffb9470576c0c2e9e.
In the cube demos, destroy the instance after closing the display system connection. It is possible for the driver to register callback functions with a library like Xlib. If the driver is unloaded when Xlib calls those callback functions, a segfault results.
The bug is explained in more detail in https://github.com/KhronosGroup/Vulkan-LoaderAndValidationLayers/issues/1894#issuecomment-309832783.
http://www.xfree86.org/4.7.0/DRI11.html suggests that the driver (GL there, but Vulkan here) can register a callback with Xlib. When the application calls XCloseDisplay, this callback is called and will segfault if the driver had already been unloaded, which could happen when the Vulkan instance is destroyed. The fix is to destroy the instance after cleaning up the display connection.
So, without the fix, this happened:
1. vkDestroySurfaceKHR
2. vkDestroyInstance -> Driver unloads and the registered callback functions would now access freed memory (if I understand correctly?).
3. glfwTerminate / XDestroyWindow -> Xlib calls the callback functions (if I understand correctly?).
The fix switches steps 2 and 3.
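The corrected ordering can be illustrated with a minimal Rust sketch. All types here (`Surface`, `Display`, `Instance`, `App`) are hypothetical stand-ins that only log their teardown; they are not wgpu, GLFW, or Vulkan types. The sketch relies on the fact that Rust drops struct fields in declaration order, so the field order can encode "surface first, display connection second, instance last".

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log recording the order in which resources are torn down.
type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical stand-in for the surface (vkDestroySurfaceKHR on drop).
struct Surface(Log);
/// Hypothetical stand-in for the X11 connection (XCloseDisplay on drop).
struct Display(Log);
/// Hypothetical stand-in for the Vulkan instance (vkDestroyInstance on drop).
struct Instance(Log);

impl Drop for Surface {
    fn drop(&mut self) {
        self.0.borrow_mut().push("destroy surface");
    }
}

impl Drop for Display {
    fn drop(&mut self) {
        self.0.borrow_mut().push("close display");
    }
}

impl Drop for Instance {
    fn drop(&mut self) {
        self.0.borrow_mut().push("destroy instance");
    }
}

/// Fields are dropped in declaration order, so the instance (and with it
/// the driver, in the real scenario) outlives the display connection.
#[allow(dead_code)]
struct App {
    surface: Surface,
    display: Display,
    instance: Instance,
}

/// Builds an `App`, drops it, and returns the recorded teardown order.
fn teardown_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let app = App {
        surface: Surface(Rc::clone(&log)),
        display: Display(Rc::clone(&log)),
        instance: Instance(Rc::clone(&log)),
    };
    drop(app);
    let order = log.borrow().clone();
    order
}
```

Calling `teardown_order()` returns the steps in the order the fix prescribes. Whether the same ordering is what goes wrong inside wgpu is exactly the open question in this thread, so this is only a picture of the intended sequence.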
Now, the part I don't understand is... does the wgpu-native triangle example ever do something that causes the driver to unload? Nothing stands out to me. So is this even the same bug? Need some help here, not an expert on these things.
Another thing, regarding the present mode... still can only reproduce it with Fifo? This is something to consider too.
Filed https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/3262 to follow-up
The trail leads to https://gitlab.freedesktop.org/xorg/lib/libxext/-/issues/3
Note that's the only known issue when using current NV drivers. There may be other unrelated or related bugs in other drivers, or in older NV drivers, as it took a while for us to work all these teardown issues out of our stack.
Any updates on this? I too sometimes get segfaults when closing a window/application depending on the tide and the phase of the moon. I'm using wgpu in combination with iced and baseview (a windowing library specialized around embedding windows inside of other windows for plugin GUIs). Updating iced_baseview from the released iced version to the current master branch version solved the described issue for me with the included examples, but I'm still getting these same segfaults when closing my own embedded window in certain but not all hosts, and it also doesn't happen every time. I'm using wgpu 0.12.0 with the 510.54 NVIDIA drivers. Running the application under Carla, GDB now gives an even more useful backtrace (which is a good sign that there indeed is some nasty memory corruption going on somewhere):
lldb usually gives a more normal backtrace pointing to XCloseDisplay:
But sometimes lldb points to vkDestroyInstance:
I'm not really sure where to go from here.
As yet another data point, leaking the Surface (so drop() never gets called) also works around the segfault. Not an actual solution of course, but it may help track down the issue. Also, in combination with baseview, it only seems to happen when using RawWindowHandle::Xlib/RawDisplayHandle::Xlib. Changing the HasRawWindowHandle/HasRawDisplayHandle instances to use RawWindowHandle::Xcb/RawDisplayHandle::Xcb instead seems to get rid of these segfaults.
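For anyone wanting to try the leak workaround as a diagnostic, here is a minimal sketch of what "leaking so drop() never gets called" means in Rust. `FakeSurface` is a hypothetical stand-in, not the wgpu `Surface`; the only real API used is `std::mem::forget`, which takes ownership of a value without running its destructor.

```rust
use std::mem;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a surface; counts how often its destructor runs.
struct FakeSurface {
    drops: Arc<AtomicUsize>,
}

impl Drop for FakeSurface {
    fn drop(&mut self) {
        self.drops.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns how many times the destructor ran when the value is either
/// dropped normally (`leak == false`) or leaked (`leak == true`).
fn destructor_runs(leak: bool) -> usize {
    let drops = Arc::new(AtomicUsize::new(0));
    let surface = FakeSurface {
        drops: Arc::clone(&drops),
    };
    if leak {
        // The destructor is skipped; the resource is never released.
        mem::forget(surface);
    } else {
        drop(surface);
    }
    drops.load(Ordering::SeqCst)
}
```

With a real surface the same `mem::forget` call would skip whatever teardown its `Drop` implementation performs, which is why it hides the crash but is not an actual solution.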
I can confirm that I am also seeing this issue whilst using baseview with the 535 Nvidia drivers and wgpu 0.18.
XCloseDisplay most of the time produces a segfault. This can be reproduced with the triangle example and with a personal test repo that's using wgpu-rs I set up. Changing PresentMode to NoVsync fixes the segfault. The segfault only happens on Intel/AMD Mesa and apparently not on Nvidia Proprietary.
One tester using the Nvidia Proprietary driver also experienced fluctuating FPS with my test repo. LOG: https://pastebin.com/raw/UzRqcPpz (note the first FPS print is inaccurate) (Unrelated)