Can go into infinite recursion when combined with libstrangle

smcv commented 3 years ago

While investigating ValveSoftware/steam-runtime#443, I tried running Artifact (a Vulkan game) with the libstrangle frame-rate-limiter module.

One of the failure modes I'm seeing with that combination looks like either a bug in the Fossilize Vulkan layer, or a bug in some other component that's made worse by how Fossilize's VK_LAYER_fossilize_GetDeviceProcAddr behaves.

Steps to reproduce

Install libstrangle. I used a Debian 11 machine with NVIDIA proprietary graphics, and installed libstrangle into /usr/local with the upstream makefile (make clean && make && sudo make install).

In Steam, set the launch options for a Vulkan game to DISABLE_VK_LAYER_VALVE_steam_overlay_1=1 stranglevk -f 3 %command% (I'm disabling the Steam overlay to reduce the number of moving parts; I get a different segfault if I leave both that and Fossilize enabled).

Set the game to launch in the "Steam Linux Runtime" compatibility tool.

Select the client_beta branch of both "Steam Linux Runtime" and "Steam Linux Runtime - Soldier" (they should be installed automatically; you just need to switch branch). steamapps/common/SteamLinuxRuntime_soldier/VERSIONS.txt needs to say pressure-vessel 0.20210809.1 or later, which is currently only in the beta; otherwise libstrangle accidentally gets disabled and you won't see the crash.

Launch the game.

Expected result

The game runs, throttled to approximately 3fps.

Actual result

Segmentation fault.

Steps to reproduce, with debugging

As above, but run Steam with PRESSURE_VESSEL_SHELL=instead in the environment. Now, when you launch a game in the container runtime, instead of the actual game you'll get an xterm, in which you can run "$@" to get the actual game.

Instead of "$@", run:

prlimit -s100000 \
gdbserver localhost:12345 \
/usr/libexec/steam-runtime-tools-0/x86_64-linux-gnu-check-vulkan --visible

Ideally, before running gdb, set the DEBUGINFOD_URLS environment variable to a space-separated list of debuginfod servers: one that provides detached debug symbols for your host OS (e.g. https://debuginfod.debian.net for Debian), and one that serves the contents of com.valvesoftware.SteamRuntime.Sdk-amd64,i386-soldier-debug.tar.gz from soldier.

On the host system, run gdb -ex 'target remote localhost:12345' to connect a remote debugger to the process in the container, and type cont to continue.

Stack trace

The crash seems to be infinite recursion in VK_LAYER_fossilize_GetDeviceProcAddr(), leading to a segfault when stack space runs out. The prlimit -s100000 in my reproducer is to make that happen sooner, so that there are "only" 1750 or so stack frames, rather than tens of thousands.

#0  0x00007ffff7b9829a in vkGetDeviceProcAddr (pName=0x7ffff7baa747 "vkQueueSubmit", device=0x5555558d8e48) at ./loader/trampoline.c:91
#1  vkGetDeviceProcAddr (device=0x5555558d8e48, pName=<optimized out>) at ./vulkan-headers/include/vulkan/vulkan_core.h:3330
#2  0x00007fffeb54e69e in VK_LAYER_fossilize_GetDeviceProcAddr ()
   from target:/usr/lib/pressure-vessel/overrides/lib/x86_64-linux-gnu/vulkan_imp_layer/libVkLayer_steam_fossilize.so
#3  0x00007fffeb54e69e in VK_LAYER_fossilize_GetDeviceProcAddr ()
   from target:/usr/lib/pressure-vessel/overrides/lib/x86_64-linux-gnu/vulkan_imp_layer/libVkLayer_steam_fossilize.so
...
#1755 0x00007fffeb54e69e in VK_LAYER_fossilize_GetDeviceProcAddr () from target:/usr/lib/pressure-vessel/overrides/lib/x86_64-linux-gnu/vulkan_imp_layer/libVkLayer_steam_fossilize.so
#1756 0x00007fffeb54e69e in VK_LAYER_fossilize_GetDeviceProcAddr () from target:/usr/lib/pressure-vessel/overrides/lib/x86_64-linux-gnu/vulkan_imp_layer/libVkLayer_steam_fossilize.so
#1757 0x00007ffff7b7f363 in loader_init_device_dispatch_table (dev_table=dev_table@entry=0x555555814cb0, gpa=gpa@entry=0x7fffeb54e600 <VK_LAYER_fossilize_GetDeviceProcAddr>, dev=0x5555558d8e48) at ./loader/generated/vk_loader_extensions.c:318
#1758 0x00007ffff7b9272e in loader_create_device_chain (pd=pd@entry=0x5555557f5c40, pCreateInfo=pCreateInfo@entry=0x7fffffffbdd0, pAllocator=pAllocator@entry=0x0, inst=inst@entry=0x55555558e450, dev=dev@entry=0x555555814cb0, callingLayer=callingLayer@entry=0x0, layerNextGDPA=0x0) at ./loader/loader.c:6291
#1759 0x00007ffff7b93559 in loader_layer_create_device (instance=instance@entry=0x0, physicalDevice=physicalDevice@entry=0x5555557f5de0, pCreateInfo=pCreateInfo@entry=0x7fffffffbdd0, pAllocator=pAllocator@entry=0x0, pDevice=pDevice@entry=0x7fffffffbe40, layerGIPA=layerGIPA@entry=0x0, nextGDPA=0x0) at ./loader/loader.c:5838
#1760 0x00007ffff7b9671f in vkCreateDevice (physicalDevice=0x5555557f5de0, pCreateInfo=0x7fffffffbdd0, pAllocator=0x0, pDevice=0x7fffffffbe40) at ./loader/trampoline.c:779
#1761 0x0000555555558949 in ?? ()
#1762 0x0000555555557902 in ?? ()
#1763 0x00007ffff7999d0a in __libc_start_main (main=0x555555557700, argc=1, argv=0x7fffffffc288, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffc278) at ../csu/libc-start.c:308

smcv commented 3 years ago

I think what is happening here is that as a result of 6916486d, if VK_LAYER_fossilize_GetDeviceProcAddr() somehow sees layer->getTable()->GetDeviceProcAddr == VK_LAYER_fossilize_GetDeviceProcAddr, it will call into itself until it runs out of stack space.
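
To make the suspected failure mode concrete, here is a minimal toy model of it. This is not Fossilize's code: the types, the global table, and names such as toy_layer_gdpa and lookup_overflows are hypothetical stand-ins, and a depth counter takes the place of real stack exhaustion so that the example terminates instead of segfaulting.

```cpp
#include <cassert>
#include <cstring>

// Toy stand-ins for the Vulkan function-pointer types (assumptions, not real headers).
using PFN_void = void (*)();
using PFN_GetProcAddr = PFN_void (*)(const char *pName);

// Hypothetical per-device dispatch table: this slot is supposed to hold the
// NEXT layer's GetDeviceProcAddr, never the current layer's own.
struct ToyDispatchTable {
    PFN_GetProcAddr GetDeviceProcAddr = nullptr;
};

static ToyDispatchTable g_table;
static int g_depth = 0;
static bool g_overflowed = false;
static constexpr int kMaxDepth = 1000; // stands in for "ran out of stack"

// A well-behaved downstream layer that knows one command.
static void next_layer_queue_submit() {}
static PFN_void next_layer_gdpa(const char *pName) {
    if (std::strcmp(pName, "vkQueueSubmit") == 0)
        return &next_layer_queue_submit;
    return nullptr;
}

// The layer under discussion, reduced to "forward the lookup down the chain".
static PFN_void toy_layer_gdpa(const char *pName) {
    if (++g_depth > kMaxDepth) { // a real process would segfault around here
        g_overflowed = true;
        --g_depth;
        return nullptr;
    }
    PFN_void result = g_table.GetDeviceProcAddr(pName);
    --g_depth;
    return result;
}

// Installs `next` as the table entry, performs one lookup, and reports
// whether the lookup recursed past the depth limit.
bool lookup_overflows(PFN_GetProcAddr next) {
    assert(next != nullptr);
    g_table.GetDeviceProcAddr = next;
    g_depth = 0;
    g_overflowed = false;
    toy_layer_gdpa("vkQueueSubmit");
    return g_overflowed;
}
```

With next_layer_gdpa installed, the lookup resolves after a single call. With toy_layer_gdpa installed as its own "next" entry, every call forwards to itself, which matches the shape of the repeated VK_LAYER_fossilize_GetDeviceProcAddr frames in the trace. How the table could end up in that state is exactly what this issue leaves open.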

Perhaps GetDeviceProcAddr needs to be exempt from the check added in 6916486d, so that if pName is "vkGetDeviceProcAddr", it short-circuits to returning VK_LAYER_fossilize_GetDeviceProcAddr immediately? (As though interceptCoreDeviceCommand had been called first, as it was before 6916486d - but it would likely be easier done as a special-case.)

It's entirely possible that one of the other layers involved is doing something wrong, and libstrangle certainly has other issues - but its GetDeviceProcAddr implementation seems to be heavily based on Mesa's overlay, which I would hope is doing the fallback dance correctly. It seems to me that infinite recursion is never going to be the correct answer, so it would probably be more robust if Fossilize avoids the recursion for GetDeviceProcAddr, even if the crash is not actually Fossilize's fault.
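
As a rough sketch of what "avoiding the recursion" could look like, the toy layer below combines the special case suggested above with a defensive self-reference check. Everything here is hypothetical (guarded_layer_gdpa, set_next, the simplified signatures); it is not Fossilize's implementation, and whether either check is the right fix for Fossilize is not established here.

```cpp
#include <cassert>
#include <cstring>

// Simplified stand-ins for the Vulkan function-pointer types (assumptions).
using PFN_void = void (*)();
using PFN_GetProcAddr = PFN_void (*)(const char *pName);

// Hypothetical: whatever the dispatch table claims is the next layer's entry point.
static PFN_GetProcAddr g_next_gdpa = nullptr;

PFN_void guarded_layer_gdpa(const char *pName);

// A well-behaved downstream layer that knows one command.
static void downstream_queue_submit() {}
static PFN_void downstream_gdpa(const char *pName) {
    if (std::strcmp(pName, "vkQueueSubmit") == 0)
        return &downstream_queue_submit;
    return nullptr;
}

void set_next(PFN_GetProcAddr next) { g_next_gdpa = next; }

PFN_void guarded_layer_gdpa(const char *pName) {
    assert(pName != nullptr);

    // Special case: answer "vkGetDeviceProcAddr" ourselves without
    // consulting the table at all.
    if (std::strcmp(pName, "vkGetDeviceProcAddr") == 0)
        return reinterpret_cast<PFN_void>(&guarded_layer_gdpa);

    // Defensive check: if the "next" pointer is missing or points back at
    // this layer, forwarding can never terminate, so give up instead.
    if (g_next_gdpa == nullptr || g_next_gdpa == &guarded_layer_gdpa)
        return nullptr;

    return g_next_gdpa(pName);
}
```

Returning a null pointer turns the stack overflow into a failed lookup, which is easier to diagnose but still leaves the device unusable; a real layer would probably also want to log the condition. The guard only hides the symptom if some other component is what puts the layer's own pointer into the table.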

DadSchoorse commented 3 years ago

If layer->getTable()->GetDeviceProcAddr is fossilize's own vkGetDeviceProcAddr, you have a bug somewhere, and I'm pretty sure it's not in fossilize, since all it does to get to that point is call down the layer chain. Your hypothesis that this is caused by pName being "vkGetDeviceProcAddr" also makes little sense. The last function before VK_LAYER_fossilize_GetDeviceProcAddr in your stack trace is loader_init_device_dispatch_table, which never calls vkGetDeviceProcAddr with "vkGetDeviceProcAddr".

As a total blind guess you could try replacing https://github.com/ValveSoftware/Fossilize/blob/master/layer/dispatch_helper.cpp#L31 with table->GetDeviceProcAddr = gpa.

It would also help to know in which order the layers get loaded.
