KhronosGroup / Vulkan-Loader

Vulkan Loader
https://vulkan.lunarg.com/doc/sdk/latest/linux/LoaderInterfaceArchitecture.html
Other
519 stars 284 forks source link

vkDestroySwapchainKHR() causes invalid pointer dereference on Linux with SDL2+Nvidia #593

Closed spinnylights closed 3 years ago

spinnylights commented 3 years ago

Using builds from the latest commits of this repo as well as Vulkan-Headers and Vulkan-Layers, vkDestroySwapchainKHR() attempts to dereference an invalid pointer at loader/wsi.c:487 loader/wsi.c:479 (disp, which takes on the value 0x0003c7480c75c085) in my application. In the version of these libraries I was using earlier (1.2.170), disp was a null pointer under the same conditions.

This is under Linux with kernel v5.12.5 and proprietary Nvidia drivers v465.31. My application depends on SDL2; I'm building against v2.0.14. The compiler is g++ v10.3.0 and I'm using gdb 10.2 as my debugger.

Right now the application is very small—it initializes SDL and creates a window, sets up a Vulkan instance, creates a surface, picks a physical device, creates a logical device, creates a swapchain, and gets the array of swapchain images, then destroys the swapchain, logical device, surface, instance, and window in that order, then exits. I have all the debug messages enabled and am logging the results of all the SDL and Vulkan commands that supply return codes, and don't see anything out-of-order. I've enabled the swapchain extension etc. and swapchain creation goes fine, it's just during destruction that things go wrong. Both the swapchain and the logical device are valid at this point according to the codes returned at their instantiations.

This crash does not occur if I use vkGetDeviceProcAddr() to get a pointer to vkDestroySwapchainKHR() and call it that way. However, I have to leave the validation layer off (see below). The vkcube and vkcubepp demos as well as a small GLFW-based application work fine in my environment, which makes me wonder if SDL has something to do with this.

If I do call vkDestroySwapchainKHR() through vkGetDeviceProcAddr() and enable the validation layer, a crash occurs in vkDestroySwapchainKHR() but in a different place—vulkan_layer_chassis::DestroySwapchainKHR() at line layers/generated/chassis.cpp:4800. In that case, layer_data->intercept_vectors is empty, so trying to access an element in it causes a crash. I'm not sure if this is the same issue in a slightly different guise or merely superficially similar.

spinnylights commented 3 years ago

I decided to try not linking with the Vulkan loader and getting all the Vulkan function addresses at runtime instead. The behavior is the same as in the latter case above—no crash if I leave the debug layer off, but a crash in the same place if I enable it. I suppose this is no surprise, since as far as I know both the loader and the validations layer library are loaded dynamically in this case, but I figured it wouldn't hurt to check.

charles-lunarg commented 3 years ago

This is an interesting issue, but it would be very helpful to have the source code used to trigger this crash to be able to reproduce the issue. Sometimes its a specific combination of inputs that trigger an issue in the loader.

I looked at wsi.c and line 487 does not correspond to VkDestroySwapchainKHR (but its not far off either). Are you sure it was version 1.2.170 or that it was line 487? (I don't doubt that it was VkDestroySwapchainKHR which crashed but best to be exactly sure what line of code the crash happened on.

spinnylights commented 3 years ago

Line 487 is for the most recent commit (3bdb32d) Sorry, I meant 479: https://github.com/KhronosGroup/Vulkan-Loader/blob/3bdb32d29f2d96b34e2865dc4a2f2be8eb457e1d/loader/wsi.c#L479 Not quite sure what I did there. In any case I can track it down for 1.2.170 as well if that would be helpful. Also, I'm happy to provide code—let me tidy this project up a bit and I'll put it on GitHub since I need to do that at some point anyway.

charles-lunarg commented 3 years ago

Ahh I may have been looking at the wrong commit then, so seems like a simple off by one error. I was just initially confused why the numbers didn't match up, but its seems like a red herring now.

And awesome, code helps a lot in any case (at least for these kinds of crashes). So no worries, take the time you need to put it working order.

spinnylights commented 3 years ago

Okay, here it is! It could use API docs and unit tests of course, but hopefully the code is easy enough to follow as-is (let me know if you have any questions). I was hoping that during my tidying-up I would happen on the source of the crash but no such luck. However, I did find something interesting—if I take out swapchain creation and run with debug utils enabled, the logger prints this:

Vulkan: create logical device...success graphics queue index: 0 present queue index: 2

Vulkan: destroying logical device...Vulkan (validations): Unloading layer library libVkLayer_khronos_validation.so

OK

Vulkan: destroying surface...OK

Vulkan: destroying debug messenger...Vulkan (validations): Validation Error: [ VUID-vkDestroyDebugUtilsMessengerEXT-messenger-parameter ] Object 0: VK_NULL_HANDLE, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0xe894c9ae | Invalid VkDebugUtilsMessengerEXT Object 0x1f. The Vulkan spec states: If messenger is not VK_NULL_HANDLE, messenger must be a valid VkDebugUtilsMessengerEXT handle (https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VUID-vkDestroyDebugUtilsMessengerEXT-messenger-parameter)

crypt_underworld: /var/tmp/portage/media-libs/vulkan-layers-9999/work/vulkan-layers-9999/layers/generated/thread_safety.h:143: std::shared_ptr counter::FindObject(T) [with T = VkDebugUtilsMessengerEXT_T*]: Assertion `object_table.contains(object)' failed.

Of course, there are a few strange things about this. Aside from the fact that the debug messenger handle ought not to be invalid there, I'm not sure why the validation library would be unloaded when the logical device is destroyed instead of later on, and the failed assertion is interestingly similar to e.g. this bug (although maybe that's just a coincidence). Also, the debug messenger prints a list of device extensions twice (not shown due to length, although I'll share the whole log if desired), starting right after the surface is created (with SDL's involvement, of course).

Anyway, with swapchain creation included in the build, it still crashes at close to the same spot, albeit a couple lines away—layers/generated/chassis.cpp:4798 in the validation layers codebase. In this case, device appears to have the value 0x3, which leads to an invalid memory access in get_dispatch_key(). I'm not quite sure which commit that changed on (in my project); I can figure it out if that would be helpful.

As before, the whole program runs and exits fine if the validation layer/debug utils are not enabled (you can turn them on and off with the -d/--debug flag). If I link to the loader again and call vkDestroySwapchainKHR() directly (i.e. not through a loaded function pointer) it crashes in the same spot as before (wsi.c:479) whether or not I enabled the debug utils, and if I link to the loader and don't create a swapchain at all the behavior is the same as if I don't link to the loader (exits okay if debug utils are off, exits with failed assertion as above if they're on).

charles-lunarg commented 3 years ago

I have just compiled & run the program and found nothing wrong with the loader. What is happening is that you are not initializing dev in the Swapchain class. Thus you pass in a garbage pointer. Simply

This would have been caught by Address Sanitizer, a VERY handy tool to have in your arsenal. All that is needed to enable it is passing -fsanitize=address to the CPPFLAGS. Addresss sanitizer is most powerful for catching memory bugs, like using uninitialized memory, use-after-free, double-free, nullptr-dereference, and stack corruption, but also includes a leak checker. I mention that because graphics drivers (specifically mesa) are known for creating a lot of false positives, so a leak summary may contain spurious entries that are unrelated to your application.

This is what Address Sanitizer reported when I ran the program. Note that it is a 'nullptr dereference' because something cleared out the memory rather than it being uninitialized. But once I got into the debugger it was immediately clear that VkDevice dev was null when it should not be.

AddressSanitizer:DEADLYSIGNAL
=================================================================
==1105340==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f85ed248902 bp 0x000000000000 sp 0x7ffedd116dd0 T0)
==1105340==The signal is caused by a READ memory access.
==1105340==Hint: address points to the zero page.
    #0 0x7f85ed248902  (/lib/x86_64-linux-gnu/libVkLayer_khronos_validation.so+0xc5f902)
    #1 0x55ecb25e46ef in cu::Swapchain::~Swapchain() ../src/swapchain.cpp:136
    #2 0x55ecb25aa19e in cu::Vulkan::~Vulkan() ../include/vulkan.hpp:53
    #3 0x55ecb25aa19e in cu::Engine::~Engine() ../include/engine.hpp:40
    #4 0x55ecb25aa19e in main ../src/main.cpp:51
    #5 0x7f860b549564 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x28564)
    #6 0x55ecb25a9b4d in _start (/home/cdgiessen/temp/crypt/crypt_underworld/build/crypt_underworld+0x9b4d)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libVkLayer_khronos_validation.so+0xc5f902) 
==1105340==ABORTING
spinnylights commented 3 years ago

Well, that was a very silly mistake on my part and I'm terribly sorry to have taken up your time with it! 0_0 As it turns out, that was not the only thing I was doing wrong—I assumed that VkDebugUtilsMessengerEXT was a fundamental type like a uint64_t or something and thus would be available in Instance's destructor, but that does not seem to be the case necessarily. By the time Instance's destructor was getting called, the debug messenger already seems to have been cleared out, which was causing the other set of symptoms. I moved the debug messenger out into its own class to have more explicit control over its lifetime and that solved that problem.

Thanks for the tip about address sanitizer! I'm used to using Valgrind to troubleshoot issues like this, but I sheepishly have to admit that I hadn't run it before I filed this report—the behavior I was seeing didn't fit my intuition for what memory bugs in my code would look like and I just got carried away. I won't do that again! 0_0 Valgrind made fixing the debug messenger issue easy; having had this experience I won't commit new code without running it first, whatever things look like. It seems nice to build with address sanitizer too since that doesn't need to be run separately and appears to be lighter-weight, so they make a nice combo. I've been working on very little sleep for the past week and clearly shouldn't be trusting my intuition alone on matters like this in any case.

charles-lunarg commented 3 years ago

Its usually the silly mistakes that slip under our radar so no worries. valgrind is also a great tool, while I have my preferences for ASAN since I just leave it on in debug builds (well, relwithdebinfo actually) but any tool is better than no tool.

The confusing thing was that when I disabled the validation layers and didn't have a debug build (-Og), the code worked. When I looked at the value in the debugger, it had a value that looked similar to a vulkan handle but was not the same as the VkDevice. I wonder if this is a case of the stars aligning and causing it to work even though it should of crashed.