KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html

Reproducible device lost and reproducible DLL crash in toplevel #3681

Open BestYeen opened 2 years ago

BestYeen commented 2 years ago

Describe the Issue: I am using the Vulkan Configurator to validate my test program, in which I build sample code for various Vulkan techniques. I get a reproducible device-lost error when combining shader-based validation with a subpass input. A second problem is a reproducible crash, an invalid memory access in VkLayer_khronos_validation.dll, when I resize my window and recreate the swapchain.

Valid Usage ID: N/A

Environment:

Additional context: I took part in the recent LunarG Vulkan survey and was contacted with a request to file a crash report. Today I updated my drivers and the SDK to the most recent versions, made sure that no validation or synchronization validation errors remain, and reproduced these problems.

My sample with code excerpts is at http://hai.dogpixels.net/temp-LunarG/2022-01-16-Hs-Vulkan-Test.zip. The Notes.txt in there explains how to get into the crash situations. A visible part of my Vulkan experiments is documented in this retweet chain: https://twitter.com/BestYeen/status/1475964338464768001

Best regards and keep up the awesome work!

ncesario-lunarg commented 2 years ago

Thanks for the issue @BestYeen and for taking part in the survey! And thank you for the app to reproduce the issue! Unfortunately, I'm getting the following when trying to run it:

set renderer: my first triangle
t.exe: user error ([(0,"NVIDIA GeForce GTX 1650"),(1,"Intel(R) UHD Graphics 630")])

Does the app depend on something specific to your hardware? Do you perhaps have a .cabal file or some other means for running the app from source?

BestYeen commented 2 years ago

Oh, I should have mentioned that! If it detects more than one physical device, it lists them with indexes. Just run t.exe 0 (0 as the first arg from cmd) to select the discrete graphics card in this case. :)
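For readers following along, the selection behaviour described here can be sketched as below. This is a minimal illustration in Python, not the app's actual (Haskell) source: `pick_device` and its error text are hypothetical, and the device names are simply the two reported earlier in the thread.

```python
def pick_device(names, argv):
    """Return the index of the physical device to use.

    names: device names in enumeration order (hypothetical input).
    argv:  command line, e.g. ["t.exe", "0"].
    """
    if len(names) == 1:
        return 0  # only one device, nothing to choose between
    if len(argv) < 2:
        # More than one device and no index given: list them with indexes.
        raise ValueError(f"choose a device index: {list(enumerate(names))}")
    index = int(argv[1])
    if not 0 <= index < len(names):
        raise ValueError(f"index {index} out of range: {list(enumerate(names))}")
    return index


devices = ["NVIDIA GeForce GTX 1650", "Intel(R) UHD Graphics 630"]
selected = pick_device(devices, ["t.exe", "0"])
print(devices[selected])
```

With the argument `0`, the sketch picks the first listed device, which in this report is the discrete card.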

A build from source would unfortunately be a little more difficult as my project is set up a little non-standard. We can look into that if all else fails.

ncesario-lunarg commented 2 years ago

Thanks for the quick response. If I understand correctly, you are seeing the crash when resizing the window? I attempted to reproduce this using the settings specified in Notes.txt, but was not able to. It looks like the following layers are available on your system. Do you know if any of them are running when you're seeing the crash?

BestYeen commented 2 years ago

Renderdoc, OBS, Steam, etc. should all be off with me trying this out. But yeah, it might very well be that some hook I am not aware of is causing this, or some system service I've disabled ages ago. My system usually runs very well, has no hardware problems, no crashes or anything.

Sometimes the program doesn't crash if I use another option first. I've tried this with Haskell's traditional copying GC and the new nonmoving GC, and I think I can rule them out. The thread is also bound, which makes things more deterministic and could be safer for so much FFI usage, I think.

It does seem to always point to the same stack frames in the dll. Would there be a way to record the crash in a minidump somehow?

ncesario-lunarg commented 2 years ago

If you are able to build a debug version of validation layers, that should tell us exactly where the crash is occurring when running your exe attached to the VS debugger. And/or, if you can reproduce the crash while recording a playback using gfxreconstruct, that might also allow us to reproduce the crash on our end (as I stated earlier, I was unable to reproduce the crash on my system).

BestYeen commented 2 years ago

Using gfxreconstruct first and synchronization validation second, it doesn't crash. Using synchronization validation with the layer list edited to also include gfxreconstruct, and reordered a little, kept the crash:
http://hai.dogpixels.net/temp-LunarG/gfxrecon_capture_20220120T004841.gfxr
http://hai.dogpixels.net/temp-LunarG/t.txt
This crash seems to depend on the order in which the layers are used. Maybe this gfxr file helps.
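One way to test the layer-order dependence systematically, outside the Vulkan Configurator, is to enumerate the possible orders and launch the app once per order through the loader's `VK_INSTANCE_LAYERS` environment variable. The sketch below only builds the environments. `layer_orders` is a hypothetical helper, the launch line is left commented out, and it assumes no Vulkan Configurator override is active that would replace the environment setting.

```python
import itertools
import os

LAYERS = ["VK_LAYER_KHRONOS_validation", "VK_LAYER_LUNARG_gfxreconstruct"]


def layer_orders(layers, base_env=None):
    """Yield (order, env) pairs, one per permutation of the layer list."""
    base = dict(os.environ if base_env is None else base_env)
    for order in itertools.permutations(layers):
        env = dict(base)
        # The loader expects ';' on Windows and ':' on Linux,
        # which matches os.pathsep on both platforms.
        env["VK_INSTANCE_LAYERS"] = os.pathsep.join(order)
        yield order, env


for order, env in layer_orders(LAYERS):
    print(" -> ".join(order))
    # Hypothetical launch, left commented out:
    # subprocess.run(["t.exe", "0"], env=env)
```

Recording which orders crash and which do not would narrow down whether the problem sits in one layer or in the interaction between the two.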

Building a debug version of a validation dll or putting VS to use again is a little beyond me at the moment...

^ That was the access violation.
v This is the "device lost" (R, 4): http://hai.dogpixels.net/temp-LunarG/gfxrecon_capture_20220120T010718.gfxr

BestYeen commented 2 years ago

A few months later now, I've discovered that some of my samples in question subtly over-allocated from descriptor pools. It mostly worked fine on a very forgiving Nvidia driver but crashed instantly on an Intel UHD system with no discrete graphics card. I don't know if it was the reason for all that, but it's a good candidate for causing crashes - and also for validation by the SDK tools. :)
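The kind of over-allocation described here can be caught on the application side by keeping a running tally against the limits the pool was created with. The sketch below models that bookkeeping in Python. `PoolBudget` and the string type names are hypothetical stand-ins: `max_sets` plays the role of `VkDescriptorPoolCreateInfo::maxSets` and `pool_sizes` the per-type `descriptorCount` values from `VkDescriptorPoolSize`. It does not call Vulkan and is not taken from the sample.

```python
from collections import Counter


class PoolBudget:
    """Tracks what has been allocated from one descriptor pool (model only)."""

    def __init__(self, max_sets, pool_sizes):
        self.max_sets = max_sets              # stands in for maxSets
        self.capacity = Counter(pool_sizes)   # descriptor type -> descriptorCount
        self.sets_used = 0
        self.used = Counter()

    def allocate(self, layout):
        """Reserve one set; layout maps descriptor type -> descriptors per set."""
        if self.sets_used + 1 > self.max_sets:
            raise RuntimeError("pool exhausted: set count exceeded")
        for dtype, count in layout.items():
            if self.used[dtype] + count > self.capacity[dtype]:
                raise RuntimeError(f"pool exhausted: too many {dtype} descriptors")
        # Only commit once every check has passed.
        self.sets_used += 1
        self.used.update(layout)


pool = PoolBudget(max_sets=2, pool_sizes={"UNIFORM_BUFFER": 2, "INPUT_ATTACHMENT": 1})
per_set = {"UNIFORM_BUFFER": 1, "INPUT_ATTACHMENT": 1}
pool.allocate(per_set)      # fits
try:
    pool.allocate(per_set)  # second INPUT_ATTACHMENT exceeds the pool size
except RuntimeError as err:
    print(err)
```

In a real application the same tally would sit next to each `vkAllocateDescriptorSets` call. Drivers may react differently once a pool is exceeded, which would be consistent with the Nvidia and Intel difference reported above.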

spencer-lunarg commented 3 months ago

@BestYeen I spent some time trying to reproduce this but wasn't able to with the latest 1.3.280.1 SDK.

This was reported against a very old version of VVL, and I'm not sure whether it was "just fixed" somewhere in the last year.