baldurk / renderdoc

RenderDoc is a stand-alone graphics debugging tool.
https://renderdoc.org
MIT License

RenderDoc crashes when loading Vulkan capture #703

Closed by nsubtil 6 years ago

nsubtil commented 7 years ago

One of the apps we're working on generates RenderDoc captures that cause the UI to crash when trying to load them. Here is the relevant portion of the stack trace:

0c 00000000`2e67d0f0 00007ffb`f0c0ed42 : 00000000`26477c60 00000000`2649eef0 00000000`226b0690 00000000`00000000 : renderdoc!WrappedVulkan::Serialise_vkAllocateMemory+0x23d [c:\users\nsubtil\src\renderdoc\renderdoc\driver\vulkan\wrappers\vk_resource_funcs.cpp @ 163] 
0d 00000000`2e67d940 00007ffb`f0c175b7 : 00000000`26477c60 00000000`00024091 00000000`0000000a 00007ffb`0000000a : renderdoc!WrappedVulkan::ProcessChunk+0x172 [c:\users\nsubtil\src\renderdoc\renderdoc\driver\vulkan\vk_core.cpp @ 1987] 
0e 00000000`2e67dc00 00007ffb`f0d6579f : 00000000`26477c60 cccccccc`cccccccc cccccccc`cccccccc cccccccc`cccccccc : renderdoc!WrappedVulkan::ReadLogInitialisation+0x237 [c:\users\nsubtil\src\renderdoc\renderdoc\driver\vulkan\vk_core.cpp @ 1688] 
0f 00000000`2e67df90 00007ffb`efdbf3f5 : 00000000`26477d40 cccccccc`cccccccc cccccccc`cccccccc cccccccc`cccccccc : renderdoc!VulkanReplay::ReadLogInitialisation+0x2f [c:\users\nsubtil\src\renderdoc\renderdoc\driver\vulkan\vk_replay.cpp @ 663] 
10 00000000`2e67dfc0 00007ffb`efdb7fad : 00000000`26530ff0 00000000`26477d40 00007ffb`f10b21c0 00000000`000005ff : renderdoc!ReplayController::PostCreateInit+0x85 [c:\users\nsubtil\src\renderdoc\renderdoc\replay\replay_controller.cpp @ 1561] 
11 00000000`2e67e500 00007ffb`efd8d5cc : 00000000`26530ff0 00000000`264ca1c0 cccccccc`cccccccc cccccccc`cccccccc : renderdoc!ReplayController::CreateDevice+0x1fd [c:\users\nsubtil\src\renderdoc\renderdoc\replay\replay_controller.cpp @ 1536] 
12 00000000`2e67e600 00007ffb`efd935bd : 00000000`264b80b0 00000000`2e67e748 00000000`03a505f0 cccccccc`cccccccc : renderdoc!CaptureFile::OpenCapture+0x12c [c:\users\nsubtil\src\renderdoc\renderdoc\replay\capture_file.cpp @ 105] 
13 00000000`2e67e6f0 00007ffb`f4536b5b : 00000000`264bc0e0 00000000`03a505f0 00000000`2e67e8f0 00000000`03e84228 : renderdoc!RENDERDOC_CreateReplayRenderer+0x8d [c:\users\nsubtil\src\renderdoc\renderdoc\replay\entry_points.cpp @ 360] 

Crash occurs during a call to vkAllocateMemory in vk_resource_funcs.cpp:163:

    // serialised memory type index is non-remapped, so we remap now.
    // PORTABILITY may need to re-write info to change memory type index to the
    // appropriate index on replay
    info.memoryTypeIndex = m_PhysicalDeviceData.memIdxMap[info.memoryTypeIndex];

    VkResult ret = ObjDisp(device)->AllocateMemory(Unwrap(device), &info, NULL, &mem);    // <--- crash

The problem seems to be that info.memoryTypeIndex is invalid (0xffffffff) by the time AllocateMemory is called. The app being captured works fine, and this particular memoryTypeIndex doesn't show up in the API trace itself.
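To illustrate the failure mode, here is a minimal sketch of a checked remap. It assumes the 0xffffffff value comes out of the remap table as an "unmapped" marker, which is a guess and not something confirmed from the RenderDoc source. `kInvalidMemoryType` and `RemapMemoryTypeIndex` are hypothetical names; only `memIdxMap` appears in the snippet above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sentinel for "no matching memory type on the replay device".
// Assumption: this is what an unmapped table entry would hold.
constexpr uint32_t kInvalidMemoryType = 0xffffffffu;

// Hypothetical helper, not RenderDoc's code. Looks up the replay-time index
// for a captured memory type index. Returns false (and leaves replayIndex
// untouched) when the index is out of range or has no mapping, so the caller
// can fail the load cleanly instead of passing a bad index to the driver.
bool RemapMemoryTypeIndex(const std::vector<uint32_t> &memIdxMap, uint32_t capturedIndex,
                          uint32_t &replayIndex)
{
  if(capturedIndex >= memIdxMap.size())
    return false;    // captured index falls outside the table

  const uint32_t mapped = memIdxMap[capturedIndex];
  if(mapped == kInvalidMemoryType)
    return false;    // no compatible memory type was found on replay

  replayIndex = mapped;
  return true;
}
```

With a check like this the replay could report "capture is incompatible with this device" rather than crashing inside the driver.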

I can make the capture file available if that would help. Stack trace is from commit d539fed918, though the same crash happens on the latest release version. Similar symptoms show up on several NVIDIA GPUs (haven't tested any other vendors).

baldurk commented 7 years ago

Can you post the full output log somewhere, like gist? I think this could be caused by replaying a capture made on one GPU/driver combo on an incompatibly different GPU/driver combo (such that the memory types don't match up).

nsubtil commented 7 years ago

Here is the log: https://gist.github.com/nsubtil/445df967fd53a107a406620df06ad0c6

baldurk commented 7 years ago

Right, so I think this is the key problem:

Captured log describes physical device 0:
   - Intel(R) HD Graphics 530 (ver 0.16 patch 0x3) - 8086:191b
Mapping during replay to physical device 0:
   - GeForce GTX 1060 (ver 382.83 patch 0x0) - 10de:1c20
Captured log describes physical device 1:
   - GeForce GTX 1060 (ver 382.83 patch 0x0) - 10de:1c20
Mapping during replay to physical device 0:
   - GeForce GTX 1060 (ver 382.83 patch 0x0) - 10de:1c20
Captured log describes physical device 2:
   - GeForce GTX 1080 (ver 382.83 patch 0x0) - 10de:1b80
Mapping during replay to physical device 1:
   - GeForce GTX 1080 (ver 382.83 patch 0x0) - 10de:1b80

On capture the physical devices were: 0: Intel IGPU, 1: GTX 1060, 2: GTX 1080. On replay it looks like the physical devices are just 0: GTX 1060, 1: GTX 1080. If there was an Intel IGPU available (maybe as physical device 2) then renderdoc should have remapped to it, but it looks like it's not available at all.

In theory this isn't a problem if the application explicitly selected the nvidia card, and ignored the intel card. However given the memory type problem you're running into it sounds like the capture used the intel card and there was nothing to map it to on replay.
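As a rough sketch of the matching problem described here (not RenderDoc's actual remapping logic), the following looks up a captured device among the replay devices by the vendor:device pair printed in the log. `PhysDeviceId`, `kNoMatch` and `FindReplayDevice` are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal stand-in for the vendor:device pair printed in the log
// (e.g. 8086:191b). Not a RenderDoc or Vulkan type.
struct PhysDeviceId
{
  uint32_t vendorID;
  uint32_t deviceID;
};

// Hypothetical marker for "nothing on the replay machine matches".
constexpr size_t kNoMatch = static_cast<size_t>(-1);

// Returns the index of the first replay device with the same vendor and
// device ID as the captured one, or kNoMatch if there is none.
size_t FindReplayDevice(const PhysDeviceId &captured, const std::vector<PhysDeviceId> &replay)
{
  for(size_t i = 0; i < replay.size(); i++)
  {
    if(replay[i].vendorID == captured.vendorID && replay[i].deviceID == captured.deviceID)
      return i;
  }
  return kNoMatch;
}
```

With the replay list from the log (10de:1c20, 10de:1b80), the Intel device 8086:191b has no match. The log shows it ending up mapped to replay device 0, the GTX 1060, whereas this sketch reports the missing match explicitly.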

As to why you're not getting the same physical devices enumerated on capture and replay (or why renderdoc isn't remapping the intel card) - that I don't know. I know nvidia has a layer which messes with the order but I thought it would just re-order the list to prioritise the physical device it wanted to favour for the application, rather than outright remove a physical device.

Are you able to recompile renderdoc? I could give you a branch with extra logging enabled that could eliminate any doubt about which physical devices were available on capture and replay.

nsubtil commented 7 years ago

Yes, building from source works fine. I can certainly try a different branch if you point me at it.

baldurk commented 7 years ago

OK I've added a branch vk_phys_logging which prints to the log which physical devices are available on capture and replay. Can you try running with that and see what it prints out?

nsubtil commented 7 years ago

Apologies for the delay, here's the log from vk_phys_logging: https://gist.github.com/nsubtil/cae1592291452713d97098a53661296f

baldurk commented 7 years ago

OK yeah, so that confirms the problem. The Intel card is available as physical device 0 on capture, but is completely missing on replay:

RDOC 015516: [23:13:41]  vk_device_funcs.cpp( 439) - Log     - [0] - Intel(R) HD Graphics 530 (ver 0.16 patch 0x3) - 8086:191b
RDOC 015516: [23:13:41]  vk_device_funcs.cpp( 439) - Log     - [1] - GeForce GTX 1060 (ver 382.83 patch 0x0) - 10de:1c20
RDOC 015516: [23:13:41]  vk_device_funcs.cpp( 439) - Log     - [2] - GeForce GTX 1080 (ver 382.83 patch 0x0) - 10de:1b80
...
RDOC 011316: [23:14:30]  vk_device_funcs.cpp( 275) - Log     - During Replay 2 physical devices:
RDOC 011316: [23:14:30]  vk_device_funcs.cpp( 284) - Log     - [0] - GeForce GTX 1060 (ver 382.83 patch 0x0) - 10de:1c20
RDOC 011316: [23:14:30]  vk_device_funcs.cpp( 284) - Log     - [1] - GeForce GTX 1080 (ver 382.83 patch 0x0) - 10de:1b80

Currently it's impossible to replay on a significantly different device (like a different IHV) so this makes the replay fail.

I don't know why the physical devices reported are changing. Maybe this is something nvidia's layer is doing; if so, that is actively hostile behaviour. I thought it just re-ordered the devices, which still isn't great since applications should choose for themselves, but at least it meant everything was still reported.

This is beyond RenderDoc's ability to fix though. All I can suggest is that either in code you avoid using the Intel physical device if possible, or locally just disable the intel vulkan driver if you don't need it.
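For the "avoid the Intel physical device in code" option, a minimal sketch of the selection logic might look like this. It uses a plain struct in place of the `vendorID`/`deviceID` fields an application would read from `VkPhysicalDeviceProperties` after `vkEnumeratePhysicalDevices`. `DeviceInfo`, `kVendorIntel` and `PickNonIntelDevice` are hypothetical names, and the fallback policy is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// PCI vendor ID for Intel, as seen in the log line "8086:191b".
constexpr uint32_t kVendorIntel = 0x8086;

// Hypothetical stand-in for the fields an application would read from
// VkPhysicalDeviceProperties for each enumerated device.
struct DeviceInfo
{
  uint32_t vendorID;
  uint32_t deviceID;
};

// Returns the index of the first device whose vendor is not Intel.
// Falls back to index 0 when only Intel devices are present, and
// returns -1 for an empty list.
int PickNonIntelDevice(const std::vector<DeviceInfo> &devices)
{
  if(devices.empty())
    return -1;

  for(size_t i = 0; i < devices.size(); i++)
  {
    if(devices[i].vendorID != kVendorIntel)
      return static_cast<int>(i);
  }

  return 0;    // only Intel devices available, use the first one
}
```

Given the capture-time list from the log (Intel HD 530, GTX 1060, GTX 1080), this would pick index 1, so the capture would only reference a device that also exists on replay.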

baldurk commented 6 years ago

I believe physical device remapping should now be solid in v1.0, as long as there isn't a completely incompatible set of devices - that's still unsupported.