baldurk / renderdoc

RenderDoc is a stand-alone graphics debugging tool.
https://renderdoc.org
MIT License

VK_ERROR_DEVICE_LOST in ApplyInitialContents and on opening Texture or Mesh Viewer #2231

Closed mtsr closed 3 years ago

mtsr commented 3 years ago

Description

After capturing a frame from the 3d_scene example of Bevy (specific commit), renderdoc logs VK_ERROR_DEVICE_LOST errors in its diagnostic log, fails to show the texture view, and hangs with Please wait, working... if I try to inspect a different event from the Event Browser, or do some other things.

This screenshot shows where renderdoc really hangs, after this.

image

This is renderdoc's diagnostic log from before I cause it to hang.

VK_ERROR_DEVICE_LOST.txt

To try to get a better idea of what's happening I compiled the latest renderdoc from git (commit cd5d0ede440aff1fcb44121a21c088b77ec64285) and tried it again with that, with the same results. Visual Studio (2019, because the right Windows SDK for 2015 isn't on MS' site anymore) gives me this stack when the first exception occurs (on running the application directly):

>   renderdoc.dll!WrappedVulkan::FlushQ() Line 351  C++
    renderdoc.dll!WrappedVulkan::ApplyInitialContents() Line 2785   C++
    renderdoc.dll!WrappedVulkan::ReplayLog(unsigned int startEventID, unsigned int endEventID, ReplayLogType replayType) Line 3434  C++
    renderdoc.dll!VulkanReplay::ReplayLog(unsigned int endEventID, ReplayLogType replayType) Line 204   C++
    renderdoc.dll!ReplayController::SetFrameEvent(unsigned int eventId, bool force) Line 81 C++
    qrenderdoc.exe!CaptureContext::SetEventID::__l2::<lambda>(IReplayController * r) Line 1507  C++
    [External Code] 
    qrenderdoc.exe!ReplayManager::run(int proxyRenderer, const QString & capturefile, const ReplayOptions & opts, std::function<void __cdecl(float)> progress) Line 496 C++
    qrenderdoc.exe!ReplayManager::OpenCapture::__l2::<lambda>() Line 56 C++
    [External Code] 
    qrenderdoc.exe!LambdaThread::process() Line 345 C++
    qrenderdoc.exe!QtPrivate::FunctorCall<QtPrivate::IndexesList<>,QtPrivate::List<>,void,void (__cdecl LambdaThread::*)(void) __ptr64>::call(void(LambdaThread::*)() f, LambdaThread * o, void * * arg) Line 136   C++
    qrenderdoc.exe!QtPrivate::FunctionPointer<void (__cdecl LambdaThread::*)(void) __ptr64>::call<QtPrivate::List<>,void>(void(LambdaThread::*)() f, LambdaThread * o, void * * arg) Line 170   C++
    qrenderdoc.exe!QtPrivate::QSlotObject<void (__cdecl LambdaThread::*)(void) __ptr64,QtPrivate::List<>,void>::impl(int which, QtPrivate::QSlotObjectBase * this_, QObject * r, void * * a, bool * ret) Line 121   C++
    [External Code] 

and this on loading a saved capture:

>   renderdoc.dll!WrappedVulkan::SubmitCmds(VkSemaphore_T * * unwrappedWaitSemaphores, unsigned int * waitStageMask, unsigned int waitSemaphoreCount) Line 290  C++
    renderdoc.dll!WrappedVulkan::AddFrameTerminator(unsigned __int64 queueMarkerTag) Line 3409  C++
    renderdoc.dll!WrappedVulkan::ContextReplayLog(CaptureState readType, unsigned int startEventID, unsigned int endEventID, bool partial) Line 2676    C++
    renderdoc.dll!WrappedVulkan::ReadLogInitialisation(RDCFile * rdc, bool storeStructuredBuffers) Line 2438    C++
    renderdoc.dll!VulkanReplay::ReadLogInitialisation(RDCFile * rdc, bool storeStructuredBuffers) Line 199  C++
    renderdoc.dll!ReplayController::PostCreateInit(IReplayDriver * device, RDCFile * rdc) Line 2042 C++
    renderdoc.dll!ReplayController::CreateDevice(RDCFile * rdc, const ReplayOptions & opts) Line 2009   C++
    renderdoc.dll!CaptureFile::OpenCapture(const ReplayOptions & opts, std::function<void __cdecl(float)> progress) Line 364    C++
    qrenderdoc.exe!ReplayManager::run(int proxyRenderer, const QString & capturefile, const ReplayOptions & opts, std::function<void __cdecl(float)> progress) Line 450 C++
    qrenderdoc.exe!ReplayManager::OpenCapture::__l2::<lambda>() Line 56 C++
    [External Code] 
    qrenderdoc.exe!LambdaThread::process() Line 345 C++
    qrenderdoc.exe!QtPrivate::FunctorCall<QtPrivate::IndexesList<>,QtPrivate::List<>,void,void (__cdecl LambdaThread::*)(void) __ptr64>::call(void(LambdaThread::*)() f, LambdaThread * o, void * * arg) Line 136   C++
    qrenderdoc.exe!QtPrivate::FunctionPointer<void (__cdecl LambdaThread::*)(void) __ptr64>::call<QtPrivate::List<>,void>(void(LambdaThread::*)() f, LambdaThread * o, void * * arg) Line 170   C++
    qrenderdoc.exe!QtPrivate::QSlotObject<void (__cdecl LambdaThread::*)(void) __ptr64,QtPrivate::List<>,void>::impl(int which, QtPrivate::QSlotObjectBase * this_, QObject * r, void * * a, bool * ret) Line 121   C++
    [External Code] 

followed by

>   renderdoc.dll!WrappedVulkan::SubmitCmds(VkSemaphore_T * * unwrappedWaitSemaphores, unsigned int * waitStageMask, unsigned int waitSemaphoreCount) Line 290  C++
    renderdoc.dll!WrappedVulkan::ApplyInitialContents() Line 2782   C++
    renderdoc.dll!WrappedVulkan::ReplayLog(unsigned int startEventID, unsigned int endEventID, ReplayLogType replayType) Line 3434  C++
    renderdoc.dll!VulkanReplay::ReplayLog(unsigned int endEventID, ReplayLogType replayType) Line 204   C++
    renderdoc.dll!ReplayController::SetFrameEvent(unsigned int eventId, bool force) Line 81 C++
    qrenderdoc.exe!CaptureContext::SetEventID::__l2::<lambda>(IReplayController * r) Line 1507  C++
    [External Code] 
    qrenderdoc.exe!ReplayManager::run(int proxyRenderer, const QString & capturefile, const ReplayOptions & opts, std::function<void __cdecl(float)> progress) Line 496 C++
    qrenderdoc.exe!ReplayManager::OpenCapture::__l2::<lambda>() Line 56 C++
    [External Code] 
    qrenderdoc.exe!LambdaThread::process() Line 345 C++
    qrenderdoc.exe!QtPrivate::FunctorCall<QtPrivate::IndexesList<>,QtPrivate::List<>,void,void (__cdecl LambdaThread::*)(void) __ptr64>::call(void(LambdaThread::*)() f, LambdaThread * o, void * * arg) Line 136   C++
    qrenderdoc.exe!QtPrivate::FunctionPointer<void (__cdecl LambdaThread::*)(void) __ptr64>::call<QtPrivate::List<>,void>(void(LambdaThread::*)() f, LambdaThread * o, void * * arg) Line 170   C++
    qrenderdoc.exe!QtPrivate::QSlotObject<void (__cdecl LambdaThread::*)(void) __ptr64,QtPrivate::List<>,void>::impl(int which, QtPrivate::QSlotObjectBase * this_, QObject * r, void * * a, bool * ret) Line 121   C++
    [External Code] 

followed by more (a bit too many to paste). After a number of exceptions renderdoc starts running without exception again, until I do something to trigger the actual hang, such as clicking a drawcall in the event browser, after which renderdoc hangs without exception.

I suspect there is a problem with the application I'm capturing from, but I was not expecting renderdoc to hang.

Steps to reproduce

Here is a capture from the application that reproduces the error every time for me (VK_ERROR_DEVICE_LOST.rdc).

The other file (no_VK_ERROR_DEVICE_LOST.rdc) is a capture from the last commit on bevy main (commit b6be8a5314e027a0b0f3ee48d04c14b52fe74676) that doesn't exhibit this problem. I'd put the offending commit (by my own hand) at 45b2db70705da24a89426e6b6e77d603a3983025, if that helps in some way.

This is the renderdoc diagnostic log for this second capture.

no_VK_ERROR_DEVICE_LOST.txt

Alternatively, with rust installed, one could clone https://github.com/bevyengine/bevy, checkout f520a341d5737600dbf89015b7729109d67cf041 (the HEAD of main at the time of writing) and build the application with cargo build --example 3d_scene. It can then be launched from target\debug\examples\3d_scene.exe with the root of the project as working dir and CARGO_MANIFEST_DIR=<path to working dir>.
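The launch step described above can be sketched as follows. This is a minimal Python sketch, not part of Bevy or RenderDoc: the helper name `launch_spec` and the example checkout path are hypothetical, and it only assembles the executable path, working directory and `CARGO_MANIFEST_DIR` value mentioned in the text.

```python
import os
from pathlib import PureWindowsPath


def launch_spec(repo_root):
    """Describe how to launch the 3d_scene example from a Bevy checkout.

    repo_root is the (hypothetical) path of the cloned repository. The
    example is started with the repository root as working directory and
    CARGO_MANIFEST_DIR pointing at that same directory.
    """
    root = PureWindowsPath(repo_root)
    exe = root / "target" / "debug" / "examples" / "3d_scene.exe"
    # Copy the current environment and add the variable the example needs.
    env = dict(os.environ, CARGO_MANIFEST_DIR=str(root))
    return {"executable": str(exe), "cwd": str(root), "env": env}
```

The returned values could be entered as the executable path, working directory and environment variable when launching the program for capture, or passed to `subprocess.run([spec["executable"]], cwd=spec["cwd"], env=spec["env"])` to run it directly.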

I can also send over a compiled executable if desired.

Environment

gpu-z

Edit: The issue seems similar to #2216. I've tried to include as much detail as I can, but I will provide what other details I can if asked.

baldurk commented 3 years ago

FYI I'm not working for the next couple of weeks so I can look into this when I get back. In the meantime I would recommend running the validation layers to ensure you don't have any invalid API use, as that can cause RenderDoc to hang or crash as it doesn't handle invalid use of the API.

For the repro case I should hopefully be able to tell what's wrong from the capture. If I need to run the application I can, and I've been able to compile things with rust before but I've found it a pain. So a compiled executable if you can share it would be very welcome.

mtsr commented 3 years ago

Thank you for your quick response. I will try the validation layers first and post back here.

If you end up needing the executable, please let me know and I'll send it to you then.

In the meantime, enjoy the time off (if that's what it is)!

mtsr commented 3 years ago

I ran with API validation, but the only thing that stood out to me was:

Core     PID  19900: [20:51:30]         vk_common.cpp(981) - Error   - Unexpected descriptor type
Core     PID  19900: [20:51:30]         vk_common.cpp(981) - Error   - Unexpected descriptor type

I checked extensively and could not find the application writing that specific descriptor type (which I found by adding a small logging change; its value equals the maximum integer), or any other invalid descriptor type for that matter.

This is the full diagnostic log. API validation.txt

and the capture https://gofile.io/d/1AV4iI

baldurk commented 3 years ago

I'm not sure I follow, did you run through RenderDoc as well as enable the validation layers? You shouldn't do that, as RenderDoc can change the behaviour, and if your application does have invalid use then it's already undefined to use RenderDoc so the behaviour will be unpredictable. You should enable the validation and run your application with it directly.
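One way to run the application directly with validation enabled is through the Vulkan loader's environment variable. The following is a minimal Python sketch under stated assumptions: the Khronos validation layer (`VK_LAYER_KHRONOS_validation`) is installed, for example via the Vulkan SDK, and the loader honours `VK_INSTANCE_LAYERS`. The helper name `validation_env` is hypothetical, and the list separator is assumed to match the platform's path separator.

```python
import os

# Name of the Khronos validation layer shipped with the Vulkan SDK.
VALIDATION_LAYER = "VK_LAYER_KHRONOS_validation"


def validation_env(base_env=None, sep=os.pathsep):
    """Return a copy of the environment with the validation layer enabled.

    Layers already listed in VK_INSTANCE_LAYERS are kept, and the
    validation layer is appended only if it is not listed yet.
    """
    env = dict(os.environ if base_env is None else base_env)
    layers = [name for name in env.get("VK_INSTANCE_LAYERS", "").split(sep) if name]
    if VALIDATION_LAYER not in layers:
        layers.append(VALIDATION_LAYER)
    env["VK_INSTANCE_LAYERS"] = sep.join(layers)
    return env
```

The application would then be started on its own, outside RenderDoc, for example with `subprocess.run([path_to_exe], env=validation_env())`, so that any validation messages come from the application's own API use.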

mtsr commented 3 years ago

Ahhh, I misunderstood. I'm pretty new to graphics programming, so I hadn't realised there was more I could do on the API side itself. I'll look into it. Thanks.

mtsr commented 3 years ago

Hi Baldur,

I've checked with renderdoc API validation, which gives me these. The checked ones I'm sure are because of renderdoc itself (https://github.com/baldurk/renderdoc/issues/582#issuecomment-295180765), the other ones I'm not 100%, but still fairly sure.

I've also verified that we're running with full vulkan validation layers in our debug builds and there are no validation errors from that.

baldurk commented 3 years ago

Loading the VK_ERROR_DEVICE_LOST.rdc I see the two draws have descriptor set 3 bound with two used descriptors containing no buffer, binding 7: StandardMaterial_reflectance_var and binding 12: StandardMaterial_emissive_var. If these were updated before the frame and it's a RenderDoc bug that they're empty I can't tell why from the capture. In the capture that doesn't break these descriptors don't exist.

Can you share the program executable that produced VK_ERROR_DEVICE_LOST.rdc? Then I can run it and reproduce from there.

mtsr commented 3 years ago

Thanks for looking into that!

I actually saw that on the pipeline state view, but since I experienced no other issues, I didn't pay it enough attention. I'll take my debugger through it, because they should be bound to a zeroed out buffer, during this pass, not before, afaik. Edit: Not a zeroed out buffer, they should just have default values, I need to figure out why those are not bound.

Let's hope this is the issue, so that I can fix my code, and maybe you have an inkling of what could cause renderdoc to hang on this, whereas the original application ran fine.

baldurk commented 3 years ago

I'm not sure what you mean that you experienced no other issues, I thought you were seeing the device lost? I can certainly reproduce it.

I didn't see any descriptor updates in the capture so I assumed they were being initialised on startup and not within the captured frame. If the descriptors haven't been updated then that's definitely invalid since they are used, and I'm surprised the validation layers don't catch it. They shouted immediately when I ran them on the replay. If it is invalid behaviour then effectively all bets are off and it may well work by coincidence in the application and not renderdoc due to unknown variables.

If you share the application I can take a look and see, it should be easy to pick up if the descriptor sets are being written and if so why renderdoc doesn't contain the proper bindings. Otherwise I can wait until you've looked into it from your side.

mtsr commented 3 years ago

I meant my application is working fine. I do indeed experience the renderdoc issue.

The application is here, in two versions, debug (additionally uses validation layers) and release. https://gofile.io/d/1UQ3g5

kvark commented 3 years ago

If some resources are not bound, and the VVLs appear to miss this for some reason, and the application keeps running, then this would be a wgpu issue (as opposed to RenderDoc). I'm surprised though, I thought binding checks were one of the first things the VVLs got, and should be solid.

baldurk commented 3 years ago

Thanks, that showed the issue immediately. It was related to the Unexpected descriptor type errors before - vulkan has a little feature that allows you to overrun descriptor writes from one binding to another as long as the type is compatible (in this case all buffers). I had implemented that but only assuming tightly-packed bindings, i.e. binding 0 rolling over into binding 1. In your case the bindings were sparse so instead of 0 rolling over into 3 it rolled over into where 1 should be and threw that error - also not properly then recording the results of the update. I guess this feature is rarely used so no-one had run into this before.

That commit should fix the rollover behaviour to properly advance to the next binding.
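The rollover rule described above can be modelled in a few lines. This is a simplified Python sketch, not RenderDoc's code: the function name and the dictionary-based layout are hypothetical, the check that descriptor types are compatible is omitted, and bindings missing from the layout are treated as having zero descriptors and are skipped.

```python
def apply_descriptor_write(layout, dst_binding, dst_array_element, descriptors):
    """Distribute one descriptor write across consecutive bindings.

    layout maps binding number -> number of array elements in that binding.
    Returns a dict mapping (binding, element) -> descriptor.
    """
    written = {}
    binding, element = dst_binding, dst_array_element
    last_binding = max(layout, default=-1)
    for descriptor in descriptors:
        # Advance past bindings that are full or absent from the layout.
        while layout.get(binding, 0) <= element:
            binding += 1
            element = 0
            if binding > last_binding:
                raise ValueError("write overruns the end of the descriptor set")
        written[(binding, element)] = descriptor
        element += 1
    return written
```

With a sparse layout such as `{0: 1, 3: 1}`, a write of two descriptors starting at binding 0 places the second one in binding 3. A version that assumes tightly-packed bindings would instead look for it at binding 1, which is the mismatch described in the comment above.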

kvark commented 3 years ago

That makes sense. Yeah, vulkan descriptor updates are tricky. I'm surprised it only showed up to be a problem now in RD. Thanks for the quick fix!

mtsr commented 3 years ago

Thank you very much!

The more I use renderdoc, the more I come to depend on it. Thanks for your hard work!

mtsr commented 3 years ago

I just built the latest commit and can confirm the fix works, also for other cases using Bevy.

mtsr commented 3 years ago

Hi Baldur,

Any idea when you might make a new release? We're seeing a pretty substantial number of users of Bevy wanting to use renderdoc but running into this bug. I point them towards building it, but not everyone does.

Thanks. Cheers,

Jonas

baldurk commented 3 years ago

I've been making releases regularly every 2 months for the past year or so. I don't make stable releases for particular bugfixes unless it's something really serious, otherwise I'd end up doing releases at an infeasible cadence.

Builds are made nightly from the latest branch so if you want a pre-made build you can download one of them.

mtsr commented 3 years ago

That's fine. Thanks for pointing out the nightlies, I hadn't seen those!