Closed mtsr closed 3 years ago
FYI I'm not working for the next couple of weeks so I can look into this when I get back. In the meantime I would recommend running the validation layers to ensure you don't have any invalid API use, as that can cause RenderDoc to hang or crash as it doesn't handle invalid use of the API.
For the repro case I should hopefully be able to tell what's wrong from the capture. If I need to run the application I can, and I've been able to compile things with rust before but I've found it a pain. So a compiled executable if you can share it would be very welcome.
Thank you for your quick response. I will try the validation layers first and post back here.
If you end up needing the executable, please let me know and I'll send it to you then.
In the meantime, enjoy the time off (if that's what it is)!
I ran with API validation, but the only thing that stood out to me was:
Core PID 19900: [20:51:30] vk_common.cpp(981) - Error - Unexpected descriptor type
Core PID 19900: [20:51:30] vk_common.cpp(981) - Error - Unexpected descriptor type
I checked extensively and could not find the application writing that specific descriptor type, or any other invalid descriptor type for that matter. (I found the type through a small code change that logs it; it is equal to the maximum integer.)
This is the full diagnostic log: API validation.txt
And this is the capture: https://gofile.io/d/1AV4iI
I'm not sure I follow. Did you run through RenderDoc as well as enable the validation layers? You shouldn't do that, as RenderDoc can change the behaviour, and if your application does have invalid use then it's already undefined to use RenderDoc, so the behaviour will be unpredictable. You should enable validation and run your application with it directly.
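For readers following along, one way to run an application with validation enabled and without RenderDoc attached is to ask the Vulkan loader for the Khronos validation layer through the VK_INSTANCE_LAYERS environment variable. The sketch below assumes the Vulkan SDK (or another source of VK_LAYER_KHRONOS_validation) is installed; the helper names validation_env and run_with_validation are made up for this illustration and are not part of RenderDoc, Bevy or wgpu.

```python
import os
import subprocess

# Name of the standard Khronos validation layer.
VALIDATION_LAYER = "VK_LAYER_KHRONOS_validation"


def validation_env(base_env=None):
    """Return a copy of the environment that requests the validation layer.

    VK_INSTANCE_LAYERS is read by the Vulkan loader; entries are separated
    by ';' on Windows and ':' elsewhere, which matches os.pathsep.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("VK_INSTANCE_LAYERS", "")
    layers = [name for name in existing.split(os.pathsep) if name]
    if VALIDATION_LAYER not in layers:
        layers.append(VALIDATION_LAYER)
    env["VK_INSTANCE_LAYERS"] = os.pathsep.join(layers)
    return env


def run_with_validation(command, cwd=None):
    """Launch the application directly (no RenderDoc) with validation on."""
    return subprocess.run(command, cwd=cwd, env=validation_env()).returncode
```

Validation messages then go to wherever the layer is configured to report them (by default the application's standard output or debugger output), so the application should be started from a terminal.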
Ahhh, I misunderstood. I'm pretty new to graphics programming, so I hadn't realised there was more I could do on the API side itself. I'll look into it. Thanks.
Hi Baldur,
I've checked with renderdoc API validation, which gives me these. The checked ones I'm sure are because of renderdoc itself (https://github.com/baldurk/renderdoc/issues/582#issuecomment-295180765), the other ones I'm not 100%, but still fairly sure.
I've also verified that we're running with full vulkan validation layers in our debug builds and there are no validation errors from that.
Loading the VK_ERROR_DEVICE_LOST.rdc, I see the two draws have descriptor set 3 bound with two used descriptors containing no buffer: binding 7 (StandardMaterial_reflectance_var) and binding 12 (StandardMaterial_emissive_var). If these were updated before the frame and it's a RenderDoc bug that they're empty, I can't tell why from the capture. In the capture that doesn't break, these descriptors don't exist.
Can you share the program executable that produced VK_ERROR_DEVICE_LOST.rdc? Then I can run it and reproduce from there.
Thanks for looking into that!
I actually saw that on the pipeline state view, but since I experienced no other issues, I didn't pay it enough attention. I'll take my debugger through it, because they should be bound to a zeroed out buffer, during this pass, not before, afaik. Edit: Not a zeroed out buffer, they should just have default values, I need to figure out why those are not bound.
Let's hope this is the issue, so that I can fix my code, and maybe you have an inkling of what could cause renderdoc to hang on this, whereas the original application ran fine.
I'm not sure what you mean that you experienced no other issues, I thought you were seeing the device lost? I can certainly reproduce it.
I didn't see any descriptor updates in the capture so I assumed they were being initialised on startup and not within the captured frame. If the descriptors haven't been updated then that's definitely invalid since they are used, and I'm surprised the validation layers don't catch it. They shouted immediately when I ran them on the replay. If it is invalid behaviour then effectively all bets are off and it may well work by coincidence in the application and not renderdoc due to unknown variables.
If you share the application I can take a look and see, it should be easy to pick up if the descriptor sets are being written and if so why renderdoc doesn't contain the proper bindings. Otherwise I can wait until you've looked into it from your side.
I meant my application is working fine. I do indeed experience the renderdoc issue.
The application is here, in two versions, debug (additionally uses validation layers) and release. https://gofile.io/d/1UQ3g5
If some resources are not bound, the VVLs miss this for some reason, and the application keeps running, then this would be a wgpu issue (as opposed to a RenderDoc one). I'm surprised though; I thought binding checks were among the first things the VVLs got, and should be solid.
Thanks, that showed the issue immediately. It was related to the Unexpected descriptor type errors from before. Vulkan has a little feature that allows a descriptor write to overrun from one binding into the next, as long as the type is compatible (in this case all buffers). I had implemented that, but only assuming tightly-packed bindings, i.e. binding 0 rolling over into binding 1. In your case the bindings were sparse, so instead of 0 rolling over into 3 it rolled over into where 1 should be and threw that error, and then did not properly record the results of the update. I guess this feature is rarely used, so no-one had run into this before.
That commit should fix the rollover behaviour to properly advance to the next binding.
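To make the rollover behaviour concrete, here is a small standalone model of it. This is not RenderDoc's code: the function names, the dictionary-based layout and the example binding numbers are all assumptions made for the illustration. apply_write skips binding numbers that are absent from the layout, while apply_write_dense mimics the tightly-packed assumption described above and fails on a sparse layout.

```python
def apply_write(layout, dst_binding, dst_element, descriptors):
    """Distribute one descriptor write, rolling over into later bindings.

    layout maps binding number -> descriptor count. Binding numbers that
    are missing (or have a count of zero) are skipped on rollover.
    Returns {(binding, array_element): descriptor}.
    """
    bindings = sorted(b for b, count in layout.items() if count > 0)
    if dst_binding not in bindings:
        raise ValueError(f"binding {dst_binding} is not in the layout")
    idx = bindings.index(dst_binding)
    element = dst_element
    written = {}
    for desc in descriptors:
        # Current binding is full: advance to the next declared binding.
        while element >= layout[bindings[idx]]:
            idx += 1
            if idx >= len(bindings):
                raise ValueError("write overruns the end of the layout")
            element = 0
        written[(bindings[idx], element)] = desc
        element += 1
    return written


def apply_write_dense(layout, dst_binding, dst_element, descriptors):
    """Naive variant: assumes binding N always rolls over into N + 1."""
    binding, element = dst_binding, dst_element
    written = {}
    for desc in descriptors:
        if element >= layout.get(binding, 0):
            binding, element = binding + 1, 0
        if binding not in layout:
            raise KeyError(f"binding {binding} is not in the layout")
        written[(binding, element)] = desc
        element += 1
    return written


# Hypothetical sparse layout: one buffer each at bindings 0, 3 and 7.
sparse_layout = {0: 1, 3: 1, 7: 1}
result = apply_write(sparse_layout, 0, 0, ["buf_a", "buf_b", "buf_c"])
```

With the sparse layout, apply_write places the three descriptors at bindings 0, 3 and 7, whereas apply_write_dense tries to write the second descriptor to binding 1 and raises, which is the shape of failure described in the comment above.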
That makes sense. Yeah, vulkan descriptor updates are tricky. I'm surprised it only showed up to be a problem now in RD. Thanks for the quick fix!
Thank you very much!
The more I use renderdoc, the more I come to depend on it. Thanks for your hard work!
I just built the latest commit and can confirm the fix works, also for other cases using Bevy.
Hi Baldur,
Any idea when you might make a new release? We're seeing a pretty substantial number of users of Bevy wanting to use renderdoc but running into this bug. I point them towards building it, but not everyone does.
Thanks. Cheers,
Jonas
I've been making releases regularly every 2 months for the past year or so. I don't make stable releases for particular bugfixes unless it's something really serious, otherwise I'd end up doing releases at an infeasible cadence.
Builds are made nightly from the latest branch so if you want a pre-made build you can download one of them.
That's fine. Thanks for pointing out the nightlies, I hadn't seen those!
Description
After capturing a frame from the 3d_scene example of Bevy (specific commit), renderdoc logs VK_ERROR_DEVICE_LOST errors in its diagnostic log, fails to show the texture view and hangs with "Please wait, working..." if I try to inspect a different event from the Event Browser, or do some other things. This screenshot shows where renderdoc really hangs, after this.
This is renderdoc's diagnostic log from before I cause it to hang.
VK_ERROR_DEVICE_LOST.txt
To try to get a better idea of what's happening I compiled the latest renderdoc from git (commit cd5d0ede440aff1fcb44121a21c088b77ec64285) and tried it again with that, with the same results. Visual Studio (2019, because the right Windows SDK for 2015 isn't on MS' site anymore) gives me this stack when the first exception occurs (on running the application directly):
and this on loading a saved capture:
followed by
followed by more (a bit too many to paste). After a number of exceptions renderdoc starts running without exception again, until I do something to trigger the actual hang, such as clicking a drawcall in the event browser, after which renderdoc hangs without exception.
I suspect there is a problem with the application I'm capturing from, but I was not expecting renderdoc to hang.
Steps to reproduce
Here is a capture from the application that reproduces the error every time for me (VK_ERROR_DEVICE_LOST.rdc).
The other file (no_VK_ERROR_DEVICE_LOST.rdc) is a capture from the last commit on bevy main (commit b6be8a5314e027a0b0f3ee48d04c14b52fe74676) that doesn't exhibit this problem. That puts the offending commit (by my hand) at 45b2db70705da24a89426e6b6e77d603a3983025, if that might help in some way.
This is the renderdoc diagnostic log for this second capture.
no_VK_ERROR_DEVICE_LOST.txt
Alternatively, with rust installed, one could clone https://github.com/bevyengine/bevy, checkout f520a341d5737600dbf89015b7729109d67cf041 (the HEAD of main at the time of writing) and build the application with cargo build --example 3d_scene. It can then be launched from target\debug\examples\3d_scene.exe with the root of the project as working dir and CARGO_MANIFEST_DIR=<path to working dir>.
I can also send over a compiled executable if desired.
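The build and launch steps above could be scripted roughly as follows. This is only a sketch: it assumes cargo is on the PATH and that the bevy checkout already exists at the path passed in. The helper names example_launch and build_and_run are invented for the illustration.

```python
import os
import subprocess
from pathlib import Path


def example_launch(repo_root):
    """Return (command, working_dir, env) for launching the built example."""
    root = Path(repo_root)
    exe = root / "target" / "debug" / "examples" / "3d_scene.exe"
    # The example is run from the project root with CARGO_MANIFEST_DIR
    # pointing at that same directory.
    env = dict(os.environ, CARGO_MANIFEST_DIR=str(root))
    return [str(exe)], str(root), env


def build_and_run(repo_root):
    """Build the 3d_scene example with cargo, then launch it."""
    subprocess.run(
        ["cargo", "build", "--example", "3d_scene"],
        cwd=repo_root,
        check=True,
    )
    command, cwd, env = example_launch(repo_root)
    return subprocess.run(command, cwd=cwd, env=env).returncode
```

To capture with RenderDoc instead of launching directly, the same executable path, working directory and environment variable would be entered in RenderDoc's launch settings.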
Environment
Edit: The issue seems similar to #2216. I've tried to include as much detail as I can, but I will provide what other details I can if asked.