D3D12 Crash during replay

redorav commented 3 years ago

Description

I take a capture successfully but during replay renderdoc crashes, accessing a null command list in d3d12_initstate.cpp. I think it might be related to Reserved Resources but I don't know for sure.

// transition to copy dest
if(!barriers.empty())
  list->ResourceBarrier((UINT)barriers.size(), &barriers[0]);

This is the callstack

renderdoc.dll!D3D12ResourceManager::Apply_InitialState(ID3D12DeviceChild * live, const D3D12InitialContents & data) Line 1181   C++
renderdoc.dll!ResourceManager<D3D12ResourceManagerConfiguration>::ApplyInitialContents() Line 1351  C++
renderdoc.dll!WrappedID3D12Device::ApplyInitialContents() Line 1291 C++
renderdoc.dll!WrappedID3D12Device::ReadLogInitialisation(RDCFile * rdc, bool storeStructuredBuffers) Line 3837  C++
renderdoc.dll!ReplayController::PostCreateInit(IReplayDriver * device, RDCFile * rdc) Line 2042 C++
renderdoc.dll!ReplayController::CreateDevice(RDCFile * rdc, const ReplayOptions & opts) Line 2009   C++
renderdoc.dll!CaptureFile::OpenCapture(const ReplayOptions & opts, std::function<void __cdecl(float)> progress) Line 364    C++
qrenderdoc.exe!ReplayManager::run(int proxyRenderer, const QString & capturefile, const ReplayOptions & opts, std::function<void __cdecl(float)> progress) Line 450 C++

Steps to reproduce

I need to ask permission to get a capture to you.

Environment

RenderDoc version: 05e7d1eab9e7d52b8c1524869ccd4d2fa478bed8
Operating System: Windows 10
Graphics API: D3D12

baldurk commented 3 years ago

As I mentioned in the other issue I can do absolutely nothing with this report as it stands because there's no useful information here. You haven't provided any more details than previously, and you also haven't mentioned whether you've checked the diagnostic log, as I asked, to see if it mentions a device lost error before the crash. I'm assuming it is, because a lot of device lost errors look exactly like this with wildly different causes.

Tagging this as need more info right now because nothing can happen until you either find out if you can share a capture, or if you confirm that you can't share one then provide much more information such as at least the diagnostic log and the output from running with the D3D debug layers.

redorav commented 3 years ago

I apologize, I've been having issues all morning with my remote setup and I felt like putting this in and giving more details later would get the ball rolling earlier, but you're right. Here's the replay log:

RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::91 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::92 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::2350 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::3264 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::3446 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::5299 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::12196 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::12272 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::12275 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::12276 from upload to default for efficient replay
RDOC 007076: [10:43:10] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::12621 from upload to default for efficient replay
RDOC 007076: [10:43:11] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::18803 from upload to default for efficient replay
RDOC 007076: [10:43:11] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::18804 from upload to default for efficient replay
RDOC 007076: [10:43:11] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::18807 from upload to default for efficient replay
RDOC 007076: [10:43:11] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::18808 from upload to default for efficient replay
RDOC 007076: [10:43:11] d3d12_device_wrap.cpp(1325) - Log - Remapping committed resource ResourceId::18809 from upload to default for efficient replay
RDOC 007076: [10:43:16] d3d12_device.cpp(3357) - Error - Assertion failed: '(hr) == (((HRESULT)0L))' (hr=DXGI_ERROR_DEVICE_REMOVED, ((HRESULT)0L)=S_OK)

So it is as you suspected, but I'm not sure what would cause this. I'll try validation next. Is there anything else I can do on my side? I'm already going through the motions to share the capture.
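For a long replay log, the entry that matters (here the failed assertion with DXGI_ERROR_DEVICE_REMOVED) can be picked out mechanically. The following is only a minimal sketch: `FirstErrorLine` is a hypothetical helper, not part of RenderDoc, and it assumes the log keeps the `file(line) - Level - message` layout seen above.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not a RenderDoc function): returns the first line of a
// log that carries the " - Error - " level tag, or an empty string if no line
// does. Assumes one log entry per line, as in the replay log above.
std::string FirstErrorLine(const std::string &logText)
{
  std::istringstream in(logText);
  std::string line;
  while(std::getline(in, line))
  {
    if(line.find(" - Error - ") != std::string::npos)
      return line;
  }
  return std::string();
}
```

Run over the log above, this would return the `d3d12_device.cpp(3357)` assertion line, which is the first point at which the device removal becomes visible.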

baldurk commented 3 years ago

Yeh that makes sense then, but means we need to find the source of the device lost. Running with validation is definitely the first step, if it's related to sparse resources it may point directly to the problem (or at least to something tangible to track down).

redorav commented 3 years ago

Just as a quick heads up, I did some investigation and tried different things. The validation layers were showing a couple of things:

An invalid call to SetStablePowerState
An invalid call to UpdateTileMappings with numResourceRegions == 0

There were a few warnings removed such as clear values not matching resource and other things, nothing serious so that didn't seem to take me anywhere, and the crash still happens. The submission process is still ongoing so I'll reply once that's more advanced.

baldurk commented 3 years ago

SetStablePowerState should only be called in replay if you fetch counters - either with the drawcall timings in the event browser or the performance counter viewer. Are you doing that? Do you still get problems if you don't do that? It is a valid call (there are no parameters) but there is the completely baffling and pointless requirement that you enable """developer mode""" in windows otherwise it will remove the device.

UpdateTileMappings shouldn't be called with 0 for NumResourceRegions unless that's a replayed call from the application. The internal calls have a hardcoded value of 1 for that parameter since I bind one region at a time. I'm not sure why this would be invalid (though it would be degenerate). Can you get the callstack where that message is firing to see where it comes from?

redorav commented 3 years ago

Ah sorry, I think there's been a misunderstanding on my part, I meant I enabled the validation layers on our application side. Those two things were flagged as invalid during normal application running, and disabling them and taking another capture still showed the crash on replay.

I think what you mean however is enable validation layers on renderdoc? How would I do that?

baldurk commented 3 years ago

Oh right, yes I was assuming the application was already valid so I'd be interested to see the output from running on replay where the actual device lost and crash is happening.

If you build renderdoc in development rather than release and run in a debugger then you'll get the debug layer output from the replay.

redorav commented 3 years ago

Ok, I think we're finally getting somewhere. I've tried to highlight in bold the most relevant part.

D3D12 WARNING: ID3D12Device::CreateGraphicsPipelineState: The Pixel Shader expects a Render Target View bound to slot 0, but the PSO indicates that none will be bound. This is OK, as writes of an unbound Render Target View are discarded. It is also possible the developer knows the data will not be used anyway. This is only a problem if the developer actually intended to bind a Render Target View here. [ STATE_CREATION WARNING #679: CREATEGRAPHICSPIPELINESTATE_RENDERTARGETVIEW_NOT_SET]
RDOC 022520: [13:42:43] d3d12_initstate.cpp( 949) - Debug - D3D12 not implemented - Creating init states for resources
RDOC 022520: [13:42:43] resource_manager.h(1349) - Debug - Applying initial contents
D3D12 ERROR: ID3D12CommandQueue::UpdateTileMappings: pHeapRangeStartOffsets[0] is 508 and the number of Tiles in the corresponding range is 12. Together these are too large given that the number of Tiles available in the heap is 512. [ EXECUTION ERROR #493: UPDATETILEMAPPINGS_INVALID_PARAMETER]
D3D12: Removing Device.
D3D12 WARNING: ID3D12Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DRIVER_INTERNAL_ERROR: There is strong evidence that the driver has performed an undefined operation; but it may be because the application performed an illegal or undefined operation to begin with.). [ EXECUTION WARNING #233: DEVICE_REMOVAL_PROCESS_POSSIBLY_AT_FAULT]
Exception thrown at 0x00007FFFACD6D759 in qrenderdoc.exe: Microsoft C++ exception: _com_error at memory location 0x000000F1959FCE40.
Exception thrown at 0x00007FFFACD6D759 in qrenderdoc.exe: Microsoft C++ exception: _com_error at memory location 0x000000F1959FD1B8.
Exception thrown at 0x00007FFFACD6D759 in qrenderdoc.exe: Microsoft C++ exception: _com_error at memory location 0x000000F1959FDF70.
Exception thrown at 0x00007FFFACD6D759 in qrenderdoc.exe: Microsoft C++ exception: _com_error at memory location 0x000000F1959FE118.
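To spell out the arithmetic behind the UpdateTileMappings error: a range of 12 tiles starting at heap tile offset 508 would need tiles 508 through 519, but a 512-tile heap only has tiles 0 through 511. A minimal sketch of that bounds check follows. `HeapRangeFits` is a hypothetical helper written for illustration; it is not a D3D12 or RenderDoc function, and it is only assumed to mirror what the debug layer checks.

```cpp
#include <cstdint>

// Hypothetical helper (not a D3D12 or RenderDoc function): returns true when a
// heap range of `rangeTileCount` tiles starting at `heapRangeStartOffset` lies
// entirely inside a heap holding `heapTileCount` tiles. The comparison is
// arranged so that start + count is never computed and cannot overflow.
constexpr bool HeapRangeFits(uint64_t heapRangeStartOffset, uint64_t rangeTileCount,
                             uint64_t heapTileCount)
{
  return heapRangeStartOffset <= heapTileCount &&
         rangeTileCount <= heapTileCount - heapRangeStartOffset;
}

// Values from the debug layer message: offset 508, 12 tiles, 512-tile heap.
static_assert(!HeapRangeFits(508, 12, 512), "508 + 12 = 520 exceeds the 512 tiles in the heap");
// With the same offset, at most 4 tiles (508..511) would fit.
static_assert(HeapRangeFits(508, 4, 512), "tiles 508..511 are inside the heap");
```

A check like this on the application side, placed wherever tile mappings are recorded, might help narrow down which mapping ends up out of range.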

The callstack (ignoring the Qt bits) is

renderdoc.dll!rdcassert(const char * msg, const char * file, unsigned int line, const char * func) Line 38 C++
renderdoc.dll!WrappedID3D12Device::GetNewList() Line 3357 C++
renderdoc.dll!WrappedID3D12Device::GetInitialStateList() Line 3386 C++
renderdoc.dll!D3D12ResourceManager::Apply_InitialState(ID3D12DeviceChild * live, const D3D12InitialContents & data) Line 1154 C++
renderdoc.dll!ResourceManager<D3D12ResourceManagerConfiguration>::ApplyInitialContents() Line 1357 C++
renderdoc.dll!WrappedID3D12Device::ApplyInitialContents() Line 1291 C++
renderdoc.dll!WrappedID3D12Device::ReadLogInitialisation(RDCFile * rdc, bool storeStructuredBuffers) Line 3837 C++
renderdoc.dll!D3D12Replay::ReadLogInitialisation(RDCFile * rdc, bool storeStructuredBuffers) Line 188 C++
renderdoc.dll!ReplayController::PostCreateInit(IReplayDriver * device, RDCFile * rdc) Line 2042 C++
renderdoc.dll!ReplayController::CreateDevice(RDCFile * rdc, const ReplayOptions & opts) Line 2009 C++
renderdoc.dll!CaptureFile::OpenCapture(const ReplayOptions & opts, std::function<void __cdecl(float)> progress) Line 364 C++
qrenderdoc.exe!ReplayManager::run(int proxyRenderer, const QString & capturefile, const ReplayOptions & opts, std::function<void __cdecl(float)> progress) Line 450 C++
qrenderdoc.exe!ReplayManager::OpenCapture::__l2::() Line 56 C++

baldurk commented 3 years ago

OK that definitely sounds like something is broken but piecing together where it went wrong might be difficult. That code is applying the tile mappings from prior to the start of the capture, so the tracking may have gone wrong somewhere at any unknown previous point.

Do you mind if we take this to email or discord/IRC from here? This will probably involve a lot of back and forth because chances are you'd need to debug your application side to see where the tracking broke.

redorav commented 3 years ago

Sure, I've emailed you
