Closed StefanPoelloth closed 3 months ago
@StefanPoelloth thanks for the report. The "resource" field is a new feature. Previously, submit time validation did not have access to resource handles. Now, when a race condition is detected in some memory location, it has access to the resource associated with that memory region. An API dump would be helpful; can you upload it to the LunarG sharing portal https://share.lunarg.com/ ?
Any chance that the application uses Vulkan memory aliasing (the same memory object bound to multiple resources)?
@artem-lunarg Yes, we use VMA for allocations. I've just created an account, but it tells me "You are not setup to access File Share.".
@StefanPoelloth sorry for the confusion, it should be done in a different way. Could you provide your email address? It will be used to create an invitation for the upload.
@artem-lunarg needs to create a folder for sharing files and then invite you (via email) to the folder.
The invitation is sent.
Thanks, I received the API dump.
@StefanPoelloth Is it possible to create a GFXReconstruct capture (available as the Frame Capture layer in vkconfig), so I can debug the actual Vulkan commands? Assuming it's not too sensitive for sharing.
@artem-lunarg I've uploaded a GFXReconstruct capture. I had to disable page_guard though; with page_guard the process just froze before displaying an image. Hope that's fine.
Thanks, I can run the capture. Unfortunately I cannot reproduce the issue (no sync validation errors). I tried both the latest VVL code and the commit mentioned in the issue description. Maybe there is some difference compared to running the actual app. If I have no luck, I will use the API dump for the investigation; that's also very helpful.
@artem-lunarg I've uploaded a longer capture with the validation errors logged. The problem is a bit difficult to reproduce, but I made sure that the error happened at least 3 times while recording the capture file.
@StefanPoelloth from the attached log file it looks like GPU-Assisted validation is enabled (validation layer: Validation Warning: [ WARNING-GPU-Assisted-Validation ]). The usual recommendation is to run Synchronization Validation separately; in theory the two should work together, but in practice that combination is not well tested. Is the synchronization validation error reproducible if only the Synchronization preset is enabled?
@artem-lunarg I couldn't reproduce it without GPU-assisted validation. Until now I have always run it with GPU-AV.
Okay, then it might be a poor interaction between SyncVal and GPU-AV. GPU-assisted validation instruments shaders and adds new descriptor sets. These additional descriptor sets are not accounted for by SyncVal, so it can misinterpret the resources used by those sets. The timeline semaphore feature used by GPU-AV is also not supported by SyncVal yet (support is planned for later this year). It's possible that the entire SyncVal message is a false positive, not only that the "resource" field is wrong. Sorry for this issue, but currently there are not many guarantees about how GPU-AV interacts with SyncVal. Most of the effort goes into providing a solid baseline for each of them separately. Hopefully we can improve the interaction in the future.
@artem-lunarg I was able to reproduce it with GPU_BASED_NONE, it just took much longer. I used these settings:
vk_layer_settings.txt
@StefanPoelloth Thanks for the confirmation. I will continue to look into this. If that's a bug it would be nice to fix it for the new SDK.
@StefanPoelloth So far I can't reproduce the issue. I will spend some time analyzing the code, but if I have no luck it might be stuck for a while until we can get a good repro case.
The "resource" field is a new feature, but hazard detection should not be affected by it (you can ignore the "resource" part of the message). It still might be a good idea to check whether the reported race condition actually happens. The message says that after the initial write to a buffer, a barrier was set that allows the VERTEX/COMPUTE shader to READ, but the last access (submitted usage) was a COPY READ (the source parameter of vkCmdCopyBuffer), and the COPY stage is not protected by that barrier.
There is a chance that the GpuArray MatrixStore _primaryArray name is detected properly, so it can be a hint that it's about this buffer. seq_no can give some insights too. It is the index of the command in the command buffer. It indexes only commands that perform memory accesses (e.g. CmdCopyBuffer/CmdDraw/CmdDispatch but not CmdSetViewport). Because seq_no is quite small in the reported messages, it should be possible to manually find the pair of commands that create the race condition.
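As an illustration of the seq_no description above, here is a toy Python sketch of that numbering. It is not VVL code. The set of "access" commands contains only the examples named in the comment, and starting the count at 1 is an assumption; the thread does not say where numbering starts or whether other commands (such as barriers) advance the counter.

```python
# Toy model of seq_no numbering, NOT the validation layer's implementation.
# Assumptions: only the command kinds named in the thread count as
# memory-access commands, and numbering starts at 1.
ACCESS_COMMANDS = {"CmdCopyBuffer", "CmdDraw", "CmdDispatch"}


def assign_seq_no(commands, access_commands=ACCESS_COMMANDS, first=1):
    """Return (seq_no, command) pairs for commands that access memory."""
    numbered = []
    seq_no = first
    for name in commands:
        if name in access_commands:
            numbered.append((seq_no, name))
            seq_no += 1
        # State-only commands such as CmdSetViewport are skipped entirely.
    return numbered


# Hypothetical command stream, for illustration only.
recorded = ["CmdSetViewport", "CmdCopyBuffer", "CmdSetViewport", "CmdDispatch", "CmdDraw"]
for seq_no, name in assign_seq_no(recorded):
    print(seq_no, name)
```

With a small seq_no, counting access commands from the start of the command buffer in the API dump in this way should narrow the search to a handful of candidates.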
Can confirm using the API dump that the initial write (prior_usage: SYNC_COPY_TRANSFER_WRITE, seq_no: 5, reset_no: 8) was into 0x6eea0c0000000185[GpuArray MatrixStore _primaryArray], and not into 0xad2b50000000316[PrimaryCullPass occludeds]. Line 1133339 in the dump file.
@artem-lunarg Yes, the 0xad2b50000000316[PrimaryCullPass occludeds] is only written from the compute shader.
We have done extensive analysis and I'm pretty confident that all barriers (for buffer 0x17758085a10[GpuArray MatrixStore _primaryArray]) are correctly placed. I've created a short API dump with 7 frames where frame 7 is giving me this sync hazard:
My analysis shows that the buffer 0x17758085a10[GpuArray MatrixStore _primaryArray] is written to in frame 7 (and not written in frame 6). The write is protected with a barrier before: vertex|compute|copy/transferRead|storageRead -> copy/transferWrite and two barriers after: copy/transferWrite -> compute/storageRead and copy/transferWrite -> vertex/storageRead.
My conclusion is that all barriers are correct and the validation error is a false positive.
My API dump: dump6.zip
My complete dump analysis for 0x17758085a10 for the 7 frames:
I think the GPU-AV tag should be removed since it's happening without GPU-AV.
@StefanPoelloth Thanks for the details! I have also analyzed the barriers based on the original dump and they look correct to me. So there could be two problems here: a false-positive error and incorrect labeling of the resource.
Additional documentation. Critical sequence from original dump for buffer 6EEA0C0000000185 that syncval complains about:
Begin: cmdbuf 000002D4EF775860
    CmdCopyBuffer: dstBuffer [6EEA0C0000000185] // prior access
    buffer barrier [6EEA0C0000000185]:
        srcStageMask : VK_PIPELINE_STAGE_2_COPY_BIT
        srcAccessMask : VK_ACCESS_2_TRANSFER_WRITE_BIT
        dstStageMask : VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT
        dstAccessMask : VK_ACCESS_2_SHADER_STORAGE_READ_BIT
    buffer barrier [6EEA0C0000000185]:
        srcStageMask : VK_PIPELINE_STAGE_2_COPY_BIT
        srcAccessMask : VK_ACCESS_2_TRANSFER_WRITE_BIT
        dstStageMask : VK_PIPELINE_STAGE_2_VERTEX_SHADER_BIT
        dstAccessMask : VK_ACCESS_2_SHADER_STORAGE_READ_BIT
End: cmdbuf 000002D4EF775860
QueueSubmit: cmdbuf 000002D4EF775860
Begin: cmdbuf 000002D4EF7FD510
    buffer barrier [6EEA0C0000000185]:
        srcStageMask : VK_PIPELINE_STAGE_2_COPY_BIT
        srcAccessMask : VK_ACCESS_2_TRANSFER_WRITE_BIT
        dstStageMask : VK_PIPELINE_STAGE_2_COPY_BIT
        dstAccessMask : VK_ACCESS_2_TRANSFER_READ_BIT
    CmdCopyBuffer: srcBuffer [6EEA0C0000000185] // last submitted access
End: cmdbuf 000002D4EF7FD510
QueueSubmit: cmdbuf 000002D4EF7FD510
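The sequence above can be replayed against a deliberately simplified read-after-write model to show why the third barrier matters. This is a sketch in Python, not how syncval is implemented. The class, its methods and the tuple names are hypothetical, and the model ignores queues, submits, semaphores and barrier chaining.

```python
# Toy read-after-write tracker for a single buffer, NOT VVL code.
# Assumption: a read is safe only if a barrier issued after the last write
# had that write as its source scope and the read's (stage, access) pair as
# its destination scope.
COPY_WRITE = ("COPY", "TRANSFER_WRITE")
COPY_READ = ("COPY", "TRANSFER_READ")
COMPUTE_READ = ("COMPUTE_SHADER", "SHADER_STORAGE_READ")
VERTEX_READ = ("VERTEX_SHADER", "SHADER_STORAGE_READ")


class BufferState:
    def __init__(self):
        self.last_write = None
        self.write_barriers = set()

    def write(self, access):
        # A new write discards the protection collected for the previous one.
        self.last_write = access
        self.write_barriers = set()

    def barrier(self, src, dst):
        # Only barriers whose source scope matches the last write add protection.
        if self.last_write is not None and src == self.last_write:
            self.write_barriers.add(dst)

    def read_hazard(self, access):
        return self.last_write is not None and access not in self.write_barriers


def replay(include_copy_read_barrier):
    buf = BufferState()
    # cmdbuf 000002D4EF775860
    buf.write(COPY_WRITE)                   # CmdCopyBuffer: dstBuffer (prior access)
    buf.barrier(COPY_WRITE, COMPUTE_READ)
    buf.barrier(COPY_WRITE, VERTEX_READ)
    # cmdbuf 000002D4EF7FD510
    if include_copy_read_barrier:
        buf.barrier(COPY_WRITE, COPY_READ)
    return buf.read_hazard(COPY_READ)       # CmdCopyBuffer: srcBuffer (last submitted access)


print(replay(True))   # False: the copy read is covered by the third barrier
print(replay(False))  # True: only vertex/compute reads are covered
```

In this model the reported hazard appears only when the COPY -> COPY barrier is dropped, which is the state the error message describes (write_barriers lists only the vertex and compute reads) even though the dump shows that barrier being recorded.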
@StefanPoelloth could you clarify a few points: a) You mentioned it's sometimes hard to reproduce. Is the behavior non-deterministic, so that it sometimes runs without errors? b) In the cases where the error is detected, does the application need to run for some time until it happens, or is it usually in the first frames (in the provided messages the errors were reported pretty early, so I'm wondering if there is such a correlation)?
@artem-lunarg
a) With GPU-AV it happened almost always in the first few frames. But when I disabled GPU-AV, it was hard to reproduce. However, I've tweaked my test case so that it happens reliably in the first, let's say, ~20 frames. It also happens very often; in a random run it happened on frames 4, 20, 28, 34, 42, 47... I get around 125 errors in 1000 frames.
The buffers in prior_usage keep changing too, but it doesn't look random. Up until now I have only seen buffers, and sometimes images, that are used in the compute pass right after the update.
Some context:
if (_time % 0.1f < 0.05f)
to move objects, which eventually results in the barriers and buffer copies.
The last point results in the following updates (a random run, not related to any capture):
b) I can reliably reproduce it in the first few (~20) frames on every application run. Usually it happens between frame 3 and 8.
EDIT: I did some testing when to update the _primaryArray:
if (_frame % 2 == 0)
if (_frame % 3 == 0) it happens every 3rd frame.
Thanks!
I wrote a test that simulates how 0x6eea0c0000000185 is updated (it includes buffer copies, and barriers are generated based on whether the buffer was updated, to match the pattern from the dump). So far I can't catch the issue. Interestingly, I get exactly the same error as reported in this issue when I remove the barrier that protects 0x6eea0c0000000185 when it is used as a copy source (in the attached pseudo code it is the first barrier in most frames). Still, in the API dump that barrier is properly generated when necessary.
@StefanPoelloth if you have the opportunity to check the latest VVL code (a new SDK will also be released soon), it would be interesting to see if it changes something. In the latest code, syncval validation of descriptor accesses is disabled by default because it can produce false positives (the old behavior can be enabled by setting the VK_KHRONOS_VALIDATION_SYNCVAL_SHADER_ACCESSES_HEURISTIC environment variable to 1). My impression is that this issue is not related to shader accesses, but it would still be good to get confirmation.
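For A/B testing the two configurations, a small Python helper can build the process environment with or without that variable. This is a minimal sketch: only the variable name comes from the comment above, while the helper name and the application path in the usage comment are hypothetical. It also assumes that leaving the variable unset gives the default (disabled) behavior and that no vk_layer_settings.txt overrides it.

```python
import os

# Variable name taken from the comment above; everything else is illustrative.
SHADER_HEURISTIC_VAR = "VK_KHRONOS_VALIDATION_SYNCVAL_SHADER_ACCESSES_HEURISTIC"


def env_with_shader_heuristic(enabled, base_env=None):
    """Return a copy of the environment with the heuristic switched on or left unset."""
    env = dict(os.environ if base_env is None else base_env)
    if enabled:
        env[SHADER_HEURISTIC_VAR] = "1"
    else:
        # Assumption: an unset variable means the layer uses its default.
        env.pop(SHADER_HEURISTIC_VAR, None)
    return env


# Usage (hypothetical application path):
#   import subprocess
#   subprocess.run(["./my_vulkan_app"], env=env_with_shader_heuristic(True))
```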
Pseudo code I use for investigation for documentation purposes: buffer-frames.txt
@artem-lunarg I tested with a6d3fc5 yesterday and with edcf314 today, and I can confirm it's happening with "submit time validation" enabled and "shader access heuristic" disabled. It's not happening with "submit time validation" disabled and "shader access heuristic" enabled.
I've tried removing rendering code step by step. If I remove all rendering passes (draw, dispatch and everything related) and leave in only the command buffer creation, the buffer updates and EndCommandBuffer, I get the following assertion:
Callstack:
@StefanPoelloth Thanks for the confirmation, that's very helpful. Yes, this entire issue is related to "submit time validation" so makes sense to test with this option always enabled. It's valuable information that it is happening with "shader accesses" disabled - simplifies testing scenarios we might need to consider.
About the assertion: I suspected this scenario and was going to disable "resource" reporting for this SDK release, and this is one more good confirmation.
P.S. I won't be available for the next few weeks, so there might not be much progress here and it probably won't be fixed for this SDK, but it's something we will definitely target for the SDK after that.
@StefanPoelloth do you run syncval alone or together with Core validation enabled (core checkbox)? Any Core validation errors?
Core validation errors can put syncval into an inconsistent state, in which it can produce false positives. The recommendation is to get the application free of core validation errors before enabling syncval. Syncval should not crash in this scenario though.
@artem-lunarg Core validation is disabled, everything except "Synchronization" and "Submit time validation" is disabled.
Thanks, "Handle Wrapping" is good to have enabled though, it's a less tested path when it is disabled.
@artem-lunarg I've reduced the number of API calls drastically and uploaded a new API dump to the share. We're working on repro code that I can share, but unfortunately we've had no luck yet.
@StefanPoelloth thanks for putting effort into this. I have some ideas about what could be wrong with the label and am trying to come up with a repro case. The reported error itself is trickier. Both the API dump and the error message suggest that there was a barrier before the copy, so READs should be protected from previous WRITEs, but somehow it did not work.
@StefanPoelloth this is only if it can be hacked in quickly (I know in some engines it can be tricky, in which case please ignore it): could you run the program without present operations (no QueuePresent, no AcquireNextImage, and QueueSubmit adjusted so that it neither waits on the semaphore from Acquire nor signals the semaphore for present)? I wonder if this scenario also reproduces the error (and the incorrect label). So far my tests have run without presentation.
@artem-lunarg I gave it a quick try and it doesn't seem to happen without present/semaphores.
@artem-lunarg I've uploaded a sample project that reproduces the issue, check out the included readme.
@StefanPoelloth Thank you, I can see the assert from the out-of-bounds access. One question, since I'm not a .NET user: is it possible to continue debugging VVL code after the assert is hit? When I press the Retry button in the assert window, that usually drops into the debugger for a C++ project, but here it terminates the app.
@artem-lunarg I'm assuming you use Visual Studio:
Right click on the RenderDemo project and select Properties. Select "Debug" on the left side and click "Open debug launch profiles UI". Scroll down a bit and check "Enable native code debugging".
This should create the following file: RenderDemo/Properties/launchSettings.json launchSettings.json
It didn't hit an assert for me with a debug build of 9195994 🤔
Thanks, debugging works now. I'm using a debug build. The assert I get is exactly what we need: it's the GetHandleRecord call where the array index is out of bounds; in debug builds the MSVC std::vector implementation triggers an assert.
In the latest code we disabled resource reporting; that's probably why there is no assert and only the validation error.
@artem-lunarg By changing the code very slightly, I was able to produce a WAW instead of a RAW. I've uploaded the repro code for this as well.
Thank you @StefanPoelloth. We successfully reproduced the scenario within our testing framework with a compact unit test. I'll keep posting updates here when we have fixes.
@StefanPoelloth the fix has landed. If it does not fix the issue in your program, please reopen this ticket.
@artem-lunarg I can confirm this fixes the RAW and WAW false positives. Thanks
Environment:
Describe the Issue
I can provide an API dump privately if needed. The message is invalid/incorrect: it specifies 2 different buffers, 0x6eea0c0000000185 and 0x7d8104000000018d, which are both alive (not destroyed). The prior_usage specifies that buffer 0x7d8104000000018d was used with vkCmdCopyBuffer, which is wrong. Using the API dump, I made sure that no vkCmdCopyBuffer was called for buffer 0x7d8104000000018d (it's a mapped buffer, updated through the mapped pointer and used in CmdDispatchIndirect).
Expected behavior
Either a correct sync hazard message or no message.
Valid Usage ID
validation layer: Validation Error: [ SYNC-HAZARD-READ-AFTER-WRITE ] Object 0: handle = 0x2d4ecb7c6c0, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xe4d96472 | vkQueueSubmit(): Hazard READ_AFTER_WRITE for entry 0, VkCommandBuffer 0x2d4ef7fd510[], Submitted access info (submitted_usage: SYNC_COPY_TRANSFER_READ, command: vkCmdCopyBuffer, seq_no: 2, reset_no: 10, resource: VkBuffer 0x6eea0c0000000185[GpuArray MatrixStore _primaryArray]). Access info (prior_usage: SYNC_COPY_TRANSFER_WRITE, write_barriers: SYNC_VERTEX_SHADER_SHADER_STORAGE_READ|SYNC_COMPUTE_SHADER_SHADER_STORAGE_READ, queue: VkQueue 0x2d4ecb7c6c0[], submit: 22, batch: 0, batch_tag: 571, command: vkCmdCopyBuffer, command_buffer: VkCommandBuffer 0x2d4ef775860[], seq_no: 5, reset_no: 8, resource: VkBuffer 0x7d8104000000018d[PrimaryCullPass occludedCount]).