KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html

Crash for VideoSession #8267

Open locke-lunarg opened 1 month ago

locke-lunarg commented 1 month ago

Environment:

Describe the Issue

Sample code: https://github.com/nvpro-samples/vk_video_samples I used gfxr to capture vk_video_encoder and vk_video_decoder, and then replayed the captures. The replay crashed in the Validation Layers. It replays successfully if VVL is disabled.

Here are the capture files, which may help reproduce the issues: gfxr_vk-video-enc-test-n3070.zip gfxr_vk-video-dec-test-n3070.zip

For vk_video_encoder (gfxr_vk-video-enc-test-n3070): It crashed in VideoSessionDeviceState::IsSlotActive. is_active_.size() is 1, but slot_index is 16. The stack:

Expression: vector<bool> subscript out of range

VkLayer_khronos_validation.dll!vvl::VideoSessionDeviceState::IsSlotActive(int slot_index) Line 430  C++
VkLayer_khronos_validation.dll!CoreChecks::PreCallRecordCmdBeginVideoCodingKHR::__l9::<lambda_1>::operator()(const ValidationStateTracker & dev_data, const vvl::VideoSession * vs_state, vvl::VideoSessionDeviceState & dev_state, bool do_validate) Line 3567 C++
    [External Code] 
VkLayer_khronos_validation.dll!CommandBufferSubmitState::Validate(const Location & loc, const vvl::CommandBuffer & cb_state, unsigned int perf_pass) Line 96    C++
VkLayer_khronos_validation.dll!CoreChecks::PreCallValidateQueueSubmit(VkQueue_T * queue, unsigned int submitCount, const VkSubmitInfo * pSubmits, VkFence_T * fence, const ErrorObject & error_obj) Line 166    C++
VkLayer_khronos_validation.dll!vulkan_layer_chassis::QueueSubmit(VkQueue_T * queue, unsigned int submitCount, const VkSubmitInfo * pSubmits, VkFence_T * fence) Line 1396   C++
gfxrecon-replay.exe!gfxrecon::decode::VulkanReplayConsumerBase::OverrideQueueSubmit(VkResult(*)(VkQueue_T *, unsigned int, const VkSubmitInfo *, VkFence_T *) func, unsigned __int64 index, VkResult original_result, const gfxrecon::decode::QueueInfo * queue_info, unsigned int submitCount, const gfxrecon::decode::StructPointerDecoder<gfxrecon::decode::Decoded_VkSubmitInfo> * pSubmits, const gfxrecon::decode::FenceInfo * fence_info) Line 3344  C++
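
The failure mode above (a `vector<bool>` of size 1 indexed with 16) can be sketched in isolation. The following is a minimal illustration, not the layer's actual code: only the names `IsSlotActive` and `is_active_` come from the report, while `SlotStateSketch`, `Activate`, `InRange`, and `CheckReportedCase` are hypothetical. It shows one way a lookup could treat an out-of-range index as "inactive" instead of indexing past the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for per-session slot state (not the real VVL class).
class SlotStateSketch {
  public:
    explicit SlotStateSketch(std::size_t slot_count) : is_active_(slot_count, false) {}

    // Marks a slot active; silently ignores indices outside the vector.
    void Activate(int32_t slot_index) {
        if (InRange(slot_index)) is_active_[static_cast<std::size_t>(slot_index)] = true;
    }

    // Out-of-range indices report "inactive" rather than reading past the end.
    bool IsSlotActive(int32_t slot_index) const {
        return InRange(slot_index) && is_active_[static_cast<std::size_t>(slot_index)];
    }

  private:
    bool InRange(int32_t slot_index) const {
        return slot_index >= 0 && static_cast<std::size_t>(slot_index) < is_active_.size();
    }

    std::vector<bool> is_active_;
};

// Mirrors the numbers from the report: size 1, slot_index 16.
void CheckReportedCase() {
    SlotStateSketch state(1);
    state.Activate(0);
    assert(state.IsSlotActive(0));
    assert(!state.IsSlotActive(16));
    assert(!state.IsSlotActive(-1));
}
```

Whether returning "inactive" is the right policy for a validation layer, as opposed to reporting an error, is a design choice this sketch does not settle.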

For vk_video_decoder (gfxr_vk-video-dec-test-n3070): It crashed on `completed_.set_value()` of `vvl::Fence::Retire()`. The value of `std::future_error` is `promise_already_satisfied`.

The shared state already stores a value or exception. The error category is set to [promise_already_satisfied](https://omegaup.com/docs/cpp/en/cpp/thread/future_errc.html).
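
For readers unfamiliar with this error, here is a small standalone sketch of how it arises in standard C++: calling `set_value()` a second time on the same `std::promise<void>` throws `std::future_error` with `promise_already_satisfied`. The function and class names below are hypothetical and nothing here is taken from the VVL sources; `OneShotSignal` only illustrates one possible way to make a completion signal idempotent, assuming a double retire is what happens during replay.

```cpp
#include <atomic>
#include <cassert>
#include <future>

// Returns true if a second set_value() fails with promise_already_satisfied.
bool SecondSetValueThrowsAlreadySatisfied() {
    std::promise<void> completed;
    completed.set_value();  // first call satisfies the shared state
    try {
        completed.set_value();  // second call on the same promise
    } catch (const std::future_error& e) {
        return e.code() == std::future_errc::promise_already_satisfied;
    }
    return false;
}

// Hypothetical guard: only the first Signal() call touches the promise.
class OneShotSignal {
  public:
    // Returns true if this call performed the signal, false if it was a repeat.
    bool Signal() {
        if (signaled_.exchange(true)) return false;
        completed_.set_value();
        return true;
    }

    std::future<void> GetFuture() { return completed_.get_future(); }

  private:
    std::atomic<bool> signaled_{false};
    std::promise<void> completed_;
};
```

A guard like this would hide a double retire rather than explain it, so it is better read as a diagnostic aid than as a proposed fix.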

The stack:

Exception thrown at 0x00007FFE50B7F20C in gfxrecon-replay.exe: Microsoft C++ exception: std::future_error at memory location 0x00000040B80FF310.

VkLayer_khronos_validation.dll!std::promise<void>::set_value() Line 1210    C++
VkLayer_khronos_validation.dll!vvl::Fence::Retire() Line 86 C++
VkLayer_khronos_validation.dll!vvl::Queue::Retire(vvl::QueueSubmission & submission) Line 212   C++
VkLayer_khronos_validation.dll!vvl::Queue::ThreadFunc() Line 225    C++

artem-lunarg commented 1 month ago

I assigned the synchronization label because of the queue thread, but this could also be video-specific.

spencer-lunarg commented 1 month ago

cc @aqnuep for heads up

aqnuep commented 1 month ago

I don't think that GFXReconstruct supports video capture/replay, so maybe the issue lies there.

Looking at the place the issue is triggered, it does seem that the replay uses an out-of-bounds DPB slot index.

Sure, arguably we could add an additional bounds check somewhere in VVL that should solve this issue. I'll make sure to do so, although I'm surprised this could even happen, considering that there is supposed to be bounds checking for that. But the capture certainly does something illegal, or the capture itself is not correct (maybe a negative slot index is somehow stored in an unsigned value at some point).
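
To illustrate the hypothesized mechanism only (the reported index was 16, so this does not account for that exact value): in Vulkan video, `VkVideoReferenceSlotInfoKHR::slotIndex` is a signed `int32_t` and may be negative. The sketch below, with hypothetical helper names, shows what a negative index becomes if it is stored in an unsigned 32-bit value before any sign check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: what a signed slot index becomes once stored unsigned.
constexpr uint32_t StoreAsUnsigned(int32_t slot_index) {
    return static_cast<uint32_t>(slot_index);
}

// -1 wraps to the largest 32-bit value, far beyond any real slot count.
static_assert(StoreAsUnsigned(-1) == 0xFFFFFFFFu, "negative index wraps around");

// Checking the sign before converting keeps the negative case distinguishable.
constexpr bool IsValidSlot(int32_t slot_index, std::size_t slot_count) {
    return slot_index >= 0 && static_cast<std::size_t>(slot_index) < slot_count;
}

void CheckWraparound() {
    assert(!IsValidSlot(-1, 1));
    assert(IsValidSlot(0, 1));
    assert(StoreAsUnsigned(-1) > 16u);
}
```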

aqnuep commented 1 month ago

I've created a repro case, and it seems to confirm my suspicion that this out-of-bounds access should never occur in a normal situation, but I'll post a PR with the new test case and an additional bounds check nonetheless.

This seems like a capture/replay issue though.

aqnuep commented 1 month ago

I've created a PR to add more bounds checks, but I still think this is a capture/replay issue.

Don't forget that video session objects are stateful objects with device state, so simple capture and replay may not work as you'd expect, even if it does not crash the drivers.

spencer-lunarg commented 1 month ago

@locke-lunarg with the bounds checks, the latest VVL will probably not crash there anymore... I assume we can close this issue then?

locke-lunarg commented 1 month ago

gfxr_vk-video-enc-test-n3070 is good. But gfxr_vk-video-dec-test-n3070 still crashed on `completed_.set_value()` of `vvl::Fence::Retire()`.

spencer-lunarg commented 1 month ago

Just to confirm, does gfxr_vk-video-dec-test-n3070 replay successfully without validation turned on?

locke-lunarg commented 1 month ago

Yes. For now, I just removed completed_.set_value() to replay with vvl. It helped.

artem-lunarg commented 1 month ago

I'm going to add the synchronization label back; it looks like there is still an issue there.