Sometimes crashing during animated layer resize

orion1vi commented 11 months ago

I'm getting a crash when addPresentedHandler is called at:

https://github.com/KhronosGroup/MoltenVK/blob/568cc3acc0e2299931fdaecaaa1fc3ec5b4af281/MoltenVK/MoltenVK/GPUObjects/MVKImage.mm#L1321

Thread 2 Queue : com.Metal.CompletionQueueDispatch (serial)
#0  0x000000010b2e9678 in unsigned int std::__1::__cxx_atomic_fetch_add<unsigned int>(std::__1::__cxx_atomic_base_impl<unsigned int>*, unsigned int, std::__1::memory_order) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.1.sdk/usr/include/c++/v1/atomic:1050
#1  0x000000010b2e95d1 in std::__1::__atomic_base<unsigned int, true>::fetch_add(unsigned int, std::__1::memory_order) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.1.sdk/usr/include/c++/v1/atomic:1719
#2  0x000000010b2e7672 in std::__1::__atomic_base<unsigned int, true>::operator++(int) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.1.sdk/usr/include/c++/v1/atomic:1748
#3  0x000000010b2e7645 in MVKSwapchain::beginPresentation(MVKImagePresentInfo const&) at /Users/orion/Developer/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKSwapchain.mm:248
#4  0x000000010b17f131 in MVKPresentableSwapchainImage::beginPresentation(MVKImagePresentInfo const&) at /Users/orion/Developer/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKImage.mm:1403
#5  0x000000010b17ed9c in MVKPresentableSwapchainImage::addPresentedHandler(id<CAMetalDrawable>, MVKImagePresentInfo, MVKSwapchainSignaler) at /Users/orion/Developer/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKImage.mm:1384
#6  0x000000010b17ec8a in invocation function for block in MVKPresentableSwapchainImage::presentCAMetalDrawable(id<MTLCommandBuffer>, MVKImagePresentInfo) at /Users/orion/Developer/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKImage.mm:1321
#7  0x00007fff20d1e322 in -[MTLToolsCommandBuffer invokeScheduledHandlers] ()
#8  0x00007fff2856b292 in MTLDispatchListApply ()
#9  0x00007fff2856b1f4 in -[_MTLCommandBuffer didScheduleWithStartTime:endTime:error:] ()
#10 0x00007fff28543cd6 in ioAccelCommandQueueBlockFenceCallback ()
#11 0x00007fff22d816cb in IODispatchCalloutFromCFMessage ()
#12 0x00007fff22d8154c in _IODispatchCalloutWithDispatch ()
#13 0x000000011052eee3 in dispatch_mig_server ()
#14 0x000000011050f264 in _dispatch_client_callout ()
#15 0x00000001105120b8 in _dispatch_continuation_pop ()
#16 0x000000011052864a in _dispatch_source_invoke ()
#17 0x00000001105165ee in _dispatch_lane_serial_drain ()
#18 0x00000001105176ba in _dispatch_lane_invoke ()
#19 0x00000001105252c4 in _dispatch_workloop_worker_thread ()
#20 0x00000001105b2b0d in _pthread_wqthread ()
#21 0x00000001105b1ae3 in start_wqthread ()

I suspect 9f64faadbcf490e73e69db8bc3e10154e61f17e5 and f0cb31a12b59f05177f07ab5a46bc9084ba5fbc9. If I remove every commit from 9f64faadbcf490e73e69db8bc3e10154e61f17e5 onward (inclusive), I cannot reproduce the crash. If I remove every commit from 9f64faadbcf490e73e69db8bc3e10154e61f17e5 onward except 9f64faadbcf490e73e69db8bc3e10154e61f17e5, a28437d8f21dff45563eaa550a8331698a32babb, 7fe4963985d8ae44159243d8babff25cf830bca7, and f0cb31a12b59f05177f07ab5a46bc9084ba5fbc9, I can reproduce the crash. Removing only f0cb31a12b59f05177f07ab5a46bc9084ba5fbc9 breaks sizing completely, with this error: [mvk-error] VK_TIMEOUT: MTLCommandBuffer "vkQueueSubmit MTLCommandBuffer on Queue 0-0" execution failed (code 2): Caused GPU Timeout Error (IOAF code 2).

The crash is difficult to reproduce. The way I reproduce it is by repeatedly triggering an animated window resize (Window > Zoom, bound to a key shortcut) while the layer frame is auto-resized to the window frame. For that, autoresizingMask must be set; I use [.layerWidthSizable, .layerHeightSizable], but I could also reproduce it with other masks. If I enable "Reduce motion" in Accessibility preferences, the window zoom is no longer animated and I can't reproduce the crash.

Unfortunately, I can't reproduce it with the Cube demo, and I can't provide a simple example application.

Tested on macOS 11.7 only.

orion1vi commented 9 months ago

When the crash happens, the swapchain and device that beginPresentation uses are NULL, so dereferencing them crashes. Moving addPresentedHandler outside of addScheduledHandler solved this for me.

cdavis5e commented 9 months ago

Er, uh, why did you close this? If there's a problem with our code, and you have a fix for it, you should probably submit it to this repo. We'd really appreciate that.

orion1vi commented 9 months ago

Because it was moved inside the handler here: https://github.com/KhronosGroup/MoltenVK/commit/f0cb31a12b59f05177f07ab5a46bc9084ba5fbc9#diff-ee7f9cb03a11bba9ad9e2f634103f46908e74a8e5e630894a0a94340f312fd1bR1329
But if @billhollings doesn't see a problem with moving it back outside the handler...

billhollings commented 9 months ago

It looks like you have a race condition: you are occasionally destroying the swapchain before the command buffer that presents its images has finished executing, or, in your case, before that command buffer has even been scheduled onto the GPU (which would normally happen almost immediately after submission to the queue).

Are you calling vkDeviceWaitIdle() after you detect the resize, but before you destroy the old swapchain? See demo_resize() in cube.c in the Cube demo for an example of its use.
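To illustrate the ordering suggested above (drain in-flight work, then destroy the old swapchain), here is a minimal C++ sketch under stated assumptions. FakeQueue, FakeSwapchain, and presentThenResize are hypothetical stand-ins, not MoltenVK or Vulkan APIs: each submission runs on a worker thread so its handler fires later, much like a Metal scheduled handler, and FakeQueue::waitIdle() only plays the role that vkDeviceWaitIdle() plays in demo_resize().

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GPU queue: each submission runs on its own
// worker thread, so its "handler" runs asynchronously, some time later.
class FakeQueue {
public:
    void submit(std::function<void()> work) {
        workers_.emplace_back(std::move(work));
    }

    // Plays the role of vkDeviceWaitIdle(): blocks until every submitted
    // piece of work has finished.
    void waitIdle() {
        for (auto& t : workers_) t.join();
        workers_.clear();
    }

    ~FakeQueue() { waitIdle(); }

private:
    std::vector<std::thread> workers_;
};

// Hypothetical swapchain holding an atomic counter, similar in spirit to the
// one incremented in beginPresentation() in the stack trace above.
struct FakeSwapchain {
    std::atomic<unsigned> presentCount{0};
};

// Submits `frames` presents whose handlers touch the swapchain, then
// "resizes" by destroying the old swapchain and creating a new one.
// Returns how many handlers ran before the old swapchain was destroyed.
unsigned presentThenResize(unsigned frames) {
    FakeQueue queue;
    auto swapchain = std::make_unique<FakeSwapchain>();

    for (unsigned i = 0; i < frames; ++i) {
        FakeSwapchain* sc = swapchain.get();  // raw pointer captured by the handler
        queue.submit([sc] {
            std::this_thread::sleep_for(std::chrono::milliseconds(5));  // simulated GPU latency
            sc->presentCount++;  // would be a use-after-free if sc were already destroyed
        });
    }

    queue.waitIdle();  // drain in-flight work BEFORE destroying the old swapchain
    unsigned ran = swapchain->presentCount.load();
    swapchain = std::make_unique<FakeSwapchain>();  // old swapchain destroyed here
    return ran;
}
```

With the wait in place, presentThenResize(4) returns 4, because every pending handler has run before the old swapchain is released. If the waitIdle() call were removed, the handlers could dereference a swapchain that has already been freed, which is the kind of race suspected in this issue.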

billhollings commented 1 month ago

Were you running with MVK_CONFIG_SYNCHRONOUS_QUEUE_SUBMITS=0 (disabled)?

If the current code still causes this problem, PR #2297 may fix it. Please test again with that update, and re-close this issue if it resolves the problem.

KhronosGroup / MoltenVK #2031