KhronosGroup / MoltenVK

MoltenVK is a Vulkan Portability implementation. It layers a subset of the high-performance, industry-standard Vulkan graphics and compute API over Apple's Metal graphics framework, enabling Vulkan applications to run on macOS, iOS and tvOS.

Crash when deleting lost device #2270

Open TheMrButcher opened 3 months ago

TheMrButcher commented 3 months ago

I am using MoltenVK on iOS, and I get a lot of VK_ERROR_DEVICE_LOST errors while the application is running. I am trying to handle this error code by recreating the logical device when it happens. The problem is that I can't use vkDeviceWaitIdle to wait for all work to stop. According to the spec, vkDeviceWaitIdle can and will return VK_ERROR_DEVICE_LOST if the device is lost. And according to the MoltenVK code, it does nothing on a lost device.

So it seems that I have no way to wait until all command buffer submissions have finished. Without this wait, I get a lot of crashes with the following trace, in a thread managed by MoltenVK (or Metal), right after the device is destroyed:

SIGABRT 0x0000000000000000
Pure virtual function called!
abort() called
SWIFT TASK CONTINUATION MISUSE: navigate(to:) leaked its continuation!

Crashed: com.Metal.CompletionQueueDispatch
0  libsystem_kernel.dylib   __pthread_kill + 8
1  libsystem_pthread.dylib  pthread_kill + 208
2  libsystem_c.dylib        abort + 124
3  libc++abi.dylib          __cxxabiv1::__aligned_malloc_with_fallback(unsigned long) + 0
4  libc++abi.dylib          __cxa_deleted_virtual + 0
5  MyApp                    ___ZN13MVKBaseObject13reportMessageEPS_17MVKConfigLogLevelPKcPc + 100
6  MyApp                    ___ZN13MVKBaseObject11reportErrorEPS_8VkResultPKcPc + 180
7  MyApp                    ___ZN13MVKBaseObject11reportErrorE8VkResultPKcz + 36
8  MyApp                    ___ZN31MVKQueueCommandBufferSubmission28commitActiveMTLCommandBufferEb_block_invoke + 196
9  Metal                    MTLDispatchListApply + 44
10 Metal                    -[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:] + 596
11 IOGPU                    -[IOGPUMetalCommandBuffer didCompleteWithStartTime:endTime:error:] + 216
12 Metal                    -[_MTLCommandQueue commandBufferDidComplete:startTime:completionTime:error:] + 132
13 IOGPU                    __54-[IOGPUMetalCommandQueue _submitCommandBuffers:count:]_block_invoke.22 + 164
14 IOGPU                    __IOGPUNotificationQueueSetDispatchQueue_block_invoke + 156

TheMrButcher commented 2 months ago

Any update about this problem?

cdavis5e commented 2 months ago

It looks like there's an object lifetime issue here: an object has been destroyed while Metal still has an outstanding command buffer that refers to it, so when the completion handler tries to report the failure, the app crashes.

I wonder if, once the device is lost, completion handlers for any remaining outstanding command buffers should just exit as quickly as possible, without even trying to report the error to the log. Regardless, we're definitely missing a call to MVKBaseObject::retain() somewhere in MVKQueueCommandBufferSubmission.
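
To illustrate both ideas above (retain the object across the asynchronous completion handler, and skip error reporting once the device is lost), here is a minimal, self-contained C++ sketch. It is not MoltenVK code: `RefCounted`, `Device`, `FakeQueue`, `submit` and `runScenario` are hypothetical stand-ins invented for this example, and the fake queue only simulates the way Metal invokes completion handlers later on another thread.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Counters used only to observe what the scenario did (hypothetical).
struct Stats {
    int liveDevices = 0;        // Device objects currently alive
    int errorsReported = 0;     // calls to Device::reportError
    int aliveAfterDestroy = 0;  // liveDevices right after the app "destroys" the device
};

// Hypothetical stand-in for a reference-counted base class.
class RefCounted {
public:
    void retain() { _refs.fetch_add(1, std::memory_order_relaxed); }
    void release() {
        // The last release deletes the object.
        if (_refs.fetch_sub(1, std::memory_order_acq_rel) == 1) { delete this; }
    }
protected:
    virtual ~RefCounted() = default;
private:
    std::atomic<int> _refs{1};  // the creator holds the first reference
};

// Hypothetical device object that completion handlers report errors to.
class Device : public RefCounted {
public:
    explicit Device(Stats* stats) : _stats(stats) { ++_stats->liveDevices; }
    void markLost() { _lost.store(true); }
    bool isLost() const { return _lost.load(); }
    void reportError() { ++_stats->errorsReported; }
protected:
    ~Device() override { --_stats->liveDevices; }
private:
    Stats* _stats;
    std::atomic<bool> _lost{false};
};

// Simulated GPU queue: stores completion handlers and runs them later.
struct FakeQueue {
    std::vector<std::function<void(bool)>> pending;
    void completeAll(bool failed) {
        auto handlers = std::move(pending);
        pending.clear();
        for (auto& handler : handlers) { handler(failed); }
    }
};

// Retain the device before handing the handler to the queue, so the
// device cannot be deleted while the handler is still outstanding.
void submit(Device* dev, FakeQueue& queue) {
    dev->retain();
    queue.pending.push_back([dev](bool failed) {
        // Once the device is lost, exit quickly without reporting.
        if (failed && !dev->isLost()) { dev->reportError(); }
        dev->release();  // balance the retain taken at submission time
    });
}

// The application drops its reference while two handlers are pending.
Stats runScenario() {
    Stats stats;
    FakeQueue queue;
    Device* dev = new Device(&stats);
    submit(dev, queue);
    submit(dev, queue);
    dev->markLost();
    dev->release();                              // app-side destruction
    stats.aliveAfterDestroy = stats.liveDevices; // handlers keep it alive
    queue.completeAll(true);                     // handlers run afterwards
    return stats;
}
```

In this sketch the device outlives the application's final `release()` until the last pending handler has run, so the handlers never touch a destroyed object. Whether the real fix belongs on the device, the queue, or the submission object is a decision for the MoltenVK maintainers; the sketch only shows the general retain/release pattern.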