KhronosGroup / MoltenVK

MoltenVK is a Vulkan Portability implementation. It layers a subset of the high-performance, industry-standard Vulkan graphics and compute API over Apple's Metal graphics framework, enabling Vulkan applications to run on macOS, iOS and tvOS.
Apache License 2.0

Latest MoltenVK instable on iOS and macOS #2283

Open aerofly opened 3 months ago

aerofly commented 3 months ago

Hello,

we tried to use the latest MoltenVK version 1.2.10 in our games. However, this version is quite unstable and crashes a lot compared to the previous release MoltenVK 1.2.9.

We can't pin down yet what is causing this. We initially thought it had to do with the introduction of argument buffers, but even setting

mvkConfig.useMetalArgumentBuffers = false;

doesn't help. We still get random crashes either in vkQueueSubmit or in mvkUpdateDescriptorSets.

Any idea what new feature or code change could cause this so we can further investigate the possible causes?

aerofly commented 3 months ago

We tried to pinpoint the issue with the latest MoltenVK version, but the results are very random. Again, MoltenVK 1.2.9 works fine with no crashes in our game, but 1.2.10 just doesn't work.

A few crash cases:

1) One crash happens in 'void MVKSamplerDescriptorMixin::write(MVKDescriptorSetLayoutBinding* mvkDSLBind,...', where we get this error:

_mvkSampler->reportError(VK_ERROR_FEATURE_NOT_PRESENT, "vkUpdateDescriptorSets(): Tried to push an immutable sampler.");

2) Then on another occasion we get this:

*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[GCKeyboardInput gpuAddress]: unrecognized selector sent to instance 0x148108700'
*** First throw call stack:
(
    0   CoreFoundation                      0x00000001856fb2ec __exceptionPreprocess + 176
    1   libobjc.A.dylib                     0x00000001851e2788 objc_exception_throw + 60
    2   CoreFoundation                      0x00000001857ad56c -[NSObject(NSObject) __retain_OA] + 0
    3   CoreFoundation                      0x0000000185664f3c ___forwarding___ + 1580
    4   CoreFoundation                      0x0000000185664850 _CF_forwarding_prep_0 + 96
    5   Aerofly FS 4                        0x0000000101ce8894 _ZN22MVKMetalArgumentBuffer9setBufferEPU19objcproto9MTLBuffer11objc_objectmj + 72
    6   Aerofly FS 4                        0x0000000101d774b4 _ZN19MVKBufferDescriptor5writeEP29MVKDescriptorSetLayoutBindingP16MVKDescriptorSetjjmPKv + 252
    7   Aerofly FS 4                        0x0000000101cee3b0 _ZN16MVKDescriptorSet5writeI20VkWriteDescriptorSetEEvPKT_mPKv + 384
    8   Aerofly FS 4                        0x0000000101cedfbc _Z23mvkUpdateDescriptorSetsjPK20VkWriteDescriptorSetjPK19VkCopyDescriptorSet + 120
    9   Aerofly FS 4                        0x0000000101cfe674 vkUpdateDescriptorSets + 80

3) And another one in 'class MVKDescriptorSet : public MVKVulkanAPIDeviceObject {'

bool hasMetalArgumentBuffer() { return _layout->isUsingMetalArgumentBuffers(); };

In this case, the _layout pointer is zero.

Any idea what might be wrong here? Our game passes Vulkan validation on Windows and Linux and we don't observe any issues with MoltenVK 1.2.9.

Our development machine is an Apple M1 MacBook Pro with the latest macOS version.

aerofly commented 3 months ago

Another crash happens if I disable Metal argument buffers. I then get this:

[mvk-error] VK_ERROR_OUT_OF_DEVICE_MEMORY: MTLCommandBuffer "vkQueueSubmit MTLCommandBuffer on Queue 0-0" execution failed (code 3): Caused GPU Address Fault Error (0000000b:kIOGPUCommandBufferCallbackErrorPageFault)
[mvk-info] Encoders for 0x6000022b6940 "vkQueueSubmit MTLCommandBuffer on Queue 0-0":
[mvk-info]  - vkCmdBeginRenderPass RenderEncoder: completed
[mvk-info]  - vkCmdBeginRenderPass RenderEncoder: completed
[mvk-info]  - vkCmdBeginRenderPass RenderEncoder: completed
[mvk-info]  - vkCmdBeginRenderPass RenderEncoder: completed
[mvk-info]  - vkCmdBeginRenderPass RenderEncoder: completed
[mvk-info]  - vkCmdDispatch ComputeEncoder: completed
[mvk-info]  - vkCmdBeginRenderPass RenderEncoder: affected
[mvk-info]  - vkCmdBeginRenderPass RenderEncoder: affected
[mvk-info]  - vkCmdBeginRenderPass RenderEncoder: completed

As it happens so randomly, it's hard to further debug this issue. So any idea would be appreciated.

billhollings commented 3 months ago

Is there some way that we can replicate this locally? Do you have a game demo we can try here?

billhollings commented 3 months ago

Without the ability to test your app, I'm somewhat stabbing in the dark, but PR #2293 may fix some or all of the issues reported here.

Please retest with that PR included, and report results back here. Thanks.

aerofly commented 3 months ago

Thank you for your reply. Unfortunately it's difficult to send you something to check and debug it locally on your side.

I also downloaded the latest PR #2293, but the crashes still happen, and there is no single clear point where they occur. One crash log, for example, is this:

CVDisplayLink (10)
#0  0x000000018556ea60 in __pthread_kill ()
#1  0x0000000105052be8 in pthread_kill ()
#2  0x00000001854b3a30 in abort ()
#3  0x000000018555dd08 in abort_message ()
#4  0x000000018554dfa4 in demangling_terminate_handler() ()
#5  0x00000001851ec1e0 in _objc_terminate() ()
#6  0x000000018555d0cc in std::__terminate(void (*)()) ()
#7  0x0000000185560348 in __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) ()
#8  0x000000018556028c in __cxa_throw ()
#9  0x00000001854dea28 in std::__1::__throw_system_error(int, char const*) ()
#10 0x00000001854d272c in std::__1::mutex::lock() ()
#11 0x00000001074dc5f4 in std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.5.sdk/usr/include/c++/v1/__mutex/lock_guard.h:35
#12 0x00000001074d9438 in std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.5.sdk/usr/include/c++/v1/__mutex/lock_guard.h:34
#13 0x0000000107587740 in MVKBufferView::getMTLTexture() at ..../MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKBuffer.mm:292
#14 0x00000001075e1b9c in MVKTexelBufferDescriptor::write(MVKDescriptorSetLayoutBinding*, MVKDescriptorSet*, unsigned int, unsigned int, unsigned long, void const*) at ..../MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptor.mm:1295
#15 0x00000001074bfbd8 in void MVKDescriptorSet::write<VkWriteDescriptorSet>(VkWriteDescriptorSet const*, unsigned long, void const*) at ..../MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptorSet.mm:444
#16 0x00000001074bf7c0 in mvkUpdateDescriptorSets(unsigned int, VkWriteDescriptorSet const*, unsigned int, VkCopyDescriptorSet const*) at ..../MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptorSet.mm:1116
#17 0x00000001074f0d68 in vkUpdateDescriptorSets at .../MoltenVK/MoltenVK/Vulkan/vulkan.mm:1257

And here another one:

CVDisplayLink (10)
#0  0x00000001851f02d4 in class_data_bits_t::setData(class_rw_t*) ()
#1  0x00000001851cd82c in realizeClassWithoutSwift(objc_class*, objc_class*) ()
#2  0x00000001851efe40 in realizeClassMaybeSwiftMaybeRelock(objc_class*, locker_mixin<lockdebug::lock_mixin<objc_lock_base_t>>&, bool) ()
#3  0x00000001851d2564 in lookUpImpOrForward ()
#4  0x00000001851d1f64 in _objc_msgSend_uncached ()
#5  0x0000000108c98534 in MVKMetalArgumentBuffer::setSamplerState(id<MTLSamplerState>, unsigned int) at /..../MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptorSet.mm:72
#6  0x0000000108dc12e0 in MVKSamplerDescriptorMixin::write(MVKDescriptorSetLayoutBinding*, MVKDescriptorSet*, unsigned int, unsigned int, unsigned long, void const*) at ..../MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptor.mm:1131
#7  0x0000000108dc1690 in MVKCombinedImageSamplerDescriptor::write(MVKDescriptorSetLayoutBinding*, MVKDescriptorSet*, unsigned int, unsigned int, unsigned long, void const*) at ..../MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptor.mm:1215
#8  0x0000000108c9fbd8 in void MVKDescriptorSet::write<VkWriteDescriptorSet>(VkWriteDescriptorSet const*, unsigned long, void const*) at /Users/aerofly/bubu/bubu_molten_vk_test_dont_delete/extern/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptorSet.mm:444
#9  0x0000000108c9f7c0 in mvkUpdateDescriptorSets(unsigned int, VkWriteDescriptorSet const*, unsigned int, VkCopyDescriptorSet const*) at /Users/aerofly/bubu/bubu_molten_vk_test_dont_delete/extern/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptorSet.mm:1116
#10 0x0000000108cd0d68 in vkUpdateDescriptorSets at ..../Vulkan/vulkan.mm:1257

What puzzles me a little is the fact that the call stack shows 'MVKMetalArgumentBuffer::setSamplerState' even though I disable the MoltenVK argument buffer support using this code:

  const char *layer_name = kMVKMoltenVKDriverLayerName;

  const VkBool32 setting_true  = VK_TRUE;
  const VkBool32 setting_false = VK_FALSE;
  const uint32_t setting_prefill_metal_command_buffers = MVK_CONFIG_PREFILL_METAL_COMMAND_BUFFERS_STYLE_NO_PREFILL;
  const uint32_t setting_semaphore_support_style       = MVK_CONFIG_VK_SEMAPHORE_SUPPORT_STYLE_METAL_EVENTS_WHERE_SAFE;
  const uint32_t setting_log_level = 0;

  const VkLayerSettingEXT settings[] =
  {
    { layer_name, "MVK_CONFIG_DEBUG",                                VK_LAYER_SETTING_TYPE_BOOL32_EXT, 1, &setting_true },
    { layer_name, "MVK_CONFIG_DISPLAY_WATERMARK",                    VK_LAYER_SETTING_TYPE_BOOL32_EXT, 1, &setting_true },
    { layer_name, "MVK_CONFIG_FAST_MATH_ENABLED",                    VK_LAYER_SETTING_TYPE_BOOL32_EXT, 1, &setting_true },
    { layer_name, "MVK_CONFIG_SYNCHRONOUS_QUEUE_SUBMITS",            VK_LAYER_SETTING_TYPE_BOOL32_EXT, 1, &setting_true },
    { layer_name, "MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS",           VK_LAYER_SETTING_TYPE_BOOL32_EXT, 1, &setting_false },
    { layer_name, "MVK_CONFIG_SHADER_CONVERSION_FLIP_VERTEX_Y",      VK_LAYER_SETTING_TYPE_BOOL32_EXT, 1, &setting_true },
    { layer_name, "MVK_CONFIG_FULL_IMAGE_VIEW_SWIZZLE",              VK_LAYER_SETTING_TYPE_BOOL32_EXT, 1, &setting_true },
    { layer_name, "MVK_CONFIG_SWAPCHAIN_MIN_MAG_FILTER_USE_NEAREST", VK_LAYER_SETTING_TYPE_BOOL32_EXT, 1, &setting_true },
    { layer_name, "MVK_CONFIG_PREFILL_METAL_COMMAND_BUFFERS",        VK_LAYER_SETTING_TYPE_INT32_EXT,  1, &setting_prefill_metal_command_buffers },
    { layer_name, "MVK_CONFIG_VK_SEMAPHORE_SUPPORT_STYLE",           VK_LAYER_SETTING_TYPE_INT32_EXT,  1, &setting_semaphore_support_style },
    { layer_name, "MVK_CONFIG_LOG_LEVEL",                            VK_LAYER_SETTING_TYPE_INT32_EXT,  1, &setting_log_level }
  };

So it seems like metal argument buffers are still used even though I changed the config value? I do get the same crashes if I enable the Metal argument buffers.

Maybe you can point me in any direction of what might cause these instabilities?

I would like to point out that it happens on both macOS and iOS.

billhollings commented 3 months ago

Looks like some slightly different errors now.

All the errors we've been dealing with are happening in vkUpdateDescriptorSets(), and based on your description of behaviour, are race-condition errors.

The big change in vkUpdateDescriptorSets() behaviour from 1.2.9 to 1.2.10 is that under 1.2.9 the Vulkan resource objects were simply copied into internal descriptor objects but were not otherwise interacted with, whereas 1.2.10 also interacts with the objects to retrieve the Metal resources to insert into the Metal argument buffer. It's this interaction with the Vulkan objects that seems to be the source of all these errors.

Since these are race-condition errors, one possibility is that a Vulkan resource object may have been destroyed before being submitted to vkUpdateDescriptorSets(). This wouldn't be a problem in 1.2.9 because those objects are not interacted with, but it is a problem in 1.2.10, where they are interacted with.

Can you investigate in your code whether it's possible that Vulkan resource objects may be destroyed before being submitted to vkUpdateDescriptorSets()?

The bool hasMetalArgumentBuffer() { return _layout->isUsingMetalArgumentBuffers(); }; error you mention above may also be caused by a descriptor set being freed before being submitted to vkUpdateDescriptorSets().
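One way to investigate this on the application side might be a small debug-only registry of live handles that is consulted just before each vkUpdateDescriptorSets() call. The following is only a sketch under stated assumptions, not MoltenVK code: LiveHandleTracker and its methods are hypothetical names, and handles are treated as plain 64-bit integers so that the example needs no Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical debug-only helper (not part of Vulkan or MoltenVK).
// It records which application-created handles are still alive.
class LiveHandleTracker {
public:
    // Call right after a successful vkCreate*/vkAllocate* call.
    void onCreate(std::uint64_t handle, const std::string& label) {
        std::lock_guard<std::mutex> lock(mutex_);
        live_[handle] = label;
    }

    // Call right before the matching vkDestroy*/vkFree* call.
    void onDestroy(std::uint64_t handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        live_.erase(handle);
    }

    // True if the handle was registered and has not been destroyed since.
    bool isLive(std::uint64_t handle) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return live_.count(handle) != 0;
    }

    // Returns, in input order, every handle in `referenced` that is not live.
    // Intended to be called just before vkUpdateDescriptorSets() with the
    // destination set plus every buffer, buffer view, image view and sampler
    // named in the writes.
    std::vector<std::uint64_t> findStale(
            const std::vector<std::uint64_t>& referenced) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<std::uint64_t> stale;
        for (std::uint64_t h : referenced) {
            if (live_.count(h) == 0) stale.push_back(h);
        }
        return stale;
    }

private:
    mutable std::mutex mutex_;  // the tracker may be called from several threads
    std::unordered_map<std::uint64_t, std::string> live_;
};
```

A non-empty result from findStale() would point at a resource or descriptor set that was destroyed or freed before the update. Note the limits of the approach: a driver may reuse a handle value after destruction, so a recreated object can mask a stale reference, and the tracker only narrows the window of a race rather than closing it.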


Thanks for pointing out the problem with the config settings not being respected when set via VK_EXT_layer_settings. I've confirmed that is a problem, particularly for MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS.

~I'll add a fix for that in a separate PR. In the meantime, you can set MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS as an environment variable when launching the app.~

PR #2294 fixes this config issue.
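For builds that predate that fix, the MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS environment variable could also be set from inside the process, which may be more practical on iOS than setting it at launch. This is a minimal sketch with two assumptions that are not confirmed in this thread: that MoltenVK reads the variable no earlier than Vulkan instance creation, and that the value "0" disables the feature. The helper name disableMetalArgumentBuffersViaEnv is hypothetical.

```cpp
#include <cassert>
#include <cstring>   // std::strcmp
#include <stdlib.h>  // setenv, getenv (setenv is POSIX, available on macOS and iOS)

// Hypothetical helper: asks MoltenVK not to use Metal argument buffers.
// Assumption: "0" means disabled, and the variable is read at or after
// instance creation, so this must run before vkCreateInstance().
// Returns true if the variable is set to "0" afterwards.
inline bool disableMetalArgumentBuffersViaEnv() {
    const char* name = "MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS";
    // The third argument (1) overwrites any value that is already present.
    if (setenv(name, "0", 1) != 0) return false;
    // Read the value back to confirm what the process environment now holds.
    const char* value = getenv(name);
    return value != nullptr && std::strcmp(value, "0") == 0;
}
```

The helper would be called once, early in startup, before any Vulkan call is made.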

aerofly commented 2 months ago

Sorry for the late reply. I downloaded the latest MoltenVK, and VK_EXT_layer_settings now works as advertised.

If I run our game now with MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS disabled (using VK_EXT_layer_settings), everything works fine and we do not observe any crashes on macOS and iOS.

However if I enable MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS I still observe the crashes.

So it seems like our crashes are indeed linked to the new metal argument buffer support.

If we find some more time, we will investigate this issue further and try to take into account what you said in your previous post.