KhronosGroup / MoltenVK

MoltenVK is a Vulkan Portability implementation. It layers a subset of the high-performance, industry-standard Vulkan graphics and compute API over Apple's Metal graphics framework, enabling Vulkan applications to run on macOS, iOS and tvOS.
Apache License 2.0
4.74k stars 414 forks source link

Frequent MVKCmdWaitEvents crashes #2319

Open tycho opened 2 weeks ago

tycho commented 2 weeks ago

This is happening on an Apple M3 Max with MoltenVK rev dad38515, ANGLE rev d51c251604, macOS 15.0 Beta (24A5320a), Xcode 16.0 beta 6 (16A5230g).

A few different crashes happening in MVKCmdWaitEvents due to a null pointer dereference somewhere, somehow:

https://sentry.uplinklabs.net/share/issue/23d442788de24281a469520d0007b504/ https://sentry.uplinklabs.net/share/issue/6c38cdefc0994c4dbb83bcc3d1d9bb22/ https://sentry.uplinklabs.net/share/issue/3d30f207a90340a18b2ac9c771be4c48/

Also happens on an earlier game build with the same MVK revision on an Apple M2 Max:

https://sentry.uplinklabs.net/share/issue/791bf76a3dc64c93a90aa7e219d57c39/

(In the above stack traces click the "Show N more frames" link to see where it died within MVK.)

It seems pretty easy to repro, it happens on game startup. I'd be happy to share a game binary for developers to debug the issue, but I would need to share the build privately. The game demo builds which I could share publicly do not have the "bootloaders" which retail/debug builds have and are triggering the issue.

Running with MTL_DEBUG_LAYER=1 in environment yields one of these two crashes:

-[IOGPUMetalCommandBuffer encodeWaitForEvent:value:timeout:]:494: failed assertion `encodeWaitForEvent:value: with uncommitted encoder'
-[MTLDebugRenderCommandEncoder setVertexBufferOffset:attributeStride:atIndex:]:1875: failed assertion `Set Vertex Buffer Offset Validation
index(28) must have an existing buffer.

This didn't happen around a month ago when I last built MoltenVK + ANGLE. Not sure where the blame lies exactly but it has been stable on other platforms using ANGLE's Vulkan backend, so I suspect it's MVK breaking in some way.

tycho commented 2 weeks ago

The second validation error I mentioned,

-[MTLDebugRenderCommandEncoder setVertexBufferOffset:attributeStride:atIndex:]:1875: failed assertion `Set Vertex Buffer Offset Validation
index(28) must have an existing buffer.

occurs when MVKMTLBufferBinding::justOffset is true, and this dodges that particular validation error:

diff --git a/MoltenVK/MoltenVK/Commands/MVKMTLResourceBindings.h b/MoltenVK/MoltenVK/Commands/MVKMTLResourceBindings.h
index 97ddf4ac..942b393b 100644
--- a/MoltenVK/MoltenVK/Commands/MVKMTLResourceBindings.h
+++ b/MoltenVK/MoltenVK/Commands/MVKMTLResourceBindings.h
@@ -89,7 +89,7 @@ typedef struct MVKMTLBufferBinding {
         } else if (offset != other.offset || stride != other.stride) {
             offset = other.offset;
                        stride = other.stride;
-            justOffset = !isOverridden && (!isDirty || justOffset);
+            justOffset = false; //!isOverridden && (!isDirty || justOffset);
                        isOverridden = false;
             isDirty = true;
         }

Obviously not the right solution, but maybe someone can figure out why _mtlRenderEncoder setVertexBuffer hasn't been called for that index before it tries to do _mtlRenderEncoder setVertexBufferOffset. I'm not clear on how it does the state tracking for what's already bound at the appropriate indices.

Still digging into the first validation error, but it's harder to hit in debug builds.

tycho commented 2 weeks ago

For the encodeWaitForEvent crash, this seems to get around it but I think there's got to be more to it:

diff --git a/MoltenVK/MoltenVK/Commands/MVKCmdPipeline.mm b/MoltenVK/MoltenVK/Commands/MVKCmdPipeline.mm
index 697cca24..d750dcd6 100644
--- a/MoltenVK/MoltenVK/Commands/MVKCmdPipeline.mm
+++ b/MoltenVK/MoltenVK/Commands/MVKCmdPipeline.mm
@@ -617,6 +617,7 @@ VkResult MVKCmdWaitEvents<N>::setContent(MVKCommandBuffer* cmdBuff,

 template <size_t N>
 void MVKCmdWaitEvents<N>::encode(MVKCommandEncoder* cmdEncoder) {
+       cmdEncoder->endCurrentMetalEncoding();
        for (MVKEvent* mvkEvt : _mvkEvents) {
                mvkEvt->encodeWait(cmdEncoder->_mtlCmdBuffer);
        }
billhollings commented 6 days ago

A few different crashes happening in MVKCmdWaitEvents due to a null pointer dereference somewhere, somehow:

Unfortunately, I'm not able to open the first two links above. Can you regenerate and repost them, please? The 3rd & 4th links are working.

Screenshot 2024-09-09 at 8 12 49 PM