Array wrapper may cause a GPU crash?

js6i commented 2 months ago

This is something I found while working on https://github.com/KhronosGroup/MoltenVK/pull/2199. Enabling variable sized arrays caused a bunch of vkd3d tests that otherwise worked fine (modulo validation errors due to declaring huge arrays and binding small buffers to them) to crash the GPU (M1). I tried a few things, like using the regular array<> type instead of spvDescriptor* wrappers, but what ended up fixing the crash was making the entry point variables plain pointers.

Below is one of the kernels that crashed (from vkd3d d3d12 test_register_space), with the plain pointer version #if'd inline:

#pragma clang diagnostic ignored "-Wmissing-prototypes"
#pragma clang diagnostic ignored "-Wunused-variable"

#include <metal_stdlib>
#include <simd/simd.h>
#include <metal_atomic>

using namespace metal;

template<typename T>
struct spvDescriptor
{
    T value;
};

template<typename T>
struct spvDescriptorArray
{
    spvDescriptorArray(const device spvDescriptor<T>* ptr) : ptr(ptr)
    {
    }
    const device T& operator [] (size_t i) const
    {
        return ptr[i].value;
    }
    const device spvDescriptor<T>* ptr;
};

struct push_cb_struct
{
    float4 cb0[1];
    float4 cb1[1];
    uint _m2[1];
};

struct spvDescriptorSetBuffer1
{
    spvDescriptor<texture_buffer<uint>> t0 [[id(0)]][1] /* unsized array hack */;
};

struct spvDescriptorSetBuffer3
{
    spvDescriptor<texture_buffer<uint, access::write>> u0 [[id(0)]][1] /* unsized array hack */;
    // Overlapping binding: spvDescriptor<texture_buffer<uint, access::write>> u2 [[id(0)]][1] /* unsized array hack */;
};

struct spvDescriptorSetBuffer6
{
    spvDescriptor<texture_buffer<uint, access::read_write>> u6 [[id(0)]][1] /* unsized array hack */;
};

kernel void main0(const device spvDescriptorSetBuffer1& spvDescriptorSet1 [[buffer(1)]], const device spvDescriptorSetBuffer3& spvDescriptorSet3 [[buffer(3)]], const device spvDescriptorSetBuffer6& spvDescriptorSet6 [[buffer(6)]], constant push_cb_struct& push_cb [[buffer(9)]])
{
#if CRASHES
    spvDescriptorArray<texture_buffer<uint>> t0 {spvDescriptorSet1.t0};
    spvDescriptorArray<texture_buffer<uint, access::write>> u0 {spvDescriptorSet3.u0};
    spvDescriptorArray<texture_buffer<uint, access::write>> u2 {reinterpret_cast<const device spvDescriptor<texture_buffer<uint, access::write>>*>(&spvDescriptorSet3.u0)};
    spvDescriptorArray<texture_buffer<uint, access::read_write>> u6 {spvDescriptorSet6.u6};
#else // WORKS
    const device texture_buffer<uint> *t0 = &(spvDescriptorSet1.t0[0].value);
    const device texture_buffer<uint, access::write> *u0 = &(spvDescriptorSet3.u0[0].value);
    const device texture_buffer<uint, access::write> *u2 = &(reinterpret_cast<const device spvDescriptor<texture_buffer<uint, access::write>>*>(&spvDescriptorSet3.u0)[0].value);
    const device texture_buffer<uint, access::read_write> *u6 = &(spvDescriptorSet6.u6[0].value);
#endif
    float4 r0;
    r0.x = as_type<float>(t0[0u + push_cb._m2[0u]].read(uint((uint(0) >> 2u))).x);
    u0[6u + push_cb._m2[0u]].write(uint4(as_type<uint>(r0.x)), uint((0u >> 2u)));
    r0.x = as_type<float>(t0[1u + push_cb._m2[0u]].read(uint((uint(0) >> 2u))).x);
    u0[7u + push_cb._m2[0u]].write(uint4(as_type<uint>(r0.x)), uint((0u >> 2u)));
    r0.x = as_type<float>(t0[2u + push_cb._m2[0u]].read(uint(0)).x);
    u0[9u + push_cb._m2[0u]].write(uint4(as_type<uint>(r0.x)), uint((0u >> 2u)));
    r0.x = as_type<float>(t0[3u + push_cb._m2[0u]].read(uint(0)).x);
    u0[8u + push_cb._m2[0u]].write(uint4(as_type<uint>(r0.x)), uint((0u >> 2u)));
    r0.x = as_type<float>(t0[5u + push_cb._m2[0u]].read(uint(((uint(0) * 1u) + (uint(0) >> 2u)))).x);
    u2[10u + push_cb._m2[0u]].write(as_type<uint4>(r0.xxxx), uint(0));
    r0.x = as_type<float>(t0[4u + push_cb._m2[0u]].read(uint(((uint(0) * 1u) + (uint(0) >> 2u)))).x);
    u2[11u + push_cb._m2[0u]].write(as_type<uint4>(r0.xxxx), uint(0));
    u0[12u + push_cb._m2[0u]].write(uint4(as_type<uint>(push_cb.cb0[0u].x)), uint(((uint(0) * 1u) + (uint(0) >> 2u))));
    u0[13u + push_cb._m2[0u]].write(uint4(as_type<uint>(push_cb.cb1[0u].x)), uint(((uint(0) * 1u) + (uint(0) >> 2u))));
    uint _175 = u6[12u + push_cb._m2[0u]].atomic_fetch_add(0u, 1).x;
    r0.x = as_type<float>(_175);
    uint _181 = u6[13u + push_cb._m2[0u]].atomic_fetch_sub(0u, 1).x;
    r0.x = as_type<float>(_181 - 1u);
}

The crash is vkQueueSubmit MTLCommandBuffer on Queue 3-0" execution failed (code 3): Caused GPU Address Fault Error (0000000b:kIOGPUCommandBufferCallbackErrorPageFault

HansKristian-Work commented 2 months ago

This is impossible for me to debug. I don't have any Apple hardware.

Try commented 2 months ago

Hi, @js6i !

Unless there were major changes, since I've worked on spvDescriptorArray, it should not be used in context of emulation-layers such as MoltenVK. The goal of this class is to map runtime-sized array from GLSL to Metal3 argument buffer. Code bellow makes no sense then:

struct spvDescriptorSetBuffer1
{
    spvDescriptor<texture_buffer<uint>> t0 [[id(0)]][1] /* unsized array hack */;
};

This is a mix of T1-argument-buffer(spvDescriptorSetBuffer1) and T2 via spvDescriptor.

AFAIR what you need to to set following setting of compiler:

msl_options.argument_buffers      = true; // emulate descriptor sets via ABuffer
msl_options.argument_buffers_tier = Options::ArgumentBuffersTier::Tier1; // disallow Tier2

js6i commented 2 months ago

@Try thanks for chiming in. Are you saying that msl_options.argument_buffers are inherently in conflict with the Tier2 setting? MoltenVK needs variable sized arrays supported as well - is there a reason to require disabling msl_options.argument_buffers, or maybe we should just not emit the spvDescriptor wrappers in this case?

Besides, the generated code seems fine other than the weird bug that I addressed in #2314.

Try commented 2 months ago

Are you saying that msl_options.argument_buffers are inherently in conflict with the Tier2 setting?

There were not intended to be uses together. Yet, have to say that I didn't followed spirv-cross for several months, and thing maybe different now. Maybe @billhollings knows better, than I am, if there is a use case for argument_buffers + Tier2 in MoltenVK.

In Metal T1-argument buffer is close analog to Vulkan descriptor-set, and T2 - more like descriptor-buffer, allowing pointer to other buffers/textures within the buffer. Here, IMAO is misleading naming has place. msl_options.argument_buffers de facto mean "use T1-argument buffer to emulate vulkan-descriptor set". And argument_buffers_tier=Tier2, mean to emit array of descriptors as T2-argument buffer.

MoltenVK needs variable sized arrays supported as well

Don't think it needed to emulate vulkan. Vulkan descriptor-indexing model always requires to declare upper bound for each runtime-sized array, on C++ side. Even if VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT is in use.

js6i commented 2 months ago

Don't think it needed to emulate vulkan. Vulkan descriptor-indexing model always requires to declare upper bound for each runtime-sized array, on C++ side. Even if VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT is in use.

Yeah, the issue with that is that then binding a smaller buffer (and the descriptor pool only needs to be large enough to fit the variable count) triggers a validation error.

HansKristian-Work commented 2 months ago

Workaround is merged, so should be fine now.

KhronosGroup / SPIRV-Cross

Array wrapper may cause a GPU crash? #2308