KhronosGroup / MoltenVK

MoltenVK is a Vulkan Portability implementation. It layers a subset of the high-performance, industry-standard Vulkan graphics and compute API over Apple's Metal graphics framework, enabling Vulkan applications to run on macOS, iOS and tvOS.
Apache License 2.0

MoltenVK in VulkanSDK 1.2.162.0 and onwards causes deadlocks if sample shading is enabled for a pipeline. #1311

Open kondrak opened 3 years ago

kondrak commented 3 years ago

Hi!

I recently updated my project to the latest VulkanSDK on a Mac and ran into an issue with shaders that use gl_FragCoord: whenever I attempt to create a pipeline using such a shader, vkCreateGraphicsPipelines never returns. There are no validation errors or anything else that would indicate a spec violation or some other problem. This issue doesn't occur on Windows or Linux, and the MoltenVK changelog linked from 1.2.162.0 points to some commits related to FragCoord, so I'm guessing this could be related somehow. The project I'm referring to is vkQuake2: https://github.com/kondrak/vkQuake2 and the problematic shader is shaders/world_warp.frag, but the issue occurs in any other shader as soon as I add a reference to gl_FragCoord to it. Below is the simplest shader in the project that causes the problem (it's a modified version of shaders/basic.frag):

#version 450

layout(set = 0, binding = 0) uniform sampler2D sTexture;

layout(location = 0) in vec2 texCoord;
layout(location = 1) in vec4 color;
layout(location = 2) in float aTreshold;

layout(location = 0) out vec4 fragmentColor;

void main()
{
    vec2 uv = gl_FragCoord.xy;
    fragmentColor = texture(sTexture, uv);
}

When a pipeline using the above fragment shader is created, vkCreateGraphicsPipelines never returns and the program locks up, in this particular case in the QVk_CreatePipeline function. The problem doesn't occur with VulkanSDK 1.2.154 or lower.

I wanted to prepare a sample app using either the Khronos samples or anything else available, but nothing seems to work out of the box on a Mac, so hopefully this information is enough for now. vkQuake2 can be built quickly from either the command line or Xcode, so it should be easy to launch and verify the problem.

billhollings commented 3 years ago

I am not able to replicate this issue.

I tested this on the Vulkan-Samples texture_loading sample, by modifying the fragment shader line:

vec4 color = texture(samplerColor, inUV, inLodBias);

to:

vec2 uv = gl_FragCoord.xy;
vec4 color = texture(samplerColor, uv, inLodBias);

I've tested using Vulkan SDK 1.2.170, both with the included MoltenVK version and with the latest version of MoltenVK from the repository, on 3 test environments:

- MacBook Air M1 (Apple Silicon)
- MacBook Pro Radeon
- MacBook Pro Intel HD 630

Is there some other way to provide an environment that causes this?

kondrak commented 3 years ago

I'll try to find a repro on a smaller scale application.

kondrak commented 3 years ago

I haven't checked a different sample yet, but I see that on some other users' machines (specifically an Apple M1 in this case) the problem also doesn't occur with my project, so until I verify it, it seems to be an issue isolated to one specific configuration. The Mac I'm running this on is a 2014 MacBook Pro with Intel ~HD 630~ Iris Pro, running macOS 10.15.

kondrak commented 3 years ago

I checked the samples and do not get a repro either, so at this stage I'll be looking at what difference might cause the issue on my side. Let's leave this open for now if that's OK. Once I have some time I will drill down a bit more and report here, so that others with a similar issue might benefit from it.

kondrak commented 3 years ago

FWIW, it seems that whatever is causing this problem is an unmet wait on a condition variable somewhere in vulkan_layer_chassis::CreateGraphicsPipelines that is never signaled, at least based on the asm output in the debugger.

billhollings commented 3 years ago

> somewhere in vulkan_layer_chassis::CreateGraphicsPipelines

Sounds like you have a number of layers in play. If you are able to build and link a Debug mode version of MoltenVK, that might help you tunnel into it further, if the issue is down in the MoltenVK code.

kondrak commented 3 years ago

Thanks! I'll try it out

kondrak commented 3 years ago

I just recompiled my project with the latest MoltenVK debug target and this is what I get:

void MVKMetalCompiler::compile(unique_lock<mutex>& lock, dispatch_block_t block) {
    MVKAssert( _startTime == 0, "%s compile occurred already in this instance. Instances of %s should only be used for a single compile activity.", _compilerType.c_str(), getClassName().c_str());

    MVKDevice* mvkDev = _owner->getDevice();
    _startTime = mvkDev->getPerformanceTimestamp();

    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{ @autoreleasepool { block(); } });

    // Limit timeout to avoid overflow since wait_for() uses wait_until()
    chrono::nanoseconds nanoTimeout(min(mvkConfig()->metalCompileTimeout, kMVKUndefinedLargeUInt64));
    _blocker.wait_for(lock, nanoTimeout, [this]{ return _isCompileDone; });

Whenever a shader uses gl_FragCoord, this function hangs at the final _blocker.wait_for call. I don't yet have any insight into why this happens or what could be causing it.
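
For readers less familiar with the pattern, the wait in question is a predicate-guarded std::condition_variable::wait_for. Below is a minimal standalone sketch of that pattern, not MoltenVK code: waitForWorker, workerSignals and isDone are hypothetical names used only for illustration. It shows that if the worker never sets the flag, the caller stays blocked until the timeout expires, so with a very large timeout a missing signal looks like a hang.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a "dispatch work, then wait for completion" routine.
// Returns true if the worker signaled completion before the timeout expired,
// false if the timeout elapsed with the flag still unset.
bool waitForWorker(bool workerSignals, std::chrono::milliseconds timeout) {
    std::mutex mtx;
    std::condition_variable blocker;
    bool isDone = false;

    std::thread worker([&] {
        if (!workerSignals) {
            return;  // models a completion callback that never fires
        }
        {
            std::lock_guard<std::mutex> guard(mtx);
            isDone = true;  // the flag must be written under the same mutex
        }
        blocker.notify_all();
    });

    bool finished = false;
    {
        std::unique_lock<std::mutex> lock(mtx);
        // The predicate is re-checked on every wakeup; wait_for returns its
        // final value, so false here means "timed out without completion".
        finished = blocker.wait_for(lock, timeout, [&] { return isDone; });
    }
    worker.join();
    return finished;
}
```

Calling waitForWorker(false, ...) with a short timeout returns false once the timeout elapses; the same call with a timeout of hours would simply appear stuck, which matches the symptom described here.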

kondrak commented 3 years ago

The exact stacktrace whenever this happens:

#0  0x000000010f9e72bf in MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer)::$_3::operator()() const at /Users/krzysztof/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKSync.mm:525
#1  0x000000010f9e7249 in bool std::__1::condition_variable::wait_until<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> >, MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer)::$_3>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > > const&, MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer)::$_3) at /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/__mutex_base:436
#2  0x000000010f9dd3d7 in bool std::__1::condition_variable::wait_for<long long, std::__1::ratio<1l, 1000000000l>, MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer)::$_3>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > const&, MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer)::$_3) at /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/__mutex_base:482
#3  0x000000010f9dd10e in MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer) at /Users/krzysztof/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKSync.mm:525
#4  0x000000010f911be4 in MVKRenderPipelineCompiler::newMTLRenderPipelineState(MTLRenderPipelineDescriptor*) at /Users/krzysztof/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKPipeline.mm:2153
#5  0x000000010f911af8 in MVKGraphicsPipeline::getOrCompilePipeline(MTLRenderPipelineDescriptor*, id<MTLRenderPipelineState>&) at /Users/krzysztof/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKPipeline.mm:447
#6  0x000000010f911755 in MVKGraphicsPipeline::initMTLRenderPipelineState(VkGraphicsPipelineCreateInfo const*, mvk::SPIRVTessReflectionData const&) at /Users/krzysztof/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKPipeline.mm:506
#7  0x000000010f910361 in MVKGraphicsPipeline::MVKGraphicsPipeline(MVKDevice*, MVKPipelineCache*, MVKPipeline*, VkGraphicsPipelineCreateInfo const*) at /Users/krzysztof/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKPipeline.mm:414
#8  0x000000010f911a85 in MVKGraphicsPipeline::MVKGraphicsPipeline(MVKDevice*, MVKPipelineCache*, MVKPipeline*, VkGraphicsPipelineCreateInfo const*) at /Users/krzysztof/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKPipeline.mm:332
#9  0x000000010f8fa261 in VkResult MVKDevice::createPipelines<MVKGraphicsPipeline, VkGraphicsPipelineCreateInfo>(VkPipelineCache_T*, unsigned int, VkGraphicsPipelineCreateInfo const*, VkAllocationCallbacks const*, VkPipeline_T**) at /Users/krzysztof/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDevice.mm:3274
#10 0x000000010f837338 in vkCreateGraphicsPipelines at /Users/krzysztof/MoltenVK/MoltenVK/MoltenVK/Vulkan/vulkan.mm:996
#11 0x00000001107994e2 in DispatchCreateGraphicsPipelines(VkDevice_T*, VkPipelineCache_T*, unsigned int, VkGraphicsPipelineCreateInfo const*, VkAllocationCallbacks const*, VkPipeline_T**) ()
#12 0x00000001106797b1 in vulkan_layer_chassis::CreateGraphicsPipelines(VkDevice_T*, VkPipelineCache_T*, unsigned int, VkGraphicsPipelineCreateInfo const*, VkAllocationCallbacks const*, VkPipeline_T**) ()

kondrak commented 3 years ago

I just tried the same approach using MoltenVK commit b9b78de (release commit for 1.2.154) and the issue doesn't occur, so whatever is causing this was introduced somewhere between 1.2.154 and 1.2.162.

kondrak commented 3 years ago

@billhollings Using git bisect I managed to find the commit that started causing issues:

https://github.com/KhronosGroup/MoltenVK/commit/85416e297e291e91ce9f7c4237e6082d66849070

I verified it in a few other applications that use sample shading, and the issue occurs whenever it's enabled. When I run my application with sample shading disabled, everything works fine. Other applications also show the problem even if gl_FragCoord is not used (as long as sample shading is enabled), so it seems that I ran into some specific odd case. A quick check indicates that whenever sample shading is enabled, any value of minSampleShading greater than zero causes the lockup during compilation. FWIW, pSampleMask is NULL in my case.
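
For anyone trying to reproduce this, the relevant state lives in VkPipelineMultisampleStateCreateInfo. The fragment below is only a sketch: makeMultisampleState is a hypothetical helper (not from vkQuake2 or Vulkan-Samples), and the 4x sample count and the minSampleShading value of 1.0 are assumptions chosen for illustration.

```cpp
#include <vulkan/vulkan.h>

// Hypothetical helper: builds the multisample state with sample shading
// switched on or off so the two pipeline configurations can be compared.
VkPipelineMultisampleStateCreateInfo makeMultisampleState(bool enableSampleShading) {
    VkPipelineMultisampleStateCreateInfo info{};
    info.sType = VK_STRUCTURE_TYPE_PIPELINE_MULTISAMPLE_STATE_CREATE_INFO;
    info.rasterizationSamples = VK_SAMPLE_COUNT_4_BIT;  // assumed sample count
    // Enabling this requires the sampleRateShading device feature.
    info.sampleShadingEnable = enableSampleShading ? VK_TRUE : VK_FALSE;
    info.minSampleShading = enableSampleShading ? 1.0f : 0.0f;  // assumed value
    info.pSampleMask = nullptr;  // NULL, as in the case described above
    info.alphaToCoverageEnable = VK_FALSE;
    info.alphaToOneEnable = VK_FALSE;
    return info;
}
```

Passing the result as pMultisampleState in VkGraphicsPipelineCreateInfo with enableSampleShading set to true versus false should be enough to toggle between the hanging and the working configuration on an affected machine.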

kondrak commented 3 years ago

I managed to reproduce the deadlock using Vulkan-Samples' msaa sample (having enabled sample shading in framework/rendering/pipeline_state.h beforehand). The problem occurs once I tick the "post-processing" checkbox. For completeness, here are the full specs of the hardware I'm running on:

- MacBook Pro (Retina, 15-inch, Mid 2015)
- 2.2 GHz Quad-Core Intel Core i7
- 16 GB 1600 MHz DDR3
- Intel Iris Pro 1536 MB
- macOS Big Sur 11.2.3

I think this is as much information as I can provide at this point - please let me know if you need anything else! :)

kondrak commented 3 years ago

@cdavis5e I only just realized that you authored the commit in question. Do you have any idea what the exact problem here might be?

cdavis5e commented 3 years ago

All that change does is offset FragCoord by the sample position, as required by Vulkan, when the shader runs at sample rate. I'm wondering if this is a bug in the Metal driver's pipeline compiler. That condition variable is supposed to get signaled when the compiler finishes, but it looks like that's not happening.

I think we should file a feedback with Apple over this. Would you like to do it, or do you want me to do it? If you want me to do it, I'll need a System Profiler report from your system, as well as a sysdiagnose(8) dump from just after you reproduced it.

Also, we might be able to work around this. Does changing SPIRV-Cross to put the fixed-up position in a local variable, instead of changing the [[position]] parameter, fix the hang?
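
To make the two code shapes concrete, here is a plain C++ sketch of the arithmetic rather than actual SPIRV-Cross or Metal output. It assumes the incoming position is the pixel center and that the sample position is given in [0, 1) relative to the pixel's top-left corner; Vec2, fixupInPlace and fixupViaLocal are hypothetical names.

```cpp
// Minimal 2D vector type for the illustration.
struct Vec2 {
    float x;
    float y;
};

// Shape A: overwrite the incoming position itself
// (analogous to modifying the [[position]] parameter).
Vec2 fixupInPlace(Vec2 position, Vec2 samplePos) {
    position.x += samplePos.x - 0.5f;  // move from pixel center to sample location
    position.y += samplePos.y - 0.5f;
    return position;
}

// Shape B: leave the incoming position untouched and
// store the adjusted value in a separate local variable.
Vec2 fixupViaLocal(const Vec2& position, Vec2 samplePos) {
    Vec2 fragCoord = position;  // local copy; the input stays read-only
    fragCoord.x += samplePos.x - 0.5f;
    fragCoord.y += samplePos.y - 0.5f;
    return fragCoord;
}
```

Both shapes compute the same coordinate, so the experiment is whether the compiler behaves differently when the input parameter is not written to.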

kondrak commented 3 years ago

I'll try to provide what you ask soon enough. Regarding the SPIRV-Cross question - how exactly would I go about verifying that? I'm not using it directly and in terms of code I haven't really analyzed it at all.