RobertBeckebans / RBDOOM-3-BFG

Doom 3 BFG Edition source port with updated DX12 / Vulkan renderer and modern game engine features
https://www.moddb.com/mods/rbdoom-3-bfg
GNU General Public License v3.0
1.38k stars 247 forks source link

Metal asserts on Apple M2 Pro with API validation turned on #824

Closed lowlevelsoul closed 3 months ago

lowlevelsoul commented 6 months ago

The Metal driver on my M2 MacBookPro constantly crashes with API validation when trying to clean up retired resources, in this case, a command buffer. Turning Metal API validation off silences the report and there appears to be no ill side-effects; I'm reporting this, just in case it's a symptom of a larger issue.

The Metal driver spits out the following information;

-[MTLDebugDevice notifyExternalReferencesNonZeroOnDealloc:]:3190: failed assertion `The following Metal object is being destroyed while still required to be alive by the command buffer 0x13dabcc00 (label: vkQueueSubmit MTLCommandBuffer on Queue 0-0): <MTLToolsObject: 0x600003885110> -> <AGXG14XFamilyBuffer: 0x12aa6b2e0> label = Argument buffer length = 24 cpuCacheMode = MTLCPUCacheModeDefaultCache storageMode = MTLStorageModeShared hazardTrackingMode = MTLHazardTrackingModeTracked resourceOptions = MTLResourceCPUCacheModeDefaultCache MTLResourceStorageModeShared MTLResourceHazardTrackingModeTracked
purgeableState = MTLPurgeableStateNonVolatile'

In Debug, this happens during map load of the first map. In release, it'll happen a bit later after the cut-scene has run, so it could be a genuine race condition between the GPU and CPU.

SRSaunders commented 6 months ago

This may indeed be a race condition between the release of Metal command buffers and the Metal argument buffer resources they refer to. Perhaps an issue inside MoltenVK or nvrhi - tbd.

To test this theory, please try the following changes. I may not want to make these permanent, as the sync issue may be elsewhere (or an artifact of the delays from Metal API validation itself), but let's start here first.

For debug builds make sure your Xcode scheme environment variable is set as follows:

MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS = 0

For release builds make the following change inside neo/sys/sdl/DeviceManager_VK.cpp:

Change:

// SRS - Turn MoltenVK's Metal argument buffer feature on for descriptor indexing only
if( mvkConfig.useMetalArgumentBuffers == MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS_NEVER )
{
    idLib::Printf( "Enabled MoltenVK's Metal argument buffers for descriptor indexing...\n" );
    mvkConfig.useMetalArgumentBuffers = MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS_DESCRIPTOR_INDEXING;
    vkSetMoltenVKConfigurationMVK( m_VulkanInstance, &mvkConfig, &mvkConfigSize );
}

To:

// SRS - Disable MoltenVK's Metal argument buffer feature
if( mvkConfig.useMetalArgumentBuffers != MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS_NEVER )
{
    idLib::Printf( "Disabled MoltenVK's Metal argument buffers...\n" );
    mvkConfig.useMetalArgumentBuffers = MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS_NEVER;
    vkSetMoltenVKConfigurationMVK( m_VulkanInstance, &mvkConfig, &mvkConfigSize );
}
lowlevelsoul commented 6 months ago

Both MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS = 0 and the other change seemed to solve the crash with API validation turned on. It's a shame to lose argument buffer usage though, so I dug a bit deeper (I've only just started looking into the code a day or so ago)

One of the things I did notice in whilst tracing the initial crash yesterday, was that when looking through the call stack, the code for determining when to retire a resource in vulkan-queue.cpp, is as follows;

if (cmd->submissionID <= lastFinishedID)

At the time, the comparison seemed a bit suspect to me, as I wondered if a resource could be garbage collected before the commands for the current frame have run or have completed. So I ran my own test (without your suggested changes ) by making the following change :-

if (cmd->submissionID < lastFinishedID)

A quick test shows that this also works ( with argument buffer usage enabled ) without ill effect. I no longer get the Metal driver assert with API validation turned on. I don't know how many buffers RBDOOM-3 can have in-flight, but this check should probably be something along the lines of :-

if (cmd->submissionID < lastFinishedID - MAX_BUFFERS_INFLIGHT )

The downside is that the renderer would keep a bit more memory around for a one or two frames, but it would be a sure-fire, brute-force way to ensure that the GPU is really finished with the resource before GC'ing it.

SRSaunders commented 6 months ago

First of all, thanks for testing quickly and confirming that MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS can influence the issue, and kudos for your extra research.

On the topic re Queue::retireCommandBuffers() - that vulkan routine is located within the external nvrhi library and affects all Vulkan platforms (Windows, Linux, macOS). I looked at the same <= comparison as well, but it should work as is. The lastFinishedID variable already takes into account the number of frames in flight and is set by querying a timeline semaphore that tracks vulkan command buffer completion. However, I am suspicious that it may not properly handle the case when external validation is happening in parallel (I.e. Xcode's Metal API validation tool) and the underlying Metal command buffer is not truly finished due to validation delays. I suppose it could be a MoltenVK issue that is setting Vulkan command buffer completion before the Metal command buffer is fully released (including by the Xcode validation tool), but that will require deeper research.

Setting the comparison to < versus <= is a brute force way of guaranteeing completion, but it would keep defunct frame resources around 1 frame longer. And I am not so anxious to make this custom change just for macOS/MoltenVK unless we can show something similar happening on other platforms. It would require a bunch of cross-platform testing. As I mention above, it may be an artifact of MoltenVK being a portability library and not a native implementation, coupled with extra processing and resource grabbing from Xcode's Metal API validation tool.

I could do something simple in the mean time:

For debug builds, I could set the default build settings in cmake-xcode-debug.sh to:

cmake -G Xcode -DCMAKE_BUILD_TYPE=Debug -DCMAKE_XCODE_GENERATE_SCHEME=ON -DCMAKE_XCODE_SCHEME_ENVIRONMENT="MVK_CONFIG_FULL_IMAGE_VIEW_SWIZZLE=1;MVK_CONFIG_SYNCHRONOUS_QUEUE_SUBMITS=0;MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS=0" -DCMAKE_XCODE_SCHEME_ENABLE_GPU_API_VALIDATION=ON ...

This would disable Metal argument buffers and enable Metal API validation for Xcode debug builds, but leave Metal argument buffers enabled for release builds where Metal API validation is turned off by default. Would this workaround be satisfactory until there is more time for research?

SRSaunders commented 6 months ago

I looked deeper into the problem and suspect the underlying issue may be with MoltenVK. I have raised the following issue https://github.com/KhronosGroup/MoltenVK/issues/2062 to track it.

If this is indeed the problem, then no changes should be required to either nvrhi or RBDoom3BFG. Let's wait and see.

SRSaunders commented 5 months ago

After discussion within https://github.com/KhronosGroup/MoltenVK/issues/2062, I have decided to close that issue. You can read the text, but I suspect a timing edge case on macOS/MoltenVK involving nvrhi's use of timeline semaphores for detecting command buffer completion. Enabling Metal's API validation likely reveals the edge case, but without it there are no ill effects in the application. Given this is common cross-platform Vulkan code I don't plan to change or fix in nvrhi or RBDoom3BFG.

Note the default cmake build configurations sets CMAKE_XCODE_SCHEME_ENABLE_GPU_API_VALIDATION=OFF so this issue should not be a problem for Xcode default builds.

If you really want to run with Metal API Validation enabled, you could try adding/setting the following environment variable MTL_DEBUG_LAYER_VALIDATE_UNRETAINED_RESOURCES=0 to your Xcode build scheme. This may or may not work so no guarantees. I recommend running with Metal API Validation off since RBDoom3BFG is a Vulkan app on macOS and it's up to MoltenVK to handle the Metal stuff. When developing the macOS port, I concentrated on eliminating Vulkan errors by using the Vulkan Validation Layer which you can enable by setting r_useValidationLayers=2 in your autoexec.cfg file.