KhronosGroup / MoltenVK

MoltenVK is a Vulkan Portability implementation. It layers a subset of the high-performance, industry-standard Vulkan graphics and compute API over Apple's Metal graphics framework, enabling Vulkan applications to run on macOS, iOS and tvOS.
Apache License 2.0

Frame rate performance is clamped by 30 / second on macOS (macBook Pro / i7) #581

Open HongkunWang opened 5 years ago

HongkunWang commented 5 years ago

I'm testing a few of my game prototypes on macOS and iOS and noticed that the frame rate on macOS cannot exceed 30 frames/second. On iOS, by contrast, I get a reasonable frame rate of ~40 to ~80 frames/second on different devices, such as the iPhone 6, 6 Plus, 8, and iPad (5th generation).

I tried different approaches in the following areas:

(1) The swapchain present mode. I tried the following modes:

VK_PRESENT_MODE_FIFO_KHR - the only mode universally supported
VK_PRESENT_MODE_MAILBOX_KHR - the lowest-latency non-tearing mode, may not be available on every platform
VK_PRESENT_MODE_IMMEDIATE_KHR - the lowest-latency tearing mode, may not be available on every platform

(2) The rendering synchronization. I have tried both vkWaitForFences(...) and vkQueueWaitIdle(...) for synchronization (the API documentation says they are equivalent to each other in some way).

None of the above attempts worked; the frame rate still cannot exceed 30. Is there a display setting I should change on my MacBook Pro? Does anyone have suggestions?

danginsburg commented 5 years ago

Does setting this change anything? https://github.com/KhronosGroup/MoltenVK/blob/fc1ce42a37acbf62e64793cec7bb5a5721d667ef/MoltenVK/MoltenVK/API/vk_mvk_moltenvk.h#L256

I've found that Metal on macOS limits the update rate to vsync, although if I recall correctly, it can be worked around in fullscreen with the direct presentation method. It's also possible that your synchronization method is inserting a GPU bubble. Generally, you never want to call vkQueueWaitIdle; you want to let the CPU run a bit ahead of the GPU. For example, we only do the equivalent of vkWaitForFences if we detect our render thread running more than a "max latency" number of frames ahead of the GPU.
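The "max latency" idea above can be sketched as a small counter-based throttle. This is only an illustration under assumptions: `FrameThrottle` and its method names are hypothetical, and the actual wait (for example, vkWaitForFences on the oldest in-flight frame's fence) is left to the caller.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decides when the CPU must stop and wait for the GPU.
// In a real renderer the wait itself would be vkWaitForFences() on the fence
// of the oldest frame still in flight; here it is only modeled with counters.
class FrameThrottle {
public:
    explicit FrameThrottle(std::uint64_t maxLatency) : maxLatency_(maxLatency) {
        assert(maxLatency > 0);
    }

    // Call after submitting a frame's command buffers to the queue.
    void onFrameSubmitted() { ++submitted_; }

    // Call once the GPU has finished a frame (its fence is signaled).
    void onFrameCompleted() {
        assert(completed_ < submitted_);
        ++completed_;
    }

    // Frames the CPU has submitted that the GPU has not yet finished.
    std::uint64_t framesInFlight() const { return submitted_ - completed_; }

    // True when starting another frame would put the CPU more than
    // maxLatency frames ahead of the GPU, so the caller should block first.
    bool mustWait() const { return framesInFlight() >= maxLatency_; }

private:
    std::uint64_t maxLatency_;
    std::uint64_t submitted_ = 0;
    std::uint64_t completed_ = 0;
};
```

A render loop would check `mustWait()` before recording each frame and block on a fence only in that case, instead of calling vkQueueWaitIdle every frame.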

danginsburg commented 5 years ago

Oh, another thing: make sure you are rendering on your discrete GPU on the laptop display as a first test.

HongkunWang commented 5 years ago

Thanks, I'll give it a try and report the result very soon. By the way, VkBool32 presentWithCommandBuffer is a member of struct MVKConfiguration. How can I get an instance of that struct? Is there a global instance?

As for synchronization, I did use vkWaitForFences() first, but it gave me a "screen-tearing" effect: the next frame's rendering corrupts the previous frame's, which means rendering is not fully completed before the next frame starts. I use the mechanism from the Vulkan sample code and cannot figure out the reason, so I switched to vkQueueWaitIdle(), which solved the problem. I know it may introduce a GPU bubble, but seeing that an iPhone 8 can reach 80 frames/s, I believe something else must be clamping the frame rate on the MacBook.

HongkunWang commented 5 years ago

I tried the following code:

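        MVKConfiguration mvkConfig;
        // The size argument is in/out: it must hold the struct size on entry.
        size_t configSize = sizeof(MVKConfiguration);
        vkGetMoltenVKConfigurationMVK(vulkanInst->vkInstance, &mvkConfig, &configSize);
        mvkConfig.presentWithCommandBuffer = false;
        // Write the change back; editing the local copy alone has no effect.
        vkSetMoltenVKConfigurationMVK(vulkanInst->vkInstance, &mvkConfig, &configSize);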

However, I got a link error saying that vkGetMoltenVKConfigurationMVK is an undefined symbol, and I cannot figure out why. I do not link against the MoltenVK static library; I use the dynamic library libMoltenVK.dylib, so I have to get the function pointer from that dylib manually, which takes more time.

However, I have confirmed that the frame rate clamping is due to display vsync: when I run my application on an external LCD monitor, the frame rate is clamped to 60/s instead!
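One possible reading of the 30 vs. 60 figures (an assumption, not something confirmed in this thread) is vsync quantization: if a frame can only be shown on a refresh boundary, a frame that takes slightly longer than one refresh interval is shown on the next one, halving the rate. A minimal sketch of that model, with the hypothetical helper `vsyncLimitedFps`:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical model: with vsync, a finished frame waits for the next
// refresh boundary, so the frame rate is the refresh rate divided by the
// whole number of refresh intervals each frame occupies.
double vsyncLimitedFps(double refreshHz, double frameTimeMs) {
    assert(refreshHz > 0.0 && frameTimeMs > 0.0);
    const double intervalMs = 1000.0 / refreshHz;
    const double intervalsPerFrame = std::ceil(frameTimeMs / intervalMs);
    return refreshHz / intervalsPerFrame;
}
```

Under this model, on a 60 Hz display a 10 ms frame runs at 60 frames/s, while a 20 ms frame drops to 30 frames/s rather than 50.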

HongkunWang commented 5 years ago

Hi, I got mvkConfig.presentWithCommandBuffer = false; working, but it has no effect. The frame rate is still clamped to either 30 or 60, depending on whether the macOS application runs on the built-in display or the external display.

KenThomases commented 5 years ago

Could your Metal view/layer be compositing with other windows or views/layers? Does it have anything in front of it (even shadows from windows or the Dock nearby)? Is it transparent, such that stuff behind can show through? Are you overriding your view to return true from isOpaque? Have you set the opaque (Objective-C) or isOpaque (Swift) property on the Metal layer, if you're creating it yourself?

aerofly commented 5 years ago

Hello there, one thing to keep in mind on macOS is that if you use CVDisplayLink for rendering, you already have automatic vsync enabled; at least, that was our finding. For some reason, this interfered with VK_PRESENT_MODE_FIFO_KHR in certain cases: we saw tearing even though we had a stable frame rate!?

So instead of creating a CVDisplayLink, we create a dedicated rendering thread that continuously calls our rendering function. If VK_PRESENT_MODE_FIFO_KHR is enabled, we get synchronization to vsync in fullscreen mode; if not, we get an unlimited frame rate (usually for testing purposes). One limitation of this method is that VK_PRESENT_MODE_FIFO_KHR does not sync to vsync in windowed mode.
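A dedicated rendering thread of the kind described above might look roughly like the sketch below. `RenderThread` is a hypothetical name and the callback stands in for the application's own rendering function; this is not the commenter's actual code.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical wrapper: calls a render callback in a tight loop on its own
// thread until stopped. Any pacing comes from the callback itself (for
// example, blocking in acquire/present under FIFO), not from this class.
class RenderThread {
public:
    explicit RenderThread(std::function<void()> renderFrame)
        : renderFrame_(std::move(renderFrame)) {
        assert(renderFrame_);
        // Started in the body so all other members are initialized first.
        thread_ = std::thread([this] {
            while (running_.load(std::memory_order_acquire)) {
                renderFrame_();
            }
        });
    }

    ~RenderThread() { stop(); }
    RenderThread(const RenderThread&) = delete;
    RenderThread& operator=(const RenderThread&) = delete;

    // Asks the loop to exit after the current frame and waits for it.
    void stop() {
        running_.store(false, std::memory_order_release);
        if (thread_.joinable()) thread_.join();
    }

private:
    std::function<void()> renderFrame_;
    std::atomic<bool> running_{true};
    std::thread thread_;
};
```

The application would construct one `RenderThread` after its swapchain is ready and call `stop()` (or let it go out of scope) before destroying Vulkan objects the callback uses.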

It was also our observation that we get best performance with

    mvkConfig.synchronousQueueSubmits     = false;
    mvkConfig.presentWithCommandBuffer    = false;

Any value set to true had a negative effect.

HongkunWang commented 5 years ago

Thanks for all your helpful suggestions! I have to say that I do not have a thorough understanding of macOS display / layer / animation, but I'll read the related docs, modify my code, and let you know the result. As for animation, I use an old-style run loop timer (CFRunLoopTimerCreate()) to drive it, since I think a run loop timer can be more lightweight than CVDisplayLink.

One more question: I noticed that even when I build my app with a Release build of MoltenVK, Metal API Validation and Metal GPU Frame Capture are still enabled. I'd like to know how to disable them for better performance. Does anyone have a quick answer? Thanks!

KenThomases commented 5 years ago

Those diagnostic features are enabled by Xcode based on the settings of the Run scheme. You can change those settings using Product > Scheme > Edit Scheme. They are only enabled if you run your app from Xcode. They are not built into your app; they come from the environment that Xcode establishes for running your app. If you (or anybody else) run your app outside of Xcode, they won't be enabled.

HongkunWang commented 5 years ago

So that means there is no performance concern for the final product. :-) Thanks Ken!

cjay commented 5 years ago

I'm not sure if this is related, but vkGetPhysicalDeviceSurfacePresentModesKHR lists only VK_PRESENT_MODE_FIFO_KHR and VK_PRESENT_MODE_IMMEDIATE_KHR in my code, which is derived from vulkan-tutorial.com. Is VK_PRESENT_MODE_MAILBOX_KHR not supported by MoltenVK in general? It probably fell back to a default when the OP tried to use VK_PRESENT_MODE_MAILBOX_KHR.

billhollings commented 5 years ago

@cjay

VK_PRESENT_MODE_MAILBOX_KHR is not supported on MoltenVK, because the internal single-entry queue required to support that capability is not available through Metal.
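Since the supported modes vary by platform, an application can check the list returned by vkGetPhysicalDeviceSurfacePresentModesKHR and fall back explicitly. The sketch below shows only the selection logic; `PresentMode` is a stand-in for VkPresentModeKHR so the example compiles without Vulkan headers, and `choosePresentMode` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for VkPresentModeKHR; the numeric values match the Vulkan enum.
enum class PresentMode { Immediate = 0, Mailbox = 1, Fifo = 2 };

// Returns the first preferred mode that the surface reports as supported.
// Falls back to FIFO, the only mode every implementation must support.
PresentMode choosePresentMode(const std::vector<PresentMode>& supported,
                              const std::vector<PresentMode>& preferred) {
    // A valid surface reports at least one present mode.
    assert(!supported.empty());
    for (PresentMode want : preferred) {
        if (std::find(supported.begin(), supported.end(), want) != supported.end()) {
            return want;
        }
    }
    return PresentMode::Fifo;
}
```

With the two modes reported in this thread (FIFO and IMMEDIATE), a preference order of MAILBOX then IMMEDIATE would select IMMEDIATE, and a preference for MAILBOX alone would select FIFO.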

FunMiles commented 4 years ago

I am having a performance issue in a test I wrote for the Vookoo framework, and I think it is related to this. My test can be found at https://github.com/FunMiles/Vookoo/tree/lock_guard_queues (example parallelTriangles) for those who would like to look into it. I added some timings which show that submission to command queues takes about 16 ms when there is more than one window (and hence more than one Vulkan surface), which matches the refresh interval of my monitor. If only one thread/window exists, the timing is about 0.05 ms. To me, this indicates that the command buffer is being held back by the frame presentation. I would like to set the flags

     mvkConfig.synchronousQueueSubmits  = false;
     mvkConfig.presentWithCommandBuffer = false;

I wrote some rather awkward code to retrieve the two routines vkGetMoltenVKConfigurationMVK and vkSetMoltenVKConfigurationMVK. It uses the environment variable to locate the MoltenVK_icd.json file and then uses the content of that file to locate the library. I can retrieve the symbols, and they come out non-null, but when I try to call them, I get a crash immediately, before even entering the routines. Do the symbols not point to a function? Could someone point me to sample code that does this correctly with the current version of MoltenVK? The one piece of code I found that calls these functions seems to use a different API.

billhollings commented 4 years ago

@FunMiles

If you are using the Vulkan Loader and Layers from the Vulkan SDK, you will not be able to call vkGetMoltenVKConfigurationMVK() and vkSetMoltenVKConfigurationMVK(), because the Vulkan Loader and Layers from the Vulkan SDK do not support or understand those calls. The MoltenVK_Runtime_UserGuide.md mentions this, but perhaps not clearly enough.

An alternative to setting synchronousQueueSubmits with the configuration API is to set the equivalent environment variable, MVK_CONFIG_SYNCHRONOUS_QUEUE_SUBMITS within your app. See the documentation in the vk_mvk_moltenvk.h file for more on this.
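Setting the variable from within the app could look like the sketch below. It assumes that the value "0" is read as false and that the call happens before MoltenVK reads its configuration (for example, before the Vulkan instance and device are created); check vk_mvk_moltenvk.h for the exact rules. The function name is hypothetical.

```cpp
#include <cassert>
#include <stdlib.h>

// Hypothetical helper: sets the MoltenVK environment variable named in the
// thread for the current process. Assumptions: "0" means false, and this
// runs before MoltenVK reads its configuration.
bool disableSynchronousQueueSubmits() {
    // setenv is POSIX (available on macOS and iOS). The final argument 1
    // overwrites any existing value; the call returns 0 on success.
    return setenv("MVK_CONFIG_SYNCHRONOUS_QUEUE_SUBMITS", "0", 1) == 0;
}
```

Because the variable is set on the process itself, this works regardless of whether the app is launched from Xcode, a shell, or Finder.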

The presentWithCommandBuffer config setting is now obsolete and does nothing. All surface presentations are now performed with a MTLCommandBuffer that is created when the vkQueuePresentKHR() command is submitted to the queue.

FunMiles commented 4 years ago

@billhollings Thanks for the info. I will look at vk_mvk_moltenvk.h for more details. I had read the user guide and seen the comment. The way I interpreted it was that you could not get access to those routines directly and thus had to go the way @HongkunWang did. Hence my question.

Now, your last comment seems to indicate that I cannot influence the presentation methodology? What is the proper way to send commands and present the results from multiple threads to multiple VkSurfaceKHR objects? If only one window at a time can have a 60 Hz refresh, it is going to frustrate a lot of development on my side.

PS: After setting the environment variable to false, my frame rate didn't improve. I get 30 Hz with two windows open and 60 Hz with one. Should presentation for all open windows be sent from a separate thread to a separate queue? Would that help?

In addition, I keep getting validation layer messages of the following type. Note that every acquired image is sent for presentation immediately after the drawing commands have been submitted:

00000008 debugCallback: [ VUID-vkAcquireNextImageKHR-swapchain-01802 ] Object: 0x1c000000001c (Type = 27) | vkAcquireNextImageKHR: Application has already previously acquired 3 images from swapchain. Only 2 are available to be acquired using a timeout of UINT64_MAX (given the swapchain has 3, and VkSurfaceCapabilitiesKHR::minImageCount is 2). The Vulkan spec states: If the number of currently acquired images is greater than the difference between the number of images in swapchain and the value of VkSurfaceCapabilitiesKHR::minImageCount as returned by a call to vkGetPhysicalDeviceSurfaceCapabilities2KHR with the surface used to create swapchain, timeout must not be UINT64_MAX (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkAcquireNextImageKHR-swapchain-01802)
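The rule quoted in that message can be written as a one-line check. This sketch only restates the spec text in the message; `infiniteTimeoutAllowed` and its parameter names are hypothetical, and an application would need to track its own count of acquired-but-not-yet-presented images.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the rule quoted in the validation message: if the number of
// currently acquired images is greater than (images in swapchain -
// minImageCount), the acquire timeout must not be UINT64_MAX.
bool infiniteTimeoutAllowed(std::uint32_t acquiredCount,
                            std::uint32_t swapchainImageCount,
                            std::uint32_t minImageCount) {
    assert(swapchainImageCount >= minImageCount);
    return acquiredCount <= swapchainImageCount - minImageCount;
}
```

For the case in the message (3 swapchain images, minImageCount of 2), an infinite timeout is allowed only while at most 1 image is already acquired, so at most 2 images can be held at once this way.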