Add "previous" member to MVKPerformanceTracker structure

SRSaunders commented 3 months ago

Addresses #2177.

This small PR adds a previous member to the MVKPerformanceTracker struct and saves the former latest value before updating latest within updateActivityPerformance(). It also adds previous to MVKLogInfo.

Note that I did not change the MVK_PRIVATE_API_VERSION.

I still have a question about how to determine version information at runtime. Now that vkGetVersionStringsMVK() is deprecated, what is the offical means of determining MoltenVK's version or the MVKPerformanceStatistics version at runtime? This is to handle cases when dynamic linking is used and one needs to determine if the new field is present.

Is the only mechanism to use the VK_SUCCESS or VK_INCOMPLETE return code from vkGetPerformanceStatisticsMVK()? This approach seems limiting since all you get is a success or failure indication and not version information. For instance if you always want to use latest, and optionally use previous (if available), how would you go about determing that at runtime?

SRSaunders commented 3 months ago

Thanks for the hints re VkPhysicalDeviceProperties::driverVersion and VkLayerProperties::implementationVersion for retrieving the MoltenVK version at runtime.

However, after thinking about this problem a bit more I will be reverting back to the simple approach of guarding the new code at compile time using MVK_MAKE_VERSION(1, 2, 9), and checking for VK_SUCCESS returned from vkGetPerformanceStatisticsMVK() at runtime. My logic is since the perf structure is not a linear list, the runtime and compile time ABI versions (i.e. struct sizes) have to match exactly to get the correct offsets into the returned memory blob. Even if you are dynamically linking to a later version of MoltenVK (i.e. with previous defined), the struct sizes must still match exactly for the data to make sense. Rather than carry around multiple version copies of the performance struct in my code (with different versioned offsets), I plan to simply require a match. This is the downside of using a nested struct vs. a linear list, but that's a more general issue than I want to tackle here.

FYI, I looked at mvkCopyGrowingStruct() to confirm how it behaves, and unfortunately I think I have found an error present in the code:

The current logic is: return (*pCopySize == origSize) ? VK_SUCCESS : VK_INCOMPLETE; Shouldn't it instead be: return (sizeof(S) == origSize) ? VK_SUCCESS : VK_INCOMPLETE;

The original logic might be okay for linearly growing lists, but is not okay for nested structures like performance data. The only concern I have is this function is also used for MVKConfiguration and MVKPhysicalDeviceMetalFeatures which I think are both linear. While the current behaviour is different from the function's doc statement (_Returns VK_SUCCESS if the original value of *pCopySize is the same as the actual size of the struct, or VKINCOMPLETE otherwise), I don't want to break something unintentionally. Please advise.

If you agree with this change, I will submit commits for this as well as bumping MVK_PRIVATE_API_VERSION.

SRSaunders commented 3 months ago

I decided to make the proposed changes above and @billhollings can review and comment as needed.

FYI - I have one more thing I would potentially like to add to this - a new MVKQueuePerformance entry that tracks async encoding start delay if the thread is already busy encoding the previous frame. I need to research this a bit more before confirming and will come back with an answer asap. So no rush on the merge.

SRSaunders commented 3 months ago

I have completed my research and have submitted an additional commit that proposes and implements two new performance counters in MVKQueuePerformance for measuring asynchronous queue submit wait times: waitSubmitCommandBuffers, and waitPresentSwapchains. These new performance counters are useful when MVK_CONFIG_SYNCHRONOUS_QUEUE_SUBMITS is disabled and queue submits are asynchronous, since there are delays between the initial calls to vkQueueSubmit()/vkQueuePresentKHR() and the actual start of operations.

For instance, waitSubmitCommandBuffers allows you to determine the delay time before the start of command buffer encoding if the thread is already busy encoding the previous frame. And waitPresentSwapchains allows you to predict the delay between calling vkQueuePresentKHR() and the actual swapchain image showing up on the screen - e.g. waitPresentSwapchains + presentSwapchains.

Here are screen grabs of the embedded Optick profiler running inside RBDoom3BFG with MoltenVK. You can see the Submit_Wait times (corresponds to waitSubmitCommandBuffers), Acquire_Wait times (corresponds to retrieveCAMetalDrawable), and Present_Wait times (corresponds to waitPresentSwapchains) and their relationship to command buffer encoding and execution on the GPU.

I have now completed my work on this PR and it is ready for review.

Screenshot 2024-03-18 at 10 43 39 PM

Screenshot 2024-03-18 at 11 51 01 PM

KhronosGroup / MoltenVK

Add "previous" member to MVKPerformanceTracker structure #2183