KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html

Increased CPU consumption over time #8934

Open fridenmf opened 2 days ago

fridenmf commented 2 days ago

Environment:

Describe the Issue

The validation layers consume more CPU time the longer an application runs if the application frequently creates and destroys resources. During the first few seconds the validation layers take less than 1 ms of CPU time, but after about 40 seconds they take about 8 ms, and the cost keeps increasing the longer the application runs. An application that starts at 200 FPS drops to about 40 FPS after about 2 minutes.

There is a video under Additional context demonstrating this, as well as Visual Studio performance profiles comparing the time spent in the validation layers with a "dummy 4 ms function". In the first seconds of the application the dummy function takes more time than the validation layers, but 40 seconds later it takes only a fraction of the time the validation layers do.

Expected behavior

The validation performance should not get worse over time if the total number of resources used in a frame does not increase over time.

Additional context

Video:
![vvl_performance_loss](https://github.com/user-attachments/assets/4eb41ade-90eb-4b40-a432-8abe5b54c217)

Visual Studio performance profile after 8 seconds:
![after_8_seconds](https://github.com/user-attachments/assets/b9257718-e859-4735-8575-65668a76661e)

Visual Studio performance profile after 40 seconds:
![after_40_seconds](https://github.com/user-attachments/assets/5fafb363-d16c-49c0-8d83-d06ca490224f)

spencer-lunarg commented 2 days ago

Thanks for bringing this to our attention. We will look into this locally to see if we can reproduce it, and then hopefully track down what is going on.

arno-lunarg commented 2 days ago

Would you be able to provide an executable so we can have a proper look at how VVL behaves with your application please?

artem-lunarg commented 2 days ago

@fridenmf Is synchronization validation enabled? If yes, does it behave the same if only standard validation is enabled?

fridenmf commented 2 days ago

@arno-lunarg Sure! The following app can load the validation layers itself if the "validate" checkbox is ticked. Leave it unticked if you reproduce it with vkconfig.exe instead.

TODO.zip

@artem-lunarg I only seem to get the slowdown when "Synchronization -> Submit time validation" is on.

artem-lunarg commented 2 days ago

@fridenmf I guess the application uses timeline semaphores. If so, does it only slow down the app, or is there also a significant increase in memory usage (as reported by the OS tools)?

One more question, are there synchronization validation errors? (this can increase resource usage if we made a mistake)

fridenmf commented 2 days ago

@artem-lunarg It does not use timeline semaphores. It's mostly Vulkan 1.0, except for using VK_KHR_get_physical_device_properties2 to display driver properties. I've measured memory and I don't see any signs of memory leaks. There are no validation errors or warnings except for a BestPractices-vkCreateSwapchainKHR-suboptimal-swapchain-image-count, due to me using two swapchain images instead of the recommended three.

fridenmf commented 2 days ago

When I implement resource reuse for all resources created each frame, I don't see the performance slowdown. A binary search led me to the VkImage that is created every frame with usage VK_IMAGE_USAGE_TRANSFER_SRC_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT and sample count VK_SAMPLE_COUNT_1_BIT. When it's reused there is no slowdown in the VVL; when it's re-created every frame there is a slowdown in the VVL. That should narrow down the search.
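For readers following along, the two patterns compared above can be sketched roughly as below. This is a minimal illustration, not the application's code: `FakeDevice`, `ImageHandle`, `RunRecreateEveryFrame`, and `RunReuse` are hypothetical stand-ins that only count calls, where a real application would call `vkCreateImage` and `vkDestroyImage`.

```cpp
#include <cstdint>

// Hypothetical stand-in for a VkImage handle.
using ImageHandle = std::uint64_t;

// Hypothetical stand-in for a device; it only counts create/destroy calls.
struct FakeDevice {
    std::uint64_t next_handle = 1;
    std::uint64_t creates = 0;
    std::uint64_t destroys = 0;

    ImageHandle CreateImage() {
        ++creates;
        return next_handle++;  // every created image gets a fresh handle
    }
    void DestroyImage(ImageHandle /*image*/) { ++destroys; }
};

// Pattern that showed the slowdown: a new image is created and destroyed every frame.
void RunRecreateEveryFrame(FakeDevice& device, int frames) {
    for (int frame = 0; frame < frames; ++frame) {
        ImageHandle image = device.CreateImage();
        // ... record and submit work that uses the image ...
        device.DestroyImage(image);
    }
}

// Pattern without the slowdown: one image is created up front and reused.
void RunReuse(FakeDevice& device, int frames) {
    ImageHandle image = device.CreateImage();
    for (int frame = 0; frame < frames; ++frame) {
        // ... record and submit work that uses the same image ...
        (void)image;
    }
    device.DestroyImage(image);
}
```

With 100 frames, the first pattern produces 100 distinct image handles for a validation layer to track, while the second produces one.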

artem-lunarg commented 22 hours ago

The issue is reproducible, and after checking older SDKs it seems we have always had it (or at least for quite some time).

artem-lunarg commented 13 hours ago

I think I have a solution. I need to polish it a bit, hopefully by next Monday, but the app framerate is stable now.

Some details for documentation purposes.

It's related to how the app does synchronization using a swapchain acquire fence. The last time, a fix was needed for Core Validation (https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/8880), which supported this sync mechanism, so that was a bug fix. It turns out syncval also needs adjustments, but for different reasons.

The problem was that the data structures that track memory accesses did not release entries related to deleted images. Usually old entries are cleaned up when synchronization indicates that they are no longer needed, for example when QueueSubmit uses a fence or another common sync mechanism, but acquire fence synchronization needs additional support in syncval. It does not guarantee that memory operations have finished (unless that is enforced by other synchronization means) and establishes only an execution dependency. For this case the solution is to clean up resource access entries when the resource is deleted (which sounds logical, we just didn't need this previously).
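The effect described above can be modelled with a toy tracker. This is only a sketch under stated assumptions, not syncval's actual data structures or its fix: `AccessTracker`, `AccessEntry`, `ImageHandle`, and `SimulateFrames` are hypothetical names, and the `erase_on_destroy` flag stands in for "clean up resource access entries when the resource is deleted".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a Vulkan image handle; not a VVL type.
using ImageHandle = std::uint64_t;

// Hypothetical record of one tracked memory access.
struct AccessEntry {
    std::uint64_t frame;  // frame in which the access was recorded
};

// Toy tracker that maps each image to the accesses recorded against it.
class AccessTracker {
  public:
    explicit AccessTracker(bool erase_on_destroy) : erase_on_destroy_(erase_on_destroy) {}

    void RecordAccess(ImageHandle image, std::uint64_t frame) {
        assert(image != 0);  // 0 models a null handle in this sketch
        entries_[image].push_back(AccessEntry{frame});
    }

    // Called when the application destroys the image.
    void OnImageDestroyed(ImageHandle image) {
        if (erase_on_destroy_) {
            entries_.erase(image);  // drop all entries for the deleted image
        }
        // Otherwise the entries stay until some other sync event removes them,
        // which never happens in this model.
    }

    std::size_t TrackedImageCount() const { return entries_.size(); }

  private:
    bool erase_on_destroy_;
    std::unordered_map<ImageHandle, std::vector<AccessEntry>> entries_;
};

// Simulates an application that creates, uses, and destroys one image per frame,
// and returns how many images the tracker still holds entries for afterwards.
std::size_t SimulateFrames(bool erase_on_destroy, std::uint64_t frame_count) {
    AccessTracker tracker(erase_on_destroy);
    for (std::uint64_t frame = 0; frame < frame_count; ++frame) {
        const ImageHandle image = frame + 1;  // fresh handle every frame
        tracker.RecordAccess(image, frame);
        tracker.OnImageDestroyed(image);
    }
    return tracker.TrackedImageCount();
}
```

In this model, `SimulateFrames(false, n)` leaves entries for all `n` destroyed images, so any per-frame work that walks the tracked entries grows with run time, while `SimulateFrames(true, n)` leaves none.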