GPUOpen-LibrariesAndSDKs / VulkanMemoryAllocator

Easy to integrate Vulkan memory allocation library
MIT License
2.62k stars · 358 forks

vmaCreateBuffer is prohibitively slow when bufferImageGranularity > 1 #178

Closed prideout closed 2 years ago

prideout commented 3 years ago

On devices that have bufferImageGranularity set to 1, we observe that vmaCreateBuffer can be two orders of magnitude faster than with devices that have bufferImageGranularity set to a higher value. In one extreme example, we have an asset that loads in less than 1 second on a device with bufferImageGranularity=1 and loads in 1.5 minutes on a device with bufferImageGranularity=4096.

Capturing a trace of the CPU work reveals that the CheckAllocation function is called repeatedly in the slow case.

Can this slow path be avoided by tweaking some of the configuration, or using a custom pool?

prideout commented 3 years ago

p.s. I'm able to get past this using VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT.

Since all VkBuffer objects are considered "linear" according to the Vulkan spec, I wonder if VMA should automatically use a separate pool for the vmaCreateBuffer path.

adam-sawicki-a commented 3 years ago

Yes, VMA checks surrounding allocations to ensure bufferImageGranularity is respected while still keeping all types of resources together. This is a tradeoff for lower memory consumption over performance.

You probably hit some very bad case. Can you please share a CSV "recording" (there is a feature in VMA to generate this) so I can analyze it?

Please check you use the latest version of the library from master branch, as the performance in this regard has been greatly improved by contribution from @kd-11, #163, merged 2021-02-16.

A good workaround is to create custom pool for linear resources, another for OPTIMAL images to keep them separate and use VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT on both.
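For example, a minimal sketch of that workaround (assumes an existing VmaAllocator `allocator`; the buffer parameters are illustrative, error handling is elided, and a second pool for OPTIMAL images would be created the same way using vmaFindMemoryTypeIndexForImageInfo):

```cpp
// Describe a representative buffer to find a suitable memory type.
VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bufCreateInfo.size = 65536;
bufCreateInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;

VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_GPU_ONLY;

uint32_t memTypeIndex = UINT32_MAX;
vmaFindMemoryTypeIndexForBufferInfo(allocator, &bufCreateInfo, &allocCreateInfo, &memTypeIndex);

// Create a pool that will only ever hold linear resources, so the
// bufferImageGranularity check can be safely skipped.
VmaPoolCreateInfo poolCreateInfo = {};
poolCreateInfo.memoryTypeIndex = memTypeIndex;
poolCreateInfo.flags = VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT;

VmaPool linearPool = VK_NULL_HANDLE;
vmaCreatePool(allocator, &poolCreateInfo, &linearPool);

// Route all buffer allocations through the pool.
VkBuffer buf;
VmaAllocation alloc;
allocCreateInfo.pool = linearPool;
vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buf, &alloc, nullptr);
```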

I have a plan to introduce some optimization to this, to automatically keep linear resources and OPTIMAL images separate, like DX12 does with RESOURCE_HEAP_TIER_1. However, this is not that easy as the user can request memory by calling vkAllocateMemory without telling VMA what will be bound there. One idea is to continue checking neighboring allocations, but try to keep linear resources in other memory blocks than OPTIMAL images when there are many memory blocks already. What do you think?

prideout commented 3 years ago

Thanks for the response! I've updated our copy of VMA to today's master branch, and the performance issue is still there.

I'm attempting to capture a CSV file for you; this issue only occurs on Android devices, so it's a bit tricky.

Your plan sounds good. One comment: we never call vkAllocateMemory on our own; we trust VMA to do this for us everywhere. I suspect that most clients are similar in that they either use VMA everywhere or use it nowhere. Maybe this could simplify your plans.

prideout commented 3 years ago

Doh, I lied. We do call vkAllocateMemory directly in a few spots. (e.g. when allocating swap chain images)

kd-11 commented 3 years ago

Hi @prideout. The fix I provided only really helps when the size of the requested memory is aligned to bufferImageGranularity, which triggers the exemption path every time. Of course this can waste memory if you're allocating only a few bytes at a time, but otherwise it works very well, with performance equal to the bufferImageGranularity=1 case. It also doesn't violate the spec for mixed usage, so it offers the best of both.

prideout commented 3 years ago

Quick update on CSV generation on Android: I have VMA_RECORDING_ENABLED set to 1, and I'm doing:

    const VmaRecordSettings csvSettings { .pFilePath = "car.csv" };
    const VmaAllocatorCreateInfo allocatorInfo {
        .physicalDevice = physicalDevice,
        .device = device,
        .pVulkanFunctions = &funcs,
        .pRecordSettings = &csvSettings,
        .instance = instance
    };

This fails with VK_ERROR_INITIALIZATION_FAILED.

prideout commented 3 years ago

Ah, the fopen in VmaRecorder::Init is failing on Android. Let me add the permission to my manifest to see if that fixes it.

prideout commented 3 years ago

Adding file permission to the manifest wasn't sufficient so I've attempted to simulate the issue using Filament's macOS build and setting VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY.
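For anyone reproducing this: the override macro has to be defined before the VMA implementation include, something like:

```cpp
// Simulate a device with bufferImageGranularity = 4096 on a platform where it is 1.
// This must appear in the translation unit that defines VMA_IMPLEMENTATION,
// before vk_mem_alloc.h is included.
#define VMA_DEBUG_MIN_BUFFER_IMAGE_GRANULARITY 4096
#define VMA_IMPLEMENTATION
#include "vk_mem_alloc.h"
```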

Two CSVs attached: one with and one without the fake granularity setting.

(It's actually not difficult to repro on an actual Android device, if y'all are interested I can give repro steps.)

car_good.csv

car_bad.csv

adam-sawicki-a commented 3 years ago

Thank you very much. I will investigate this issue further as soon as I find some time. For now I recommend the workaround with using separate custom pools with VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT flag.

Another idea is to define VMA_MIN_ALIGNMENT to the value equal to the bufferImageGranularity of your platform. Would it help, or would it waste too much memory in between small allocations?
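Concretely, that would mean defining the macro before the implementation include, e.g. (4096 matching the device granularity mentioned above):

```cpp
// Force every allocation to be aligned to at least the device's
// bufferImageGranularity, trading some wasted memory between small
// allocations for fewer granularity conflicts.
#define VMA_MIN_ALIGNMENT 4096
#define VMA_IMPLEMENTATION
#include "vk_mem_alloc.h"
```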

prideout commented 3 years ago

Yes, the custom pool solution is working well for now. I think using min_alignment would cause a lot of waste for this model since there are many small allocations and the device's granularity is 4096.

adam-sawicki-a commented 3 years ago

I created a prototype of an optimization that places buffers <= 4 KB in separate memory blocks with the bufferImageGranularity check disabled. Can you please check whether it solves the problem for you? You can find it on a branch: https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator/tree/feature-small-buffers Please disable your custom pools and see if allocations are faster now. The implementation is not finished, but it should already work. Any feedback is welcome.

adam-sawicki-a commented 3 years ago

Hello, did you have a chance to try this solution? I am willing to push it to the master branch if it helps the performance in your case.

prideout commented 3 years ago

Thanks for looking into this! Unfortunately I have not had a chance to try this out, I can try it out next time I'm looking at performance on a Mali-based device.

RandomShaper commented 3 years ago

I've tested this in the Godot engine and there's a clear improvement: https://github.com/godotengine/godot/pull/51524

prideout commented 3 years ago

Thanks for trying this out @RandomShaper. I am swamped and I don't think I'll have time to try this out any time soon.

RandomShaper commented 2 years ago

I've done some additional research and brought a new approach to the same optimizations here in https://github.com/godotengine/godot/pull/57989.

TL;DR: I've used the latest vk_mem_alloc.h with a patch so that non-default pools can hold mixed memory types, just like the default one. That lets user code use a separate pool for small objects without having to care about memory types (no need to create as many pools as there are memory-flag combinations, which are also hard to enumerate). I also note there that, if the maintainers here find such a thing useful, I can make a flexible version in which a pool being "universal" is opt-in via a flag, keeping the current behavior of memory-type-specific pools as the default.

adam-sawicki-a commented 2 years ago

Thank you for this research. We are willing to solve this problem of performant handling of many small buffers with large buffer-image granularity in the upcoming major release.

VMA now features a new TLSF allocation algorithm that also handles buffer-image granularity more efficiently. We need to implement defragmentation for this algorithm before we can make it the default. However, if you don't use the defragmentation feature, you can test VMA with TLSF as the default algorithm using a separate branch I just created: https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator/tree/featrure-TLSF-default

Can you please test it and see if it works for you? I hope it will make these performance problems go away.

RandomShaper commented 2 years ago

Thank you very much.

Unfortunately, benchmarking has shown that, for us, TLSF is slower than linear + separate pool for small allocations. See https://github.com/godotengine/godot/pull/58137.

adam-sawicki-a commented 2 years ago

The linear algorithm is obviously the fastest, but because it cannot reuse free space between allocations, it cannot be used in the general case. Using custom pools for special purposes and custom needs is still recommended.

I am glad to hear that you were able to use VMA with the default switched to our new TLSF algorithm and it worked correctly.

Can you please test if switching to this new branch with TLSF algorithm works faster in your app than just using the default, without the custom pool with linear algorithm for small buffers, on platforms with high buffer-image granularity? That would be very useful for us.

RandomShaper commented 2 years ago

Sorry for the late reply.

This is the result of the benchmark I've been using. It creates many different kinds of objects, which involves exercising VMA in many cases. No custom pools at all this time. Buffer image granularity is 1024 in my case.

| VMA master | VMA feature-TLSF-default |
| --- | --- |
| 6039 ms | 6194 ms |

Did I understand correctly what you asked for? If not, please let me know and I'll benchmark again.

P. S.: Taking the chance to thank you for your great articles on Vulkan, which have been of great help.

adam-sawicki-a commented 2 years ago

Thank you for the results. This is interesting. So TLSF is working even slower than the default in this test? Can you please share some more details about the test you executed or a piece of code?

RandomShaper commented 2 years ago

Some time ago a user reported that instantiation of objects in the still-under-development future version of the Godot game engine was slower than expected. During the investigation it was found that VMA's experimental branch (feature-small-buffers) improved performance in the benchmark that was designed for this issue.

Since it's clear that VMA plays a big role in the way the benchmark script exercises the engine, I've based all the results I've reported in this thread on it.

The benchmarking consists of running a script (ClassProfiling4.zip) in the Godot editor (I ran it on Windows). You get the total run time both in the console window and in the generated results.txt file.

If you would like pre-built binaries from our CI (debug and/or release) and/or further instructions so you can investigate with little hassle, just let me know.

adam-sawicki-a commented 2 years ago

OK, I see. So you used the LINEAR allocator for small buffers, and after reverting to putting them all in the default heaps and switching to the TLSF algorithm, the test takes only +2.57% time? This is actually good news and matches our internal tests.

I recommend doing it this way, as the LINEAR allocator is not able to free memory in between allocations, so its memory usage can grow indefinitely. It should not be used for allocations that are created and destroyed in random order over a long period of time.

adam-sawicki-a commented 2 years ago

With new commit 88510e9 we switched to TLSF as the default algorithm. Can you please check if it works fine in your project?

RandomShaper commented 2 years ago

I've smoke tested and benchmarked it (PR here: https://github.com/godotengine/godot/pull/58491).

The benchmark now takes 6778 ms on average. If I patch VMA to create the default pool with the linear algorithm, it's 6071 ms. Therefore, the difference is ~10% this time. Is that within expectations?

adam-sawicki-a commented 2 years ago

That's possible. Linear will always be the fastest, as it is quite a naïve algorithm, but as I said above, it is not suitable for random allocations. Can you do the following experiment: use the custom pool as you do right now, but compare the default algorithm (which is TLSF in the latest code) versus the linear algorithm, measuring time as well as memory consumption?

RandomShaper commented 2 years ago

I just tested. It's now 7.5% slower (instead of ~10%).

Our code now lazily creates pools for small objects. Since VMA doesn't support custom pools covering every memory type out of the box, when a buffer or texture < 4 KiB needs to be allocated, the code finds the memory type index for that allocation and creates a pool set up for that memory type if one hasn't been created yet.

Does this way of approaching it make sense? Having "universal pools" in VMA (pools that support a mix of memory types, like I did a few iterations ago) would make this easier, though.

adam-sawicki-a commented 2 years ago

Yes, it might be a good approach, as long as you don't use the linear algorithm for those pools, because as I said before, this algorithm is not good for random allocations; its size may grow indefinitely.