PKFX rendering tasks triggers assert for misaligned memory address

akioCL commented 1 year ago

When copying vertex buffer data (like for billboards) the PKFX rendering tasks expect the destination address to be aligned to 16 byte. The destination address is the result of mapping the destination buffer, but the Atom renderer does not guarantee any type of alignment when mapping buffers. Any required alignment has to be handled by PKFX.

This issue was discovered when introducing a new memory allocator for Atom Vulkan (VMA). The current implementation (without VMA) is working by pure coincidence because of the staging buffers used for copying happen to be aligned to 16 byte.

This issue is currently blocking the integration of the new memory allocator, since it breaks the PKFX gem. In order to test the new memory allocator please use the following PR https://github.com/o3de/o3de/pull/15832

To run Atom with the Vulkan renderer pass the "-rhi=vulkan" into the command line.

moudgils commented 1 year ago

Tagging @HugoPKFX @ValPKFX for visibility. This is blocking VMA integration into O3de.

HugoPKFX commented 1 year ago

Hi,

Thanks for reporting the issue, this sparked an internal discussion on the need to support non-16byte aligned GPU memory allocators. Our first thought is not to support that in billboarding helpers (where it's currently asserting): forking all this heavily SIMDified code to handle unaligned streams, we're not fond of it.

Do you have additional info on why the allocator does not return 16bytes-aligned memory ? Any good reason for it (that would help approach this the right way) ?

akioCL commented 1 year ago

Do you have additional info on why the allocator does not return 16bytes-aligned memory ? Any good reason for it (that would help approach this the right way) ?

It was never suppose to return 16 byte aligned memory. The current allocator is just doing that by pure coincidence (in most cases). Depending on what has been allocated, the current allocator may return a different alignment. Also, depending on the bufferpool or the platform, we may use staging buffers for mapping, which is one of the reason we don't guarantee alignment,.
The new allocator is just allocating where it can, since we didn't specify any alignment.

@moudgils do you have any idea on how we can handle this?

JulienPKFX commented 1 year ago

Hi, OK it was never supposed to return aligned memory, but, we wonder: why ? What is the point of not returning aligned memory? Is there a minimum guaranteed alignment? 8 bytes? 4 bytes? 1 byte?

Making an allocator return aligned memory is pretty trivial on the allocator side, compared to adding wrappers user-side, and it allows "user code" that uses the memory to be improved, simplified, and be SIMD-ified nicely. All the engines PopcornFX has ever been integrated into provide properly aligned memory for anything graphics & GPU releated, hence the asserts. We're not even talking about cache-line alignment here, just basic 16-bytes alignment. Also, unaligned vector stores on NEON platforms are horrendous. Here basically the options we see:

1- Fix the allocator (ok technically it was never "broken" :) but imho, an allocator that returns non-aligned memory for vertex buffers, even if it's staging buffers, is a broken design, which will cause non-optimal codepaths at multiple levels, except if the code using it chooses to ignore anything performance related) So: just make the allocator return at least 16-bytes aligned memory (32-bytes aligned or cache-aligned pointers might be nice for AVX or copies in general but ok it's arguable and we don't really care on the VFX side for that part)

2- We implement low-performance fallbacks everywhere in the vertex-buffer filling code. IE: naive scalar, non SIMD versions. This will degrade perfs of preparing all the VFX data for rendering. Also this is error-prone because some rarely taken code-paths might be missed. This has the additional downside of making the performance unpredictable. If by luck the allocator returns aligned memory, the fast path will be taken and performance will be OK. If it returns unaligned memory, the slow path will be taken. This can change from frame to frame or even between buffers I suppose, so timings will be all over the place and the VB filling times will make the frame time jitter randomly.

We also won't simply replace the non-temporal aligned stores we're currently doing with unaligned stores: As mentioned before, this will degrade perfs too much for everyone else, esp. on mobile. Some engines & platforms also give us pointers to write-combined / uncached memory.

3- We duplicate all the existing code and when the pointer is not aligned, replace the NTA or Aligned16 stores by unaligned stores. This will kinda hide the problem under the carpet, has the same drawbacks as 2/ except perf will be much less digusting on PC (so less visible in a dev environment), but it becomes a maintenance nightmare worse than 2/ with big chunks of code duplicated all over the place, just because the new O3DE graphics allocator doesn't return aligned memory ? Not even talking about I-cache considerations of all that code duplication here.. it just seems like a bad option all over the board.

4- Maybe there is a way to make an allocator wrapper in the gem on top of the new allocator so we re-align the pointers before sending them to the SIMD VB-filling functions. But this probably requires some more trickery on the O3DE side to properly re-offset the VBs before drawing? and I personally don't know the details of O3DE enough to know if that's even feasible. If we have to do this, it seems way, way more reasonable to just to 1/ instead.

From our point of view, the clear best option is 1/ -> make the allocator return aligned memory (16, 32, cache-line, whatever, but at least 16 to play along nicely with basic SIMD). It will have the potential to improve other parts of the engine, not just the popcorn gem. Also the memory overhead of 16-bytes alignment can be daunting for a general purpose allocator where you could make a lot of tiny allocs, but here if I understand correctly we're talking about an allocator that allocates buffers for VBs, textures, etc... so the overhead should just be insignificant noise.

akioCL commented 1 year ago

Hi, OK it was never supposed to return aligned memory, but, we wonder: why ? What is the point of not returning aligned memory? Is there a minimum guaranteed alignment? 8 bytes? 4 bytes? 1 byte?

It returns aligned memory, just not at 16 byte boundary. We have no way of knowing that you want to map to a 16 byte aligned address.

Making an allocator return aligned memory is pretty trivial on the allocator side, compared to adding wrappers user-side, and it allows "user code" that uses the memory to be improved, simplified, and be SIMD-ified nicely.

Our allocator does support aligned memory. You can specify it when creating the buffer, but that's for the buffer itself not for the mapping operation.

All the engines PopcornFX has ever been integrated into provide properly aligned memory for anything graphics & GPU releated, hence the asserts. We're not even talking about cache-line alignment here, just basic 16-bytes alignment. Also, unaligned vector stores on NEON platforms are horrendous.

As as said before, we do provide alignment when creating buffers.

1- Fix the allocator (ok technically it was never "broken" :) but imho, an allocator that returns non-aligned memory for vertex buffers, even if it's staging buffers, is a broken design, which will cause non-optimal codepaths at multiple levels, except if the code using it chooses to ignore anything performance related) So: just make the allocator return at least 16-bytes aligned memory (32-bytes aligned or cache-aligned pointers might be nice for AVX or copies in general but ok it's arguable and we don't really care on the VFX side for that part)

As far as I know there's no such thing as "alignment for vertex buffers", so we cannot force all mapping operations to return a 16 byte alignment. Also, mapping is not only for vertex buffers. Assuming that mapping a buffer is always going to return a memory aligned to 16 byte is a mistake in my opinion.

We also won't simply replace the non-temporal aligned stores we're currently doing with unaligned stores: As mentioned before, this will degrade perfs too much for everyone else, esp. on mobile. Some engines & platforms also give us pointers to write-combined / uncached memory.

A solution that we could do is PKFX creates their vertex buffers with an alignment of16 byte (you can specify it in the RHI::BufferDescriptor). Then, when mapping those buffers we could respect the buffer alignment (16 byte in this case) even when using a staging buffer (we use a staging buffer aligned to 16 byte). So this way we avoid forcing an alignment for all mapping operations (which is a waste of memory) but we also provide a way to get a mapping to a specific border.

JulienPKFX commented 1 year ago

As far as I know there's no such thing as "alignment for vertex buffers", so we cannot force all mapping operations to return a 16 byte alignment

Yes I understand it can feel arbitrary. In the allocators we've worked with we saw anything from cache-aligned down to aligned to the smallest SIMD width supported by the platform, but nothing below that like "classic" general-purpose allocators that will typically align to sizeof(void*) or sometimes sizeof(int) on the platform. Nothing to do specifically with specific usages of VBs or textures, more on how that pointer will be used, so just talking about the mapped pointers where the CPU can write to before uploading to GPU memory (or not uploading but writing directly to shared GPU mem on some platforms. anyway...)

A solution that we could do is PKFX creates their vertex buffers with an alignment of16 byte (you can specify it in the RHI::BufferDescriptor). Then, when mapping those buffers we could respect the buffer alignment (16 byte in this case) even when using a staging buffer (we use a staging buffer aligned to 16 byte). So this way we avoid forcing an alignment for all mapping operations (which is a waste of memory) but we also provide a way to get a mapping to a specific border.

Ah well yes if this is possible it would be perfect ! If the mapping operation returns a pointer that respects the alignment you specify when allocating the initial buffer, then sure we'll just ask for cache-aligned VBs and it'll be good :) It also makes more sense on your side to mirror the alignment on the mapped ptr, rather than a more arbitrary "platform-dependent" align based on the smallest SIMD-width sizeof or whatever.

moudgils commented 1 year ago

This solution sounds good. By the way I checked RHI::dx12 implementation and we are globally aligning all staging buffers to 16 bytes which I guess is wasting memory. We could implement the solution mentioned above to dx12 implementation as well to help with the wastage.

akioCL commented 1 year ago

Created a small PR to add the necessary alignment for the buffers https://github.com/PopcornFX/O3DEPopcornFXPlugin/pull/52 We are getting ready to push the new memory allocator for Vulkan on the O3DE side, so this need change needs to be in before it (it shouldn't create any issues if used with the current memory allocator of O3DE)

HugoPKFX commented 1 year ago

@akioCL thanks ! Your PR was merged into the development branch of the plugin and will be available in PopcornFX v2.15.9, v2.16.4 released by the end of this week. Closing the issue

PopcornFX / O3DEPopcornFXPlugin

PKFX rendering tasks triggers assert for misaligned memory address #45