LLNL / Umpire

An application-focused API for memory management on NUMA & GPU architectures
MIT License

Avoid overallocation when underlying allocation is guaranteed to be sufficiently aligned #881

Open msimberg opened 5 months ago

msimberg commented 5 months ago

Is your feature request related to a problem? Please describe.

The underlying allocator may already provide sufficient alignment, but aligned_allocate always overallocates to guarantee the alignment, even when that is not necessary: https://github.com/LLNL/Umpire/blob/45159e8be42c3e4e409be707a5c3159fd243886a/src/umpire/strategy/mixins/AlignedAllocation.inl#L25. This wastes a small amount of memory and may cause performance issues with some MPI libraries.

Describe the solution you'd like

Avoid overallocation if the underlying allocator provides sufficiently aligned allocations.
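To make the request concrete, here is a minimal sketch of the size decision I have in mind. It is not Umpire code: `backing_request_size` and `guaranteed_alignment` are hypothetical names, and the "size plus alignment" padding is taken from my description of the current behaviour rather than from a careful reading of the implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of Umpire's API.
// Returns true when x is a non-zero power of two.
constexpr bool is_power_of_two(std::size_t x) {
  return x != 0 && (x & (x - 1)) == 0;
}

// Size to request from the underlying allocator.
//   size:                 bytes the caller asked for
//   alignment:            alignment the caller needs (power of two)
//   guaranteed_alignment: alignment the underlying allocator is ASSUMED to
//                         guarantee; pass 0 when nothing is known
inline std::size_t backing_request_size(std::size_t size,
                                        std::size_t alignment,
                                        std::size_t guaranteed_alignment) {
  assert(is_power_of_two(alignment));
  // If every pointer from the underlying allocator is already a multiple of
  // `alignment`, no padding is needed to find an aligned start address.
  if (guaranteed_alignment != 0 && guaranteed_alignment % alignment == 0) {
    return size;
  }
  // Otherwise keep the conservative behaviour: pad by one alignment unit.
  return size + alignment;
}
```

With an illustrative guarantee of 256 bytes, `backing_request_size(std::size_t{1} << 30, 16, 256)` returns exactly 1 GiB, while passing 0 (unknown guarantee) returns 1 GiB + 16 as today.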

Describe alternatives you've considered

Allow controlling alignment of backing buffers separately from alignment of user-facing allocations. The latter should probably never be larger than the former.
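A rough sketch of what two separate alignment parameters could look like. `PoolAlignment`, `user_size`, and `buffer_size` are hypothetical names for illustration only, and the 2 MiB buffer alignment in the usage note is just an example value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical configuration, not an existing Umpire type.
struct PoolAlignment {
  std::size_t allocation_alignment;  // applied to user-facing allocations
  std::size_t buffer_alignment;      // applied to backing buffers
};

// Round n up to the next multiple of a power-of-two alignment.
inline std::size_t round_up(std::size_t n, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (n + alignment - 1) & ~(alignment - 1);
}

// The allocation alignment should not exceed the buffer alignment, and the
// buffer alignment should be a multiple of it.
inline bool is_valid(const PoolAlignment& a) {
  return a.allocation_alignment != 0 && a.buffer_alignment != 0 &&
         a.allocation_alignment <= a.buffer_alignment &&
         a.buffer_alignment % a.allocation_alignment == 0;
}

// Size reserved inside the pool for a user request of n bytes.
inline std::size_t user_size(const PoolAlignment& a, std::size_t n) {
  return round_up(n, a.allocation_alignment);
}

// Size requested from the underlying resource for a backing buffer of n bytes.
inline std::size_t buffer_size(const PoolAlignment& a, std::size_t n) {
  return round_up(n, a.buffer_alignment);
}
```

With `PoolAlignment{16, 2097152}`, a 24-byte user request occupies 32 bytes inside the pool, while a 1 GiB backing buffer stays at exactly 1 GiB because it is already a multiple of 2 MiB. Using the page size for both would instead round every small allocation up to a full page.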

Additional context

This is really a feature request that comes from investigating what may be a bug in Cray MPICH, but I wanted to report it here as well since I think Umpire could in some situations do a better job (or I'm simply unaware of the knobs that Umpire has for controlling this, so looking for input in any case).

In our application we use Umpire's QuickPool to pool allocations of GPU buffers. QuickPool uses aligned_allocate to allocate backing buffers from e.g. CUDA, but if I ask for a 1 GiB buffer, QuickPool allocates 1 GiB plus the alignment (16 by default) to guarantee that the allocation is aligned. It turns out that with GPU-aware MPI, communicating a buffer whose size isn't page-aligned (I think this is the requirement, but I'm still looking into the details) reduces performance considerably. I'm reporting this issue to HPE separately.

I could set the alignment of the QuickPool to the page size to get an appropriately sized backing buffer, but if I understand correctly, all allocations on top of that would then also have page-sized alignment, which is excessive for small allocations and can end up wasting a lot of memory. From what I can tell, DynamicPoolList behaves the same as QuickPool (is there a reason to prefer one over the other, by the way?).

Is there a way already to control the alignment of the backing buffers and "real" allocations on top of it separately? Is there another pool that we could use to get the behaviour we want?

Just out of curiosity, since I couldn't find it: where is the code that ensures a QuickPool allocation starts at the correct alignment? I see the size is adjusted here: https://github.com/LLNL/Umpire/blob/45159e8be42c3e4e409be707a5c3159fd243886a/src/umpire/strategy/QuickPool.cpp#L46. Edit: I realize this probably happens by construction. If the backing buffers have sufficient alignment and all the allocations have aligned sizes, they'll be guaranteed to start aligned as well.
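The "by construction" argument can be checked with a small simulation. This is a simplified model, not QuickPool's actual logic: it assumes allocations are carved consecutively from one buffer, and `align_up`, `carve`, and `all_aligned` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round n up to the next multiple of a power-of-two alignment.
inline std::size_t align_up(std::size_t n, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (n + alignment - 1) & ~(alignment - 1);
}

// Simulate carving consecutive allocations out of a buffer starting at
// `base`, rounding every size up to `alignment`. Returns the start address
// of each allocation. No memory is touched; only addresses are computed.
inline std::vector<std::uintptr_t> carve(std::uintptr_t base,
                                         const std::vector<std::size_t>& sizes,
                                         std::size_t alignment) {
  std::vector<std::uintptr_t> starts;
  std::uintptr_t cursor = base;
  for (std::size_t s : sizes) {
    starts.push_back(cursor);
    cursor += align_up(s, alignment);
  }
  return starts;
}

// True when every address is a multiple of `alignment`.
inline bool all_aligned(const std::vector<std::uintptr_t>& addrs,
                        std::size_t alignment) {
  for (std::uintptr_t a : addrs) {
    if (a % alignment != 0) return false;
  }
  return true;
}
```

Starting from an aligned base such as `0x10000`, every start address returned by `carve` stays a multiple of 16 regardless of the requested sizes. Starting from a misaligned base such as `0x10008`, none of them are, which is why the alignment of the backing buffer itself matters.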

Thanks for your help!

msimberg commented 1 month ago

Ping. Just checking if this is something interesting to you? I may be inclined to attempt implementing one of the options above if it sounds good to you. We're currently still stuck with the workaround where we have to overallocate all allocations by 2 MiB (the large page size on Grace CPUs).

davidbeckingsale commented 1 month ago

@msimberg we would definitely be interested in a fix to avoid over-allocating the underlying buffers, and if it would be useful, adjusting the pool to take two alignment parameters (one for the allocations, and one for the buffers). Thanks!