
Using a fixed heap can lead to fragmentation #18286

Open ronawho opened 3 years ago

ronawho commented 3 years ago

Today we have 3 different memory allocation modes:

- no registered heap (the simple case), where memory just comes from jemalloc/the system
- a dynamically registered heap, where large arrays get their own allocations that are registered on the fly
- a fixed heap, where one large region is allocated and registered at startup and all allocations come out of it

For the simple case, fragmentation is handled entirely by the memory layer, and jemalloc is typically really good at fragmentation avoidance. For the dynamic heap, large arrays are separately allocated and freed back to the OS, so fragmentation isn't much of an issue there either. Allocations other than large arrays suffer slightly from fragmentation because we don't provide hooks for jemalloc to return memory or to split/merge existing chunks, but this isn't bad.
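
As a rough illustration of why the dynamic heap behaves well (not the actual runtime code): each large array gets its own mapping that goes straight back to the OS on free, so large allocations can't accumulate fragmentation against each other:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative only: one mapping per large array. In the real runtime the
 * region would also be registered/unregistered with the network on the fly. */
void *large_array_alloc(size_t size) {
  void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return (p == MAP_FAILED) ? NULL : p;
}

void large_array_free(void *p, size_t size) {
  munmap(p, size);  /* hand the memory straight back to the OS */
}
```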

The fixed heap can lead to severe fragmentation. What we provide is effectively an sbrk-like interface: we can only bump a pointer into this fixed heap to satisfy new allocations, and we don't provide jemalloc deallocation or merge/split hooks. This can result in cases where, if you allocate a 100G array, free it, and then allocate a 101G array, that first 100G region can't be reused. Similarly, if you allocate and free a 100G array and then allocate two 50G arrays, we can't split that first 100G region, so 50G is wasted. Additionally, since large and small allocations are intertwined, we can get additional fragmentation.
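
To make that concrete, here's a rough sketch (illustrative only, not the actual runtime code) of what an sbrk-style bump allocator over the fixed heap looks like; with no deallocation or split/merge path, a previously used 100G region can never satisfy a later 101G request and can't be carved into two 50G pieces:

```c
#include <stddef.h>
#include <stdint.h>

static char  *fixed_heap;    /* one large region, registered at startup */
static size_t heap_size;
static size_t heap_offset;   /* only ever grows -- nothing is ever reclaimed */

/* sbrk-style bump allocation out of the fixed heap */
void *fixed_heap_alloc(size_t size, size_t alignment) {
  uintptr_t base = (uintptr_t)fixed_heap + heap_offset;
  uintptr_t p    = (base + alignment - 1) & ~(uintptr_t)(alignment - 1);
  size_t    end  = (p - (uintptr_t)fixed_heap) + size;
  if (end > heap_size)
    return NULL;             /* out of heap, even if earlier regions are idle */
  heap_offset = end;
  return (void *)p;
}
```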

We should look at providing merge/split hooks so that jemalloc can better manage existing chunks. It may also be worthwhile to satisfy small allocations starting at the top of the heap and larger ones from the bottom to limit fragmentation between long-lived small allocations (e.g. task stacks) and larger arrays. We may also be able to provide deallocation hooks, though I'm not sure how necessary that is if we have the merge/split functionality.
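
For reference, jemalloc 4.x exposes this kind of control through the per-arena `chunk_hooks_t` struct, installed via `mallctl("arena.<i>.chunk_hooks", ...)` (jemalloc 5 replaced chunks with extent hooks). Here's a minimal sketch, under the assumption that the whole fixed heap is one contiguous, committed mapping so split/merge can simply report success; the hook function names are hypothetical, and the `mallctl` symbol prefix depends on how jemalloc is configured:

```c
#include <stdbool.h>
#include <stddef.h>
#include <jemalloc/jemalloc.h>

/* Splitting a range that lies inside the single fixed-heap mapping needs no
 * real work: both halves are already backed and contiguous. */
static bool fixed_heap_split(void *chunk, size_t size, size_t size_a,
                             size_t size_b, bool committed, unsigned arena_ind) {
  return false;  /* false == success for jemalloc chunk hooks */
}

/* Likewise, adjacent ranges in the fixed heap are already physically
 * contiguous, so merging them is a no-op. */
static bool fixed_heap_merge(void *chunk_a, size_t size_a, void *chunk_b,
                             size_t size_b, bool committed, unsigned arena_ind) {
  return false;
}

/* Read the current hooks for an arena, override split/merge, and write the
 * struct back. (Arena index 0 is just for illustration.) */
static void install_fixed_heap_hooks(void) {
  chunk_hooks_t hooks;
  size_t len = sizeof(hooks);
  mallctl("arena.0.chunk_hooks", &hooks, &len, NULL, 0);
  hooks.split = fixed_heap_split;
  hooks.merge = fixed_heap_merge;
  mallctl("arena.0.chunk_hooks", NULL, NULL, &hooks, sizeof(hooks));
}
```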

ronawho commented 3 years ago

https://github.com/chapel-lang/chapel/pull/18299 improved the fragmentation issue by allowing jemalloc to merge/split chunks. Unfortunately, this hurt performance for applications that do varying dynamic allocations like Arkouda -- https://chapel-lang.org/perf/arkouda/16-node-cs-hdr/?startdate=2021/08/19&enddate=2021/09/03&configs=nightly&suite=benchmarks.

What's happening now is that we're reusing memory, but that memory already has its NUMA affinity set, so we can end up with cases where the NUMA affinity for a region of memory is pretty suboptimal. e.g. if you have 2 sockets operating on a 10GB array, each socket will zero half the array: socket 0 will touch and fault in the first 5GB and socket 1 will touch and fault in the last 5GB. If we then go to allocate a 20GB array that reuses that memory, each socket will again zero half the array: socket 0 will touch the first 10GB, but that's already been faulted in so its NUMA affinity doesn't change, and socket 1 will zero the last 10GB, which will fault it in. This will result in the first 5GB having affinity to socket 0 and the last 15GB having affinity to socket 1. So we'll have a disproportionate amount of memory accesses going to socket 1, and in general the NUMA affinity isn't what our arrays expect it to be.
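
The root of this is Linux's first-touch page placement: a page's NUMA affinity is fixed when it's first faulted in, and later writes from another socket don't migrate it. A small sketch (assuming libnuma's `get_mempolicy()` wrapper; link with `-lnuma`) showing how you'd check where a page actually landed:

```c
#include <numaif.h>   /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Report which NUMA node holds the page containing addr. */
static int node_of(void *addr) {
  int node = -1;
  if (get_mempolicy(&node, NULL, 0, addr, MPOL_F_NODE | MPOL_F_ADDR) != 0)
    perror("get_mempolicy");
  return node;
}

int main(void) {
  size_t size = 1UL << 20;
  char *buf = aligned_alloc(4096, size);
  memset(buf, 0, size);  /* first touch faults the pages in on this thread's node */
  printf("first page is on NUMA node %d\n", node_of(buf));
  /* Writing these pages later from a thread on another socket would not
   * change the answer -- which is exactly the reuse problem above. */
  free(buf);
  return 0;
}
```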

For now we're going to add an option to interleave memory allocations, which will round-robin pages across sockets. This is similar to what we did in https://github.com/chapel-lang/chapel/pull/17405, but we'll do it at allocation time instead of startup time. This will hurt best-case performance, but will also limit the worst case. Half of our memory references will be remote so NUMA affinity isn't great, but we're at least spreading the load out across the sockets.
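
The OS mechanism for this is an interleaved memory policy applied before first touch, so placement round-robins pages across NUMA nodes instead of following whichever socket touches them first. A minimal sketch using libnuma (illustrative; the real change is in the runtime's handling of the heap and large allocations):

```c
#include <numa.h>
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative only: interleave a fresh mapping across all NUMA nodes.
 * numa_interleave_memory() must be applied before first touch so the policy,
 * not first touch, decides page placement. */
void *alloc_interleaved(size_t size) {
  void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return NULL;
  if (numa_available() != -1)
    numa_interleave_memory(p, size, numa_all_nodes_ptr);
  return p;
}
```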

Longer term we'll probably want to do separate array allocations that are entirely freed back to the system, like we do for ugni, but the above is what we're planning to do as a stopgap for the 1.25 release. We're also talking about adding support for running with a process per socket or NUMA domain.

ronawho commented 3 years ago

#18350 added support for interleaving large memory allocations. This has reduced the performance impact on Arkouda -- https://chapel-lang.org/perf/arkouda/16-node-cs-hdr/?startdate=2021/08/03&enddate=2021/09/09&configs=release,nightly&suite=benchmarks

Generally speaking, performance is still behind the old peak, but most operations are still ahead of where they were with 1.24.1.