ezrosent / allocators-rs

Allocators in Rust
Apache License 2.0

Add benchmark against rpmalloc / lockfree-malloc? #78

Closed: twmb closed this issue 7 years ago

twmb commented 7 years ago

It looks like rpmalloc has benchmarks showing it surpasses lockless's performance, but I'm not sure whether that's the same lockless allocator as the one in your comparisons.

I like the elfmalloc graphs because they show multiple aspects of every allocator in various scenarios.

If it's easy enough, can you throw a few lock-free allocator benches into the performance comparisons?

ezrosent commented 7 years ago

Thanks for the link! I hadn't heard of rpmalloc before and I'm excited to see what we can learn from the work they have done there.

I'll keep this issue open until I have time to integrate rpmalloc into our benchmarking setup. A couple of points though:

Except at thread startup and shutdown (where we cannot rely on crossbeam for non-blocking data-structures -- that dependency will go away once we can stop relying on crossbeam TLS), elfmalloc is lock-free. In fact, a wait-free variant of elfmalloc with essentially the same performance would be about a 1-line change for us (it has to do with how we scaffold the FAAArrayQueue inner loop).
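
To make that concrete, here is a minimal sketch of what I mean by scaffolding the inner loop. The `Attempt` type and `try_pop_once` are made-up stand-ins for a single fetch-and-add-based dequeue attempt, not the actual elfmalloc code:

```rust
// Hypothetical sketch, not the elfmalloc source: the lock-free vs. wait-free
// distinction comes down to whether the retry loop around a single dequeue
// attempt is bounded.

#[allow(dead_code)]
enum Attempt<T> {
    Done(Option<T>), // the attempt finished (a value, or the queue looked empty)
    Contended,       // we were "bumped out" and should try again
}

// Stand-in for one fetch-and-add-based dequeue attempt.
fn try_pop_once<T>() -> Attempt<T> {
    Attempt::Done(None)
}

// Lock-free scaffolding: some thread always makes progress, but this
// particular thread may retry indefinitely.
fn pop_lock_free<T>() -> Option<T> {
    loop {
        if let Attempt::Done(v) = try_pop_once() {
            return v;
        }
    }
}

// The roughly one-line change: bound the loop, and let the caller fall back
// to another allocation path after `MAX_TRIES` failed attempts.
fn pop_bounded<T>() -> Option<T> {
    const MAX_TRIES: usize = 64; // arbitrary bound for the sketch
    for _ in 0..MAX_TRIES {
        if let Attempt::Done(v) = try_pop_once() {
            return v;
        }
    }
    None
}

fn main() {
    assert_eq!(pop_lock_free::<u8>(), None);
    assert_eq!(pop_bounded::<u8>(), None);
}
```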

I haven't exactly advertised this portion of the project because I think that most of the time people are interested in a memory allocator for performance rather than progress guarantees. In practice, I also suspect that software that implements algorithms without relying on locks is much more likely to be buggy than code that relies on locks. That proviso absolutely applies to elfmalloc. Do you think it's worth writing up why we chose a non-blocking design for elfmalloc?

twmb commented 7 years ago

I think it's worth writing about. It's interesting, and it also means that there should hopefully be lower tail latencies due to no lock contention.

I read the Michael / Scott paper and the four variants on concurrencyfreaks (I still have the hazard pointer one and the Yang / Mellor-Crummey and Morrison ones to go).

I haven't dug in completely yet (there's a lot of code), but I couldn't tell whether anything was done to avoid the bumping problem concurrencyfreaks described in their next post.

Would a situation where the allocation pattern is allocation-heavy while frees happen more slowly, followed by bursts of frees, be bad? In that situation, the heavier consumption would trigger the bumping problem they describe. The bursts of frees would keep the allocation level steady over the long term, so it wouldn't leak.

Also, I saw the original C++ code uses .load() and .store() directly on the atomic pointers, which default to SeqCst - it might be worth adding some commentary on why Relaxed is fine (and the reason for the Acquire fence). I'll think more about that shortly.

ezrosent commented 7 years ago

I think it's worth writing about. It's interesting, and it also means that there should hopefully be lower tail latencies due to no lock contention.

Ah, good point. I am curious to see if there's an impact in that setting. A lot of the benchmarks we are using are off-the-shelf malloc benchmarks, and they are mostly throughput-focused. Tail latency is definitely something worth looking at though! I think that would be a good setting to also talk about our other reasons for taking this approach (mainly, for fun/learning, and also reducing dependencies on OS-specific constructs).

Would a situation where the allocation pattern is allocation-heavy while frees happen more slowly, followed by bursts of frees, be bad? In that situation, the heavier consumption would trigger the bumping problem they describe. The bursts of frees would keep the allocation level steady over the long term, so it wouldn't leak.

Good question! I think there are a couple things that help us in bursty scenarios.

  1. One general point is to keep in mind how we use these queues. The slab subsystem naturally builds in a form of batching for us here. One push or pop can be enough for several KB or MB of memory allocated. Even in a situation where we believe the underlying algorithm is suboptimal for a particular workload, it may be the case that a single successful push or pop need only occur every thousand alloc or free calls (though this effect is attenuated for larger allocations).

  2. Furthermore, we wrap an instance of the FAAArrayQueue in a BagPipe. It tries its best to make sure that push and pop calls are sharded across several instances of a given queue. In addition, the BagPipe handles load balancing for push and pop by having them "move" around the array of queues (meaning that they start their search at a different place on the next call) when an operation fails (e.g. if someone is "bumped out"). A bursty scenario with lots of failures means there will be lots of retries and migration, decreasing the chances of a particular group of threads converging on one queue and causing mayhem (see the sketch after this list).

Note: Your particular example does suggest a possible optimization. What if we only "migrated" threads that fail push operations? In the bursty scenario, the popping threads would stay put (and fail more often) but the pushing threads would naturally migrate to queues that are less contended by dequeuers. There wouldn't be starvation, as push and pop will eventually visit all available queues in the BagPipe, but dequeueing threads would focus on fixed queues more, giving enqueuers an edge.

  3. We try not to linger on a particular BagPipe for very long; if too many transient failures like the "bumping out" process occur for allocating (i.e. popping) threads, we fall back on other allocation avenues. I think this would naturally diffuse bad situations like the one you described.
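
To illustrate point 2, here is a rough sketch of the sharding-plus-migration idea. The types and names are invented for the example, and a `Mutex<VecDeque>` stands in for the lock-free FAAArrayQueue purely to keep the sketch short; this is not the actual BagPipe implementation:

```rust
// Hypothetical sketch of BagPipe-style sharding with migration on failure.
use std::cell::Cell;
use std::collections::VecDeque;
use std::sync::Mutex;

const NUM_SHARDS: usize = 8;

struct ShardedBag<T> {
    // Each shard stands in for one lock-free queue instance.
    shards: Vec<Mutex<VecDeque<T>>>,
}

thread_local! {
    // Where this thread starts its search on the next push/pop.
    static START: Cell<usize> = Cell::new(0);
}

impl<T> ShardedBag<T> {
    fn new() -> Self {
        ShardedBag {
            shards: (0..NUM_SHARDS).map(|_| Mutex::new(VecDeque::new())).collect(),
        }
    }

    fn push(&self, v: T) {
        let start = START.with(|s| s.get());
        self.shards[start % NUM_SHARDS].lock().unwrap().push_back(v);
    }

    fn pop(&self) -> Option<T> {
        let start = START.with(|s| s.get());
        for i in 0..NUM_SHARDS {
            let ix = (start + i) % NUM_SHARDS;
            if let Some(v) = self.shards[ix].lock().unwrap().pop_front() {
                return Some(v);
            }
            // A failed attempt "migrates" this thread: its next call starts
            // at a different shard, spreading contending threads apart.
            START.with(|s| s.set((ix + 1) % NUM_SHARDS));
        }
        None
    }
}

fn main() {
    let bag = ShardedBag::new();
    bag.push(1u32);
    assert_eq!(bag.pop(), Some(1));
}
```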

Also, I saw the original C++ code uses .load() and .store() directly on the atomic pointers, which default to SeqCst - it might be worth adding some commentary on why Relaxed is fine (and the reason for the Acquire fence). I'll think more about that shortly.

While I did think about the ordering constraints on these operations when I wrote the code, this stuff is super tricky. I would neither be surprised nor ashamed if there were bugs, so if something seems fishy it probably is! Which load and store calls are you curious about in particular?

ezrosent commented 7 years ago

I updated the performance numbers to include rpmalloc

joshlf commented 7 years ago

Are we good to close this?

twmb commented 7 years ago

Hey I just read the docs commit. I'm surprised that rpmalloc performed poorly on many of those benchmarks - I was expecting it to do a lot better based on the one graph on their README.

Your particular example does suggest a possible optimization. What if we only "migrated" threads that fail push operations? In the bursty scenario, the popping threads would stay put (and fail more often) but the pushing threads would naturally migrate to queues that are less contended by dequeuers. There wouldn't be starvation as push and pop will eventually visit all available queues in the BagPipe, but dequeueing threads would focus on fixed queues more, giving enqueuers an edge.

I can't be positive on this - I don't know the code well enough! If my high-level understanding is correct, this would strictly be an optimization - the popping behavior would be the same as it is now, but pushing would be optimized to avoid failures.

Which load and store calls are you curious about in particular?

I was looking at the relaxed loads and FAA (with the one acquire fence) in FAAQueueLowLevel's try_pop. I still haven't had time to really dig further into this code base - I've been focusing on the tokio stack in the spare time I have. I'll get back to this eventually, though, since these are the types of optimizations that interest me.

Are we good to close this?

The new graphs definitely address my original issue, and they're great! Once I get back to digging into this code, I'll raise another issue if I have a question or notice something. Thanks!

joshlf commented 7 years ago

The new graphs definitely address my original issue, and they're great! Once I get back to digging into this code, I'll raise another issue if I have a question or notice something. Thanks!

I'm closing so this doesn't appear in the list of open issues, but feel free to continue the discussion here!

ezrosent commented 7 years ago

I'm surprised that rpmalloc performed poorly on many of those benchmarks - I was expecting it to do a lot better based on the one graph on their README

I don't think our results really contradict those from their README; the main difference in my mind is that the results in the rpmalloc README were run on a 4-core/8-thread machine. Our results are from a 16-core/32-thread machine, and it looks like we are stressing different parts of the algorithm there. rpmalloc is still beating a lot of others at lower core counts; it's worth looking at what they're doing there, but I haven't looked too closely yet.

the relaxed loads and FAA (with the one acquire fence) in FAAQueueLowLevel's try_pop

The function starts with a conservative check for emptiness of the queue. If we could stop time, an easy way to check for emptiness would be to check whether head_ix >= tail_ix. We can't load both of those atomically, so we first check the head and then check the tail, using the Acquire fence to order those two loads. That is, the fence prevents the load of the tail from being reordered before the load of the head. That's important because if we read the tail first, and then read the head and the pair passed the emptiness check, it's possible some other thread pushed onto the queue in the meantime, so the queue was never actually empty in the interim. There isn't an analogous race when loading the head first, because we only bail out if the tail still hasn't changed.
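
As a rough sketch (the struct and field names are placeholders, not the actual elfmalloc internals), the check looks something like this:

```rust
// Hypothetical sketch of the conservative emptiness check described above.
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct QueueIndices {
    head_ix: AtomicUsize,
    tail_ix: AtomicUsize,
}

impl QueueIndices {
    fn looks_empty(&self) -> bool {
        // Relaxed load of the head first...
        let head = self.head_ix.load(Ordering::Relaxed);
        // ...then an Acquire fence so the tail load below cannot be
        // reordered before the head load above.
        fence(Ordering::Acquire);
        // If the tail still hasn't passed the head, we conservatively
        // report the queue as empty and bail out.
        let tail = self.tail_ix.load(Ordering::Relaxed);
        head >= tail
    }
}

fn main() {
    let ix = QueueIndices {
        head_ix: AtomicUsize::new(0),
        tail_ix: AtomicUsize::new(0),
    };
    assert!(ix.looks_empty());
}
```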

Now, I think the reason why the C++ implementation uses sequentially consistent operations here is that they care a lot about linearizability. I don't know if this emptiness check is linearizable if we use relaxed ordering everywhere; I suspect it isn't. To fix that, we would at least need to make the loads Acquire and the fetch-adds Release; we may need something stronger though. For our purposes in elfmalloc that's fine, as we don't rely on any strong ordering guarantees. Still, it is probably worth moving to a stricter model as these queues don't appear to be a bottleneck.
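
If we did want to strengthen it, a first step might look like the following sketch (again with placeholder names); whether this alone is enough for linearizability is exactly the open question above:

```rust
// Hedged sketch of the stricter ordering discussed above: Acquire loads for
// the emptiness check and a Release fetch-add when claiming a slot. This may
// still not be strong enough; it only illustrates the proposed change.
use std::sync::atomic::{AtomicUsize, Ordering};

fn try_claim_slot(head_ix: &AtomicUsize, tail_ix: &AtomicUsize) -> Option<usize> {
    // Acquire loads instead of Relaxed for the conservative emptiness check.
    if head_ix.load(Ordering::Acquire) >= tail_ix.load(Ordering::Acquire) {
        return None;
    }
    // Release on the fetch-add instead of Relaxed when claiming a slot.
    Some(head_ix.fetch_add(1, Ordering::Release))
}

fn main() {
    let head = AtomicUsize::new(0);
    let tail = AtomicUsize::new(1);
    assert_eq!(try_claim_slot(&head, &tail), Some(0));
    assert_eq!(try_claim_slot(&head, &tail), None);
}
```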