danielealbano / cachegrand

cachegrand - a modern data ingestion, processing and serving platform built for today's hardware
BSD 3-Clause "New" or "Revised" License

Convert slab allocator from core cache to thread cache #162

Closed danielealbano closed 2 years ago

danielealbano commented 2 years ago

The current implementation of the slab allocator uses a per-core cache, which requires synchronization to ensure that multiple threads operating on the same core are not context-switched while they are allocating or freeing memory.

cachegrand uses one thread per core, so a per-core cache, which would let multiple threads share the local cache and reduce memory waste, offers no real benefit. To improve performance, the slab allocator has therefore been converted to a per-thread cache model.

This new model, though, faces a new challenge: the per-thread cache can no longer be accessed by external threads. If thread A allocates memory that thread B ends up owning and therefore freeing, thread B can't operate on the internal data structures of the slab allocator owned by thread A.

To resolve this problem an MPMC queue has been introduced in the slab_allocator: every time thread B needs to free memory owned by thread A, it pushes the slab_slot to free onto that queue, and thread A, when it next needs to allocate memory, checks that queue before requesting a new hugepage. The queue is checked only after the local cache, because fetching from the queue requires atomic operations / full memory fences and is therefore slower. At shutdown, thread A drains this queue before cleaning up the slab slices and returning the hugepages to the central cache.

This queue, though, introduces an additional problem: if thread A terminates while thread B still has memory to return, that memory would be lost because thread A no longer tracks it, and on top of that thread A would free the control structures thread B still needs! To avoid this issue, the slab allocator now postpones freeing ALL the locally cached memory until every object has been freed (locally or pushed to the queue) and, when that happens, the thread pushing the last item takes care of the cleanup.

This last bit could be improved by introducing a spinlock to synchronize access to the slab allocator data structures after the owning thread has terminated, but it isn't a problem in the current model because all the threads terminate together, at the very end.

Another small improvement: the MPMC queue is now also used for the hugepages cache, replacing the double linked list + spinlock and simplifying the code considerably.

The PR includes a number of minor fixes and improvements.

Dropping the spinlock provides a general performance improvement and reduces the CPU cycles wasted waiting on it; even with just one thread taking the lock the difference is noticeable. Performance comparison with and without the spinlock

Here is a more generalized comparison that includes the standard malloc shipped with Ubuntu and tcmalloc. Performance comparison

Here is a zoom-in on the allocations from 16 to 1024 bytes. Performance comparison

All the data are available here: https://docs.google.com/spreadsheets/d/1c3vEXq5JU4h_uxS1wGjiJkzSqfiuPrs8wcTYKLpFz7s/edit?usp=sharing

lgtm-com[bot] commented 2 years ago

This pull request introduces 1 alert when merging d981c47de19319305770a08ad8a011a8c6aa3e2b into 717f243fa333f040c02fb85b6a46ca04cad80491 - view on LGTM.com

new alerts:

codecov[bot] commented 2 years ago

Codecov Report

Merging #162 (07019c2) into main (717f243) will increase coverage by 0.28%. The diff coverage is 95.85%.

@@            Coverage Diff             @@
##             main     #162      +/-   ##
==========================================
+ Coverage   77.48%   77.77%   +0.28%     
==========================================
  Files          93       94       +1     
  Lines        6117     6190      +73     
==========================================
+ Hits         4740     4814      +74     
+ Misses       1377     1376       -1     
| Impacted Files | Coverage Δ |
|---|---|
| src/config.c | 98.46% <ø> (ø) |
| ...ta_structures/hashtable/mcmp/hashtable_op_delete.c | 100.00% <ø> (ø) |
| ...data_structures/hashtable/mcmp/hashtable_op_iter.c | 100.00% <ø> (ø) |
| .../data_structures/hashtable/mcmp/hashtable_op_set.c | 100.00% <ø> (ø) |
| ...ctures/small_circular_queue/small_circular_queue.c | 90.47% <ø> (ø) |
| src/network/channel/network_channel_iouring.c | 100.00% <ø> (ø) |
| src/network/channel/network_channel_tls.c | 1.24% <0.00%> (ø) |
| src/network/network_tls.c | 72.22% <ø> (ø) |
| .../protocol/prometheus/network_protocol_prometheus.c | 91.93% <ø> (ø) |
| ...redis/command/network_protocol_redis_command_del.c | 82.02% <ø> (ø) |
| ... and 25 more | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 717f243...07019c2.