danielealbano closed this pull request 2 years ago.
This pull request introduces 1 alert when merging d981c47de19319305770a08ad8a011a8c6aa3e2b into 717f243fa333f040c02fb85b6a46ca04cad80491 - view on LGTM.com
Merging #162 (07019c2) into main (717f243) will increase coverage by 0.28%. The diff coverage is 95.85%.
```diff
@@            Coverage Diff             @@
##             main     #162      +/-   ##
==========================================
+ Coverage   77.48%   77.77%   +0.28%
==========================================
  Files          93       94       +1
  Lines        6117     6190      +73
==========================================
+ Hits         4740     4814      +74
+ Misses       1377     1376       -1
```
| Impacted Files | Coverage Δ |
|---|---|
| src/config.c | 98.46% <ø> (ø) |
| ...ta_structures/hashtable/mcmp/hashtable_op_delete.c | 100.00% <ø> (ø) |
| ...data_structures/hashtable/mcmp/hashtable_op_iter.c | 100.00% <ø> (ø) |
| .../data_structures/hashtable/mcmp/hashtable_op_set.c | 100.00% <ø> (ø) |
| ...ctures/small_circular_queue/small_circular_queue.c | 90.47% <ø> (ø) |
| src/network/channel/network_channel_iouring.c | 100.00% <ø> (ø) |
| src/network/channel/network_channel_tls.c | 1.24% <0.00%> (ø) |
| src/network/network_tls.c | 72.22% <ø> (ø) |
| .../protocol/prometheus/network_protocol_prometheus.c | 91.93% <ø> (ø) |
| ...redis/command/network_protocol_redis_command_del.c | 82.02% <ø> (ø) |
| ... and 25 more | |
Continue to review the full report at Codecov.

Legend:
- Δ = absolute <relative> (impact)
- ø = not affected
- ? = missing data

Powered by Codecov. Last update 717f243...07019c2.
The current implementation of the slab allocator uses a per-core cache, which requires synchronization to ensure that multiple threads operating on the same core don't get context switched while they are allocating or freeing memory.
cachegrand uses one thread per core, so a per-core cache, whose benefit is letting multiple threads share a local cache and reduce memory waste, isn't really useful here. To improve performance, the slab allocator has therefore been converted to a per-thread cache model.
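A minimal sketch of why the per-thread model is faster: the owning thread's fast path touches only thread-local state, so it needs no atomics, locks, or fences. The names below (`slab_slot_t`, `slab_cache_pop`, `slab_cache_push`) are hypothetical, not the actual cachegrand API.

```c
#include <stddef.h>

// Hypothetical slot type; the real allocator carves slots out of hugepages.
typedef struct slab_slot {
    struct slab_slot *next;
} slab_slot_t;

// Per-thread free list: only the owning thread ever touches it, so the
// fast path needs no synchronization at all.
static _Thread_local slab_slot_t *local_free_list = NULL;

// Fast-path allocation: pop from the thread-local cache if possible.
slab_slot_t *slab_cache_pop(void) {
    slab_slot_t *slot = local_free_list;
    if (slot != NULL)
        local_free_list = slot->next;
    return slot;   // NULL means: fall back to the slow path (fetch a hugepage)
}

// Fast-path free by the owning thread: push back onto the local cache.
void slab_cache_push(slab_slot_t *slot) {
    slot->next = local_free_list;
    local_free_list = slot;
}
```

Because `local_free_list` is `_Thread_local`, each thread sees its own independent list, which is exactly what makes cross-thread frees a problem (addressed below with the queue).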
This new model, though, faces a new challenge: the per-thread cache can't be accessed by external threads anymore, so if thread A allocates some memory that thread B ends up owning and therefore freeing, thread B can't operate on the internal data structures of the slab allocator owned by thread A.
To resolve this problem, an MPMC queue has been introduced in the slab_allocator: every time thread B needs to free memory belonging to thread A, it pushes the slab_slot to free onto that queue, and thread A, when it needs to allocate memory, checks that queue before requesting a new hugepage. The queue is checked only after the local cache, because fetching from the queue requires atomic operations / full memory fences and is therefore slower. At shutdown, thread A drains this queue before cleaning up the slab slices and returning the hugepages to the central cache.
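A bounded, array-based MPMC queue of the kind described above can be sketched with Vyukov-style per-cell sequence numbers. This is a hedged illustration with hypothetical names (`mpmc_queue_t`, `mpmc_push`, `mpmc_pop`), not the actual cachegrand implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    _Atomic size_t seq;   // per-cell sequence number gating access
    void *data;
} mpmc_cell_t;

typedef struct {
    mpmc_cell_t *cells;
    size_t mask;          // capacity - 1; capacity must be a power of two
    _Atomic size_t head;  // enqueue position
    _Atomic size_t tail;  // dequeue position
} mpmc_queue_t;

bool mpmc_init(mpmc_queue_t *q, size_t capacity) {
    if (capacity < 2 || (capacity & (capacity - 1)) != 0)
        return false;
    q->cells = calloc(capacity, sizeof(mpmc_cell_t));
    if (q->cells == NULL)
        return false;
    q->mask = capacity - 1;
    for (size_t i = 0; i < capacity; i++)
        atomic_store(&q->cells[i].seq, i);
    atomic_store(&q->head, 0);
    atomic_store(&q->tail, 0);
    return true;
}

bool mpmc_push(mpmc_queue_t *q, void *data) {
    size_t pos = atomic_load(&q->head);
    for (;;) {
        mpmc_cell_t *c = &q->cells[pos & q->mask];
        size_t seq = atomic_load(&c->seq);
        intptr_t dif = (intptr_t)seq - (intptr_t)pos;
        if (dif == 0) {
            // Cell is free for this position: claim it with a CAS.
            if (atomic_compare_exchange_weak(&q->head, &pos, pos + 1)) {
                c->data = data;
                atomic_store(&c->seq, pos + 1);   // publish to consumers
                return true;
            }
        } else if (dif < 0) {
            return false;                         // queue full
        } else {
            pos = atomic_load(&q->head);          // lost a race, retry
        }
    }
}

bool mpmc_pop(mpmc_queue_t *q, void **data) {
    size_t pos = atomic_load(&q->tail);
    for (;;) {
        mpmc_cell_t *c = &q->cells[pos & q->mask];
        size_t seq = atomic_load(&c->seq);
        intptr_t dif = (intptr_t)seq - (intptr_t)(pos + 1);
        if (dif == 0) {
            if (atomic_compare_exchange_weak(&q->tail, &pos, pos + 1)) {
                *data = c->data;
                atomic_store(&c->seq, pos + q->mask + 1); // recycle the cell
                return true;
            }
        } else if (dif < 0) {
            return false;                         // queue empty
        } else {
            pos = atomic_load(&q->tail);          // lost a race, retry
        }
    }
}
```

The per-cell sequence numbers are what make this lock-free: producers and consumers contend only on single-word CAS operations instead of a shared spinlock, which is also why the queue is still slower than the purely thread-local fast path and is consulted only after it.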
This queue, though, introduces an additional problem: if thread A terminates while thread B still has memory to return, that memory would be lost because thread A no longer tracks it, and on top of that thread A would free the control structures that thread B still needs! To avoid this, the slab allocator now postpones freeing ALL the locally cached memory until every object has been freed (locally or pushed to the queue); when that happens, the thread pushing the last item takes care of the cleanup.
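The "last one to free cleans up" rule can be sketched as a reference count covering the outstanding objects plus one implicit reference held by the owner thread, released when it terminates. The names below are hypothetical, not the actual cachegrand data structures:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    // Outstanding objects + one implicit reference held by the owner
    // thread itself; the final cleanup runs only when this hits zero.
    atomic_size_t refs;
} slab_allocator_lifetime_t;

// Initialize with the owner's implicit reference plus the live objects.
void slab_lifetime_init(slab_allocator_lifetime_t *l, size_t live_objects) {
    atomic_init(&l->refs, live_objects + 1);
}

// Called once per freed object (from any thread), and once by the owner
// at termination. Returns true for exactly one caller across all threads:
// the one that must perform the final cleanup.
bool slab_lifetime_release(slab_allocator_lifetime_t *l) {
    return atomic_fetch_sub(&l->refs, 1) == 1;
}
```

Because `atomic_fetch_sub` returns the previous value, exactly one thread observes the count dropping from 1 to 0, so the cleanup runs exactly once without any lock, regardless of whether the owner or a remote freeing thread happens to go last.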
This last bit could be improved by introducing a spinlock to synchronize access to the slab allocator data structures after the owning thread has terminated, but it isn't a problem in the current model because the threads all terminate together, at the very end.
Another small improvement: the MPMC queue has also been used for the hugepages cache, replacing the doubly linked list + spinlock and simplifying the code considerably.
The PR includes a number of minor fixes and improvements.
Dropping the spinlock provides a general performance improvement and reduces the CPU cycles wasted waiting on it; even with just one thread locking it, the difference is noticeable.
Here is a more general comparison that includes the standard malloc shipped with Ubuntu and tcmalloc.
Here is a zoom-in on the allocations from 16 to 1024 bytes.
All the data is available here: https://docs.google.com/spreadsheets/d/1c3vEXq5JU4h_uxS1wGjiJkzSqfiuPrs8wcTYKLpFz7s/edit?usp=sharing