AutoMQ / automq


[Enhancement] Reduce internal memory fragmentation significantly #1985

Open CLFutureX opened 1 week ago

CLFutureX commented 1 week ago

Background

Currently, AutoMQ implements a local log cache. Because these caches stay in memory for a long time, AutoMQ manages their memory itself on top of Netty. To match the current level of concurrency, the allocator is divided into multiple areas; each area corresponds to exactly one valid HugeBuf at a time. When the current HugeBuf cannot satisfy an allocation, it is released and a new HugeBuf is created. This design gives good concurrency and allows chunks to be released promptly, but in scenarios with large messages it can drive up the internal fragmentation rate. For example, if a 1 MB ByteBuf needs to be allocated and, after hash routing, the corresponding area's HugeBuf has only 0.9 MB remaining, that 0.9 MB is wasted, nearly a quarter of the HugeBuf.
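The fragmentation arithmetic above can be made concrete with a small sketch. The 4 MiB HugeBuf capacity is an assumption chosen so that the numbers match the "nearly a quarter" claim; the method name `wastedBytes` is hypothetical.

```java
public class FragmentationExample {
    /** Bytes wasted when a request does not fit in the remaining space of a HugeBuf. */
    static long wastedBytes(long remaining, long request) {
        // If the request fits, nothing is wasted; otherwise the HugeBuf is
        // retired and its remaining bytes become internal fragmentation.
        return request > remaining ? remaining : 0;
    }

    public static void main(String[] args) {
        long hugeBufSize = 4L << 20;                  // assumed HugeBuf capacity: 4 MiB
        long remaining = (long) (0.9 * (1 << 20));    // 0.9 MiB left in the current HugeBuf
        long request = 1L << 20;                      // incoming 1 MiB allocation
        long wasted = wastedBytes(remaining, request);
        System.out.printf("wasted %d bytes = %.1f%% of a %d-byte HugeBuf%n",
                wasted, 100.0 * wasted / hugeBufSize, hugeBufSize);
        // prints: wasted 943718 bytes = 22.5% of a 4194304-byte HugeBuf
    }
}
```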

As memory is a precious resource, we should optimize this.

By default, Netty creates 2 * cores memory arenas, and each arena manages its allocated chunks through chunk lists organized by usage rate, which reduces memory fragmentation. I think we can borrow this idea and manage HugeBufs of different usage rates to reduce fragmentation.

Differences:

The difference between AutoMQ and Netty is that Netty manages chunks within a single arena, whereas AutoMQ needs to manage usage rates across areas, because each area holds only one valid HugeBuf at any given time. For the AutoMQ scenario we therefore need to pay extra attention to concurrent performance.

Design

● Set a size threshold, thresholdSize, to distinguish large messages from small ones.
● Set a HugeBufList to manage HugeBufs of different usage rates.
● Set three memory usage rate levels:
  ○ q000: 0 <= usage <= 0.5;
  ○ q050: 0.5 < usage <= 0.75;
  ○ q075: 0.75 < usage <= 1.
  (Add a switch to support scenarios with fewer than three concurrency levels; in those scenarios, allocation always comes from q000.)
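The three usage-rate levels above could be classified as follows; this is a minimal sketch, and the enum and method names are hypothetical, not taken from the linked PR.

```java
public class UsageLevels {
    // Hypothetical buckets mirroring the three usage-rate levels in the design.
    enum Level { Q000, Q050, Q075 }

    /** Classify a HugeBuf by its usage ratio (bytes allocated / capacity). */
    static Level levelOf(double usage) {
        if (usage <= 0.5) return Level.Q000;   // q000: 0    <= usage <= 0.5
        if (usage <= 0.75) return Level.Q050;  // q050: 0.5  <  usage <= 0.75
        return Level.Q075;                     // q075: 0.75 <  usage <= 1
    }

    public static void main(String[] args) {
        System.out.println(levelOf(0.3));  // Q000
        System.out.println(levelOf(0.6));  // Q050
        System.out.println(levelOf(0.9));  // Q075
    }
}
```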

thresholdSize & Concurrency

● The main purpose of thresholdSize is to improve concurrency.
● When the memory to be allocated is < thresholdSize, it can be allocated from any of the three usage rate lists.
● After routing to the corresponding HugeBufList, a start subscript is computed from the hashCode and allocation begins from that subscript, so that contention is not always concentrated at the front of the list.
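The hashed start subscript could look like the sketch below. The helper names are hypothetical; the point is only that different hash codes start their circular scan at different positions, spreading contention across the list.

```java
public class StartIndexRouting {
    /** Map a hashCode to a non-negative starting subscript in the list. */
    static int startIndex(int hashCode, int listSize) {
        // Mask off the sign bit so negative hash codes still yield a valid index.
        return (hashCode & Integer.MAX_VALUE) % listSize;
    }

    /** The i-th slot probed when scanning circularly from the start index. */
    static int probe(int start, int i, int listSize) {
        return (start + i) % listSize;
    }

    public static void main(String[] args) {
        int listSize = 4;
        int start = startIndex("some-thread".hashCode(), listSize);
        // Scan the whole list once, starting at the hashed subscript.
        for (int i = 0; i < listSize; i++) {
            System.out.print(probe(start, i, listSize) + " ");
        }
        System.out.println();
    }
}
```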

Memory Fragmentation:

● To reduce internal memory fragmentation, allocations smaller than thresholdSize are served from the lists with higher usage rates first, so the lists are weighted q075 (3) -> q050 (2) -> q000 (1).
● When the memory to be allocated is >= thresholdSize, allocate from q050 first; if that fails, allocate from q000.
● If allocation also fails in q000, force allocation from q075. Why force allocation from q075 instead of creating a new HugeBuf once q000 fails? Again to reduce fragmentation: retiring a HugeBuf in the q075 range wastes the least memory.
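The priority rules above can be summarized as an ordering function. This is an illustrative sketch, not the code from the linked PR; the list names and the 1 MiB threshold are assumptions.

```java
import java.util.List;

public class AllocationOrder {
    static final String Q000 = "q000", Q050 = "q050", Q075 = "q075";

    /** Order in which the usage-rate lists are tried for a given request size. */
    static List<String> tryOrder(int size, int thresholdSize) {
        if (size < thresholdSize) {
            // Small request: prefer fuller HugeBufs to pack them tightly.
            return List.of(Q075, Q050, Q000);
        }
        // Large request: prefer q050, fall back to q000, then force q075
        // (retiring a nearly-full HugeBuf wastes the least memory).
        return List.of(Q050, Q000, Q075);
    }

    public static void main(String[] args) {
        int threshold = 1 << 20; // assumed 1 MiB threshold, for illustration only
        System.out.println(tryOrder(4 << 10, threshold));  // [q075, q050, q000]
        System.out.println(tryOrder(2 << 20, threshold));  // [q050, q000, q075]
    }
}
```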

HugeBuf Transfer:

● After each allocation, recalculate the current usage and transfer the HugeBuf accordingly.
● Under normal circumstances we only need to consider transferring forward: q000 -> q050 -> q075.
● In the special case where the HugeBuf is newly created, it could either be transferred forward or added directly to q000. We choose to add it directly to q000, because freshly allocated memory, after one use, is almost always in the q000 range.

Throughout the process, the following optimizations were also made to reduce concurrent conflicts:

  1. When selecting a HugeBuf from a HugeBufList, locks are taken at HugeBufList granularity only when adding or removing a HugeBuf; everything else locks at HugeBuf granularity.
  2. When transferring a HugeBuf, the first read of the remaining capacity uses tryLock. If the lock cannot be acquired, another thread is currently allocating from this HugeBuf, so we return immediately: the thread holding the lock will perform the same transfer step itself afterwards.
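The tryLock early-return in point 2 could be sketched as follows. The `HugeBuf` stand-in here is hypothetical (just a lock plus a usage counter), not the real class from the codebase.

```java
import java.util.concurrent.locks.ReentrantLock;

public class TransferSketch {
    // Minimal stand-in for a HugeBuf: a lock plus a usage ratio (hypothetical).
    static class HugeBuf {
        final ReentrantLock lock = new ReentrantLock();
        volatile double usage;
    }

    /**
     * Attempt to read the usage and report the target level. If the lock is
     * contended, return null: the thread holding the lock is allocating and
     * will run this transfer step itself afterwards.
     */
    static String tryTransfer(HugeBuf buf) {
        if (!buf.lock.tryLock()) {
            return null; // another thread is allocating; let it do the transfer
        }
        try {
            double u = buf.usage;
            if (u <= 0.5) return "q000";
            if (u <= 0.75) return "q050";
            return "q075";
        } finally {
            buf.lock.unlock();
        }
    }

    public static void main(String[] args) {
        HugeBuf buf = new HugeBuf();
        buf.usage = 0.8;
        System.out.println(tryTransfer(buf)); // q075
    }
}
```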

Test:

Currently, I have written a simple test case: com.automq.stream.ByteBufSeqAllocTest#testAllocCompare. It runs 50 threads with 1000 write operations, with allocation sizes drawn randomly according to these weights:

● 5 << 20, weight 5
● 1 << 20, weight 10
● 5 << 10, weight 35
● 2 << 10, weight 50

The test records total memory fragmentation, average fragmentation, and elapsed time for the two methods. Test results: total fragmentation is reduced to about 1/4, average fragmentation to about 1/3, and the elapsed time is almost the same.

oldSize: 38, newSize: 28
oldSum: 32905037, newSum: 6645039
oldAvg: 865922.0263157894, newAvg: 237322.82142857142
oldCostTimeAvg: 0.824, newCostTimeAvg: 0.475
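The weighted size distribution used by the test could be sampled like this. The class and method names are illustrative, not the actual test code; only the four sizes and their weights come from the description above.

```java
import java.util.Random;

public class WeightedSizes {
    static final int[] SIZES   = {5 << 20, 1 << 20, 5 << 10, 2 << 10};
    static final int[] WEIGHTS = {5, 10, 35, 50}; // weights sum to 100

    /** Sample an allocation size according to the weights. */
    static int sample(Random rnd) {
        int r = rnd.nextInt(100), acc = 0;
        for (int i = 0; i < SIZES.length; i++) {
            acc += WEIGHTS[i];
            if (r < acc) return SIZES[i]; // cumulative-weight lookup
        }
        throw new AssertionError("unreachable: weights cover [0, 100)");
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 5; i++) {
            System.out.println(sample(rnd));
        }
    }
}
```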

This is just a simple test. If the approach proves feasible, further optimization will follow.

CLFutureX commented 1 week ago

The currently linked PR is informal and is provided for testing purposes.

CLFutureX commented 1 week ago

@superhx @Chillax-0v0 hey, please take a look