OpenMPDK / SMDK

SMDK, Scalable Memory Development Kit, is developed for Samsung CXL(Compute Express Link) Memory Expander to enable full-stack Software-Defined Memory system

SMDK allocator allocates more memory than intended (compatible path) #31

Closed Sangun-Choi closed 2 weeks ago

Sangun-Choi commented 3 weeks ago

Dear SMDK contributors,

I run a very simple program that allocates 5 GB of memory using malloc to create an array, and then frees the array. However, I notice an additional 1 GB memory allocation occurring with SMDK's compatible path. I’m curious why this additional allocation occurs.

The C code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define FIVE_GB (5L * 1024 * 1024 * 1024)

int main() {
    printf("Allocating 5GB of memory...\n");
    size_t num_elements = FIVE_GB / sizeof(uint32_t);

    uint32_t *array = (uint32_t *)malloc(num_elements * sizeof(uint32_t));
    if (array == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    printf("1: Memory allocated successfully. Initializing array...\n");

    memset(array, 1, num_elements * sizeof(uint32_t));
    printf("1: Array initialized. Press Enter to free the array.\n");

    getchar();  // Wait for user to press Enter
    free(array);
    printf("1: Array freed. Press Enter to exit.\n");
    getchar();  // Wait for user to press Enter

    return 0;
}

Without loading the SMDK allocator library, the program behaves as expected: it allocates 5 GB of memory and then frees it. I monitor the program's memory usage with numastat. (Screenshots: malloc_numastat, after the 5 GB malloc; free_numastat, after freeing.)

After loading the SMDK allocator library, the program allocates 6 GB of memory. (Screenshot: smdk_malloc.)

Even after freeing the array, the program continues to use 1 GB of memory. (Screenshot: smdk_free.)

Also, if a program repeatedly performs 5 GB malloc and free, the additional 1 GB allocations are accumulated, and the program's memory usage continues to grow.

SeungjunHa commented 3 weeks ago

Thank you for using SMDK.

The SMDK Allocator is an extension of Jemalloc. To speed up memory allocation at application startup, Jemalloc starts with a memory chunk (about 1 GB; the size can be changed by modifying the Jemalloc config) that is allocated before the application runs.

For this reason, if you start the application through the SMDK Allocator and check how much memory it uses, it looks as if the application consumes more memory than it actually uses.

In addition, Jemalloc serves allocations from its memory chunk only for small objects (up to about 2 MB as far as I remember, which may be wrong, so I recommend checking it yourself), and calls the mmap syscall immediately for large objects (such as the single 5 GB malloc in your application).

Therefore, in this case, it may appear that about 1 GB, the memory Jemalloc allocated as a cache, remains in use.

Sangun-Choi commented 3 weeks ago

Thank you for your comment, and sorry for the late response. I understand why the program seems to be using more memory than I allocated!

While I understand that "jemalloc starts with a 1 GB memory chunk before the application runs," I still wonder why the 1 GB jemalloc caches accumulate when I repeatedly allocate and free 5 GB of memory. For example, if a program allocates and frees 5 GB of memory 10 times, the program uses 10 GB as jemalloc cache even after freeing all the allocated memory. The figure below shows the memory usage of the program after freeing all the allocated memory. (Screenshot: 10mallocfree.)

I am concerned that this might degrade performance if memory capacity is insufficient (or, this might not be critical if the Jemalloc cache is automatically shrunk when the free space is insufficient).

(I believe this issue is related to jemalloc rather than SMDK itself. However, since I haven't modified any jemalloc configurations before or after installing SMDK, I'd like to ask the SMDK contributors about this issue.)

SeungjunHa commented 3 weeks ago

Can I get your test code and the script you use to check memory usage, the one that prints Huge, Heap, Stack, Private... in your figures? (Does it use /proc/<pid>/smaps?)

I ran the code below on my system as you described, but I got a different result. In my case, no memory accumulates while it runs.

// test.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define FIVE_GB (5L * 1024 * 1024 * 1024)

int main() {
        for (int i=0;i<10;i++) {
                printf("Allocating 5GB of memory...\n");
                size_t num_elements = FIVE_GB / sizeof(uint32_t);

                uint32_t *array = (uint32_t *)malloc(num_elements * sizeof(uint32_t));
                if (array == NULL) {
                        perror("malloc");
                        exit(EXIT_FAILURE);
                }

                printf("1: Memory allocated successfully. Initializing array...\n");

                memset(array, 1, num_elements * sizeof(uint32_t));
                printf("1: Array initialized. Press Enter to free the array.\n");

                getchar();  // Wait for user to press Enter
                free(array);
                printf("1: Array freed. Press Enter to exit.\n");
                getchar();  // Wait for user to press Enter
        }

        return 0;
}

# run.sh
LD_PRELOAD=SMDK/lib/smdk_allocator/lib/libcxlmalloc.so ./test

Then check memory usage with watch -n 0.1 and one of:

$ free -h
$ numactl -H
$ cat /proc/3498578/smaps | grep -e heap -e stack -e Private -e Huge

Sangun-Choi commented 3 weeks ago

Thank you for testing the code. My code is almost the same as yours. I check the memory usage using the command sudo watch -n 0.2 numastat $(pidof <my_program>).

I have also run the code you provided and found that exporting CXLMALLOC_CONF triggers the issue. If I export LD_PRELOAD only, the memory (jemalloc cache) accumulation does not occur. However, if I export CXLMALLOC_CONF, the memory accumulation does occur.

For example, the following script does not result in the 1 GB memory accumulation. SMDK_malloc_10 is the binary file generated by compiling the code you provided:

#!/bin/bash

export LD_PRELOAD=/SMDK/lib/smdk_allocator/lib/libcxlmalloc.so
export CXLMALLOC_CONF=""
$(pwd)/SMDK_malloc_10
export LD_PRELOAD=""

However, the following script does result in the 1 GB memory accumulation:

#!/bin/bash

export LD_PRELOAD=/SMDK/lib/smdk_allocator/lib/libcxlmalloc.so
export CXLMALLOC_CONF=use_exmem:true,exmem_size:51200,normal_size:51200,maxmemory_policy:remain,use_auto_arena_scaling:false,priority:normal
$(pwd)/SMDK_malloc_10
export LD_PRELOAD=""

If I'm using CXLMALLOC_CONF incorrectly, please let me know!

SeungjunHa commented 3 weeks ago

I ran it with the same script you sent me, but in my case it does not accumulate 1 GB... Can you share your system information? For example, the kernel version (uname -r) and the output of numactl -H would be helpful. Also, the CXLMALLOC_CONF you used looks fine and should work normally.

Sangun-Choi commented 3 weeks ago

I am using the 6.9.0-smdk kernel and Ubuntu 24.04 LTS. My server has two NUMA nodes, each equipped with DRAM, and one CXL device (sorry, but I’m unable to provide detailed hardware information about my system).

Meanwhile, I find that this issue might not be related to the memory system but is somehow connected to the Docker container environment. When I run the script (and code) on a bare machine, it works fine, and the additional 1 GB memory allocation and memory accumulation do not occur.

However, when I run the script inside a Docker container, the aforementioned problems arise. I have attached the necessary files (Dockerfile, script, C code) to test the Docker environment. for_SMDK_github_issue.zip

You can run the test by executing the following commands:

cd <path>/for_SMDK_github_issue
docker build -t smdk_test:24.04 . 

chmod +x create_container.sh && ./create_container.sh

(inside the container)

cd examples
chmod +x run_SMDK_malloc_10.sh && ./run_SMDK_malloc_10.sh

If the additional allocation does not occur in that environment, then this behavior might be an issue specific to my setup...

SeungjunHa commented 3 weeks ago

Sorry for my late response. I have finally reproduced this problem on my systems, and it is related to use_auto_arena_scaling=false. (If you change it to true, there is no memory leakage.)

Because of other work at my company, I am only starting to debug this problem now. I will try to fix it as soon as possible and then reply again. I'm sorry for replying late again.

Sangun-Choi commented 3 weeks ago

Thank you for your response and help!

SeungjunHa commented 3 weeks ago

The key problem is in the use_auto_arena_scaling config. This config was initially designed to automatically determine the number of arenas according to the number of cores. During development, however, arena-selection logic that depends on use_auto_arena_scaling was added, and that caused this problem. (Thank you for discovering it.)

Specifically, get_auto_scale_target_arena (line 69) and get_normal_target_arena (line 77) in https://github.com/OpenMPDK/SMDK/blob/main/lib/smdk_allocator/core/init.c are selected according to the use_auto_arena_scaling config (if true, get_auto_scale_target_arena; otherwise, get_normal_target_arena).

These two are abstracted behind a function called get_target_arena, which determines which arena memory is allocated from during malloc and which arena memory is returned to during free.

The problem is that when use_auto_arena_scaling is false, get_normal_target_arena is used, which selects arenas in round-robin order (pool->arena_index++; at line 82). So when malloc and free are repeated, the first allocation is made from arena 0 and returned to arena 0. The next malloc, however, does not reuse the memory already held by arena 0; it allocates from arena 1 and the memory is returned to arena 1. This repeats over and over. The accumulated amount may be about 20% of the allocated memory (in your case 1 GB, which is 20% of 5 GB). The reason for this needs further analysis.

For example, assuming CPU 0 has 10 cores, if an application that repeats malloc and free 20 times is run through numactl --cpunodebind=0, the first 10 iterations show accumulated memory but the next 10 do not, because the round-robin wraps around and they reuse memory already allocated, starting again from arena 0.

To avoid this problem, we recommend setting use_auto_arena_scaling to true. There is no performance difference: it creates as many arenas as there are cores and statically uses the arena corresponding to the core ID on malloc and free.

Thank you for discovering this problem. We need to discuss internally how to fix it; it will probably be fixed in the next release.

SeungjunHa commented 2 weeks ago

The figures below may help you understand.

  1. use_auto_arena_scaling config is false.

  2. use_auto_arena_scaling config is true.

Sangun-Choi commented 2 weeks ago

Thank you for your clear explanation. The figures really help in understanding it.

junhyeok-im commented 2 weeks ago

Hello Sangun-Choi!

As SeungjunHa said, thanks to your report my team was able to find an issue, and we are discussing internally how to fix it. We will release a patch with the fix soon.

Can I close your issue now?

Sangun-Choi commented 2 weeks ago

Hi junhyeok-im! Thank you for your response. I will close the issue with this comment.