edwin0cheng opened 3 weeks ago
It seems like the memory leak is related to "Active(anon)", but IIUC anonymous memory is only used for the heap, the stack, and anonymous mmaps, all of which should be reclaimed after the program exits.
IIRC "Active(anon)" does not need to be freed immediately after the process exits; it will be cleaned up by the kernel after a while. Anyway, I don't think this is a llama.cpp-related problem. You should probably review how Linux manages virtual memory: https://blogs.oracle.com/linux/post/understanding-linux-kernel-memory-statistics
> LRU lists
>
> Currently, the kernel maintains an active list and an inactive list for both page types. When the system's memory is below a watermark threshold, the kernel starts to scan the tails of inactive lists to reclaim pages which are likely to be idle for a while. If an inactive list becomes short, the kernel scans its corresponding active list at the tail to deactivate pages and move them to the inactive list.
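For example, you can watch these counters move yourself while `llama-server` runs; this is plain `/proc/meminfo` sampling, nothing llama.cpp-specific:

```sh
# sample the anonymous LRU lists once per second (values are in kB)
watch -n 1 "grep -E '^(Active|Inactive)\(anon\)' /proc/meminfo"
```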
You can see that `MemFree` dropped from 26684720 kB to 17215096 kB, which according to your linked document:

> MemFree is the amount of free, unused RAM
And the section you mentioned is, IIRC, about used memory. You can confirm this from the following paragraph (which states that in-use anonymous pages are swapped out):
> Among the two types of pages, page cache is easier to be reclaimed. A page cache page can be directly freed if not dirty. Otherwise, a write-back operation is needed. However, reclaiming an anonymous page requires to save the page to swap space.
In fact, the first time I found this bug was when my system hung with no memory available. But maybe you are right that it is not related to llama.cpp. I guess it may be related to CUDA...
I think I observe the same problem with `llama-cli` on Linux. The consequence is that I can only rerun it 3–4 times before my system runs out of memory. System monitors fail to attribute this usage to any process. I also don't know whether the cause lies in llama.cpp or NVIDIA's code.
I have the same issue.
Before starting llama.cpp:
```
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi       1,1Gi        60Gi        27Mi       1,1Gi        60Gi
Swap:           63Gi          0B        63Gi
```
After exiting:
```
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        41Gi       421Mi        18Mi        20Gi        20Gi
Swap:           63Gi       758Mi        63Gi
```
`smem` totals:

```
  PID User     Command                         Swap      USS      PSS      RSS
-------------------------------------------------------------------------------
   67 18                                     462.9M   601.2M   661.8M   980.2M
```
My system:
```
$ uname -a
Linux nixie-farm 6.11.4 #1-NixOS SMP PREEMPT_DYNAMIC Thu Oct 17 13:27:02 UTC 2024 x86_64 GNU/Linux
```
```
$ nvidia-smi
Thu Oct 24 02:07:28 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:05:00.0  On |                  N/A |
|  0%   44C    P8             18W / 165W  |      41MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     12298      G   ...iz78r52md-xorg-server-21.1.13/bin/X         36MiB |
+-----------------------------------------------------------------------------------------+
```
Could someone please try these and see what the result is?

* Add the `--no-mmap` argument
* Run with `-ngl 0`
* Maybe with a CPU-only build? (even with `-ngl 0`, a CUDA build will still load the CUDA library when it runs)

I just tried it:

* With the `--no-mmap` argument
* With `-ngl 0`

Still a memory leak.

* With a CPU-only build:
```
./llama-server -m models/Llama-3.2-3B.Q4_K_M.gguf
build: 3933 (f010b77a) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 24
system_info: n_threads = 8 (n_threads_batch = 8) / 24 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
```
No memory leak at all.
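For anyone else reproducing: the leaking CUDA-build runs above correspond to an invocation along these lines (the model path is just the one from my logs):

```sh
# CUDA build: still leaks even with mmap disabled and zero layers offloaded
./llama-server -m models/Llama-3.2-3B.Q4_K_M.gguf --no-mmap -ngl 0
```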
I also have the memory leak with `--no-mmap` and `-ngl 0`:

```
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        39Gi        21Gi        13Mi       1,8Gi        22Gi
Swap:           63Gi       531Mi        63Gi
```
OK, so based on these results, we can reduce the scope of the issue to CUDA-only builds (it happens even with `-ngl 0`).
Since this is outside my knowledge, could you help debug this further? @slaren @JohannesGaessler
You could try setting the environment variable `GGML_CUDA_NO_PINNED` to see if it is related to using pinned host memory. But ultimately, llama.cpp is a user-mode application and it cannot keep resources beyond the lifetime of the process, so whatever is happening must be at the kernel level. If you think this may be a CUDA driver bug, the best course of action would be to report it to NVIDIA. If they confirm the issue and suggest a workaround until it is fixed in the driver, we can try implementing it; otherwise there is not much we can do about it.
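In case it is unclear how to apply it: the variable just needs to be present in the process's environment, e.g. with the model path used earlier in this thread:

```sh
# run the server with pinned (page-locked) host buffers disabled in the CUDA backend
GGML_CUDA_NO_PINNED=1 ./llama-server -m models/Llama-3.2-3B.Q4_K_M.gguf
```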
`GGML_CUDA_NO_PINNED=1` fixed the leakage, so I guess it is related to using pinned host memory.
I found this post on the NVIDIA developer forums; it seems to be a kernel bug introduced in Linux 6.11.
Have the same issue.

Specs: 4× T4 with ~64 GB VRAM total, 440 GB RAM, 64 CPUs. Running the llama.cpp server built off release b3655 in Docker with CUDA 12.4. Running Mistral-Nemo-Instruct-2407-Q6_K_L.gguf with `-np 2`, continuous batching, `-ngl 99`, and flash attention.
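Roughly, that is an invocation like the following (a reconstruction from the description above, not my exact command):

```sh
# reconstruction; -np = parallel slots, -cb = continuous batching,
# -ngl = GPU layers to offload, -fa = flash attention
./llama-server -m Mistral-Nemo-Instruct-2407-Q6_K_L.gguf -np 2 -cb -ngl 99 -fa
```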
Even after 30 minutes post-generation, the memory isn't released. `GGML_CUDA_NO_PINNED=1` did not work for me. Screenshot attached.

Any inputs would be much appreciated!
I am not seeing any discrepancies with Linux 6.6.54-2-MANJARO, drivers v550.120-1, CUDA v12.6.1-1, and the latest llama.cpp master commit.
FYI: following this Linux kernel dev email thread, I ran the following commands to test whether it is related:
```
# before
> grep foll /proc/vmstat
nr_foll_pin_acquired 5344
nr_foll_pin_released 5344

# after
> grep foll /proc/vmstat
nr_foll_pin_acquired 2173154
nr_foll_pin_released 76002
```
Bingo! The acquired and released counters no longer match, i.e. pinned pages are being leaked. I will close this issue once the patch lands in a future Linux kernel.
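Anyone can run the same check; here is a one-liner that reports the imbalance directly (a sketch over the `/proc/vmstat` fields shown above):

```sh
# pinned pages acquired but never released; a growing value indicates the leak
awk '/nr_foll_pin_acquired/ {a=$2} /nr_foll_pin_released/ {r=$2} END {print "unreleased pinned pages:", a - r}' /proc/vmstat
```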
Since the patch hasn't been applied to 6.12 or mainline, I've filed a bug to make sure it's not lost: https://bugzilla.kernel.org/show_bug.cgi?id=219427
What happened?
I found that after running:
the memory used increased a lot:
Note that `buff/cache` did not increase. And here are the `/proc/meminfo` dumps if needed:
`cat /proc/meminfo` before `llama-server`:

```log
MemTotal:        65579952 kB
MemFree:         26684720 kB
MemAvailable:    32084964 kB
Buffers:             3228 kB
Cached:           6354208 kB
SwapCached:             0 kB
Active:          33631472 kB
Inactive:         4406440 kB
Active(anon):    32120904 kB
Inactive(anon):         0 kB
Active(file):     1510568 kB
Inactive(file):   4406440 kB
Unevictable:        17308 kB
Mlocked:            17308 kB
SwapTotal:        4194300 kB
SwapFree:         4194300 kB
Zswap:                  0 kB
Zswapped:               0 kB
Dirty:                 96 kB
Writeback:              0 kB
AnonPages:        3386220 kB
Mapped:           1475556 kB
Shmem:             436000 kB
KReclaimable:      159404 kB
Slab:              341540 kB
SReclaimable:      159404 kB
SUnreclaim:        182136 kB
KernelStack:        21568 kB
PageTables:         45092 kB
SecPageTables:       1052 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:     36984276 kB
Committed_AS:     9430624 kB
VmallocTotal:    34359738367 kB
VmallocUsed:       139168 kB
VmallocChunk:           0 kB
Percpu:             14208 kB
HardwareCorrupted:      0 kB
AnonHugePages:     499712 kB
ShmemHugePages:         0 kB
ShmemPmdMapped:         0 kB
FileHugePages:    2768896 kB
FilePmdMapped:     354304 kB
CmaTotal:               0 kB
CmaFree:                0 kB
Unaccepted:             0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
Hugetlb:                0 kB
DirectMap4k:      1957432 kB
DirectMap2M:     39720960 kB
DirectMap1G:     25165824 kB
```

`cat /proc/meminfo` after `llama-server`:
```log
MemTotal:        65579952 kB
MemFree:         17215096 kB
MemAvailable:    22615424 kB
Buffers:             3228 kB
Cached:           6354176 kB
SwapCached:             0 kB
Active:          43082444 kB
Inactive:         4406484 kB
Active(anon):    41571876 kB
Inactive(anon):         0 kB
Active(file):     1510568 kB
Inactive(file):   4406484 kB
Unevictable:        17308 kB
Mlocked:            17308 kB
SwapTotal:        4194300 kB
SwapFree:         4194300 kB
Zswap:                  0 kB
Zswapped:               0 kB
Dirty:               1104 kB
Writeback:              0 kB
AnonPages:        3400176 kB
Mapped:           1475480 kB
Shmem:             435924 kB
KReclaimable:      159484 kB
Slab:              332816 kB
SReclaimable:      159484 kB
SUnreclaim:        173332 kB
KernelStack:        21632 kB
PageTables:         45088 kB
SecPageTables:       1052 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:     36984276 kB
Committed_AS:     9429912 kB
VmallocTotal:    34359738367 kB
VmallocUsed:       139200 kB
VmallocChunk:           0 kB
Percpu:             14208 kB
HardwareCorrupted:      0 kB
AnonHugePages:     499712 kB
ShmemHugePages:         0 kB
ShmemPmdMapped:         0 kB
FileHugePages:    2768896 kB
FilePmdMapped:     354304 kB
CmaTotal:               0 kB
CmaFree:                0 kB
Unaccepted:             0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
Hugetlb:                0 kB
DirectMap4k:      1973816 kB
DirectMap2M:     40753152 kB
DirectMap1G:     24117248 kB
```

It seems like the memory leak is related to "Active(anon)", but IIUC anonymous memory is only used for the heap, the stack, and anonymous mmaps, all of which should be reclaimed after the program exits.
So I don't know how it is possible to leak.
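If it helps anyone reproduce the comparison, the two snapshots can be diffed directly (a sketch; `before.txt` and `after.txt` are hypothetical file names):

```sh
# save a snapshot before and after the run, then show only the counters that changed
cat /proc/meminfo > before.txt   # run before starting llama-server
cat /proc/meminfo > after.txt    # run after it exits
diff before.txt after.txt
```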
Name and Version
```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
version: 3933 (f010b77a)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
```
What operating system are you seeing the problem on?
Linux
Relevant log output