ggerganov / llama.cpp

LLM inference in C/C++
MIT License

CUDA out of memory - but there's plenty of memory #1866

Closed: Energiz3r closed this issue 5 months ago

Energiz3r commented 1 year ago

TL;DR: When offloading all layers to GPU, RAM usage is the same as if no layers were offloaded. In situations where VRAM is sufficient to load the model but RAM is not, a CUDA out-of-memory error occurs even though there is plenty of VRAM still available.

System specs:
- OS: Windows + conda
- CPU: 13900K
- RAM: 32GB DDR5
- GPU: 2x RTX 3090 (48GB total VRAM)

When trying to load a 65B ggml 4bit model, regardless of how many layers I offload to GPU, system RAM is filled and I get a CUDA out of memory error.

I've tried with all 80 layers offloaded to GPUs, and with no layers offloaded to the GPUs at all, and the RAM usage doesn't change in either scenario. There is still about 12GB total VRAM free when the out of memory error is thrown.

Screenshot of RAM / VRAM usage with all layers offloaded to GPUs: https://i.imgur.com/vTl04qL.png

Interestingly, system RAM usage hits a ceiling while loading the model, but the error isn't thrown until the end of the loading sequence. If I had to guess at what's happening, I'd say llama.cpp isn't freeing the host-side buffers after their contents are copied to VRAM, so when CUDA later needs some system memory there is none available and it crashes.
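To illustrate what I mean (a hypothetical sketch of my mental model, not the actual llama.cpp loader): I'd expect an offloaded tensor to be staged in host memory only briefly and freed as soon as the copy to VRAM is done, roughly like this:

```c
// Hypothetical sketch of how I'd expect an offloaded tensor to be handled:
// the host staging buffer is released as soon as the copy to VRAM finishes.
// If that free never happens, RAM fills up as if nothing were offloaded.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

static void *upload_tensor(FILE *f, size_t nbytes) {
    void *host_buf = malloc(nbytes);                    // temporary host staging copy
    if (!host_buf || fread(host_buf, 1, nbytes, f) != nbytes) {
        free(host_buf);
        return NULL;
    }
    void *dev_buf = NULL;
    if (cudaMalloc(&dev_buf, nbytes) != cudaSuccess) {  // VRAM allocation
        free(host_buf);
        return NULL;
    }
    cudaMemcpy(dev_buf, host_buf, nbytes, cudaMemcpyHostToDevice);
    free(host_buf);  // <-- if this is skipped, the host copy stays allocated for the life of the model
    return dev_buf;
}
```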

E:\llama.cpp release 254a7a7>main -t 8 -n -1 -ngl 80 --color -c 2048 --temp 0.7 --repeat_penalty 1.2 --mirostat 2 --interactive-first  -m ../models/ggml-LLaMa-65B-quantized/ggml-LLaMa-65B-q4_0.bin -i -ins
main: build = 670 (254a7a7)
main: seed  = 1686799791
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
  Device 1: NVIDIA GeForce RTX 3090
llama.cpp: loading model from ../models/ggml-LLaMa-65B-quantized/ggml-LLaMa-65B-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.18 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required  = 10814.46 MB (+ 5120.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 64 layers to GPU
llama_model_load_internal: total VRAM used: 28308 MB
....................................................................................................
llama_init_from_file: kv self size  = 5120.00 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.200000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

CUDA error 2 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:2342: out of memory

Bonus: without -ngl set, loading succeeds and I actually get a few tokens' worth of inference before CUDA error 2 at D:\AI\llama.cpp\ggml-cuda.cu:994: out of memory is thrown. The model needs ~38GB of RAM and I only have 32GB, so I assume it's using the swapfile, but with no layers offloaded it's odd that an error still comes from CUDA.

JohannesGaessler commented 1 year ago

I can tell from the log that you are not using the latest master version. There have been substantial GPU changes so please re-do your test with the latest master version.

Energiz3r commented 1 year ago

Edited OP to reflect what happens on the latest commit [254a7a7]

JohannesGaessler commented 1 year ago

I can't reproduce this issue on my machine.

Energiz3r commented 1 year ago

> I can't reproduce this issue on my machine.

What are the specs of your machine? Which model did you test with?

JohannesGaessler commented 1 year ago
$ neofetch
johannesg@johannes-ms7850
-------------------------
OS: Manjaro Linux x86_64
Host: MS-7850 1.0
Kernel: 6.3.0-1-MANJARO
Uptime: 27 mins
Packages: 1100 (pacman)
Shell: zsh 5.9
Terminal: /dev/pts/2
CPU: Intel i5-4570S (4) @ 3.600GHz
GPU: NVIDIA GeForce GTX 1050 Ti
GPU: NVIDIA GeForce GTX 1070
Memory: 362MiB / 15921MiB

$ ./main --model models/opt/llama-${model_size}-ggml-${quantization}.bin --ignore-eos --n_predict 128 --ctx_size 2048 --batch_size 512 --seed 1337 --threads 4 --gpu_layers 32 --mlock | tee chat.txt
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 670 (254a7a7)
main: seed  = 1337
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070
  Device 1: NVIDIA GeForce GTX 1050 Ti
llama.cpp: loading model from models/opt/llama-33b-ggml-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0,13 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce GTX 1070) as main device
llama_model_load_internal: mem required  = 10570,53 MB (+ 3124,00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/63 layers to GPU
llama_model_load_internal: total VRAM used: 9699 MB
....................................................................................................
llama_init_from_file: kv self size  = 3120,00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 128, n_keep = 0

 ← The Writer’s Block: A Video Q&A with Kathleen Duey
The Writer’s Block: A Video Q&A with Shannon Hale →
by keplertalk | September 26, 2012 · 3:14 pm
Blog Tour Kick-Off: The Dark Unwinding by Sharon Cameron
As we have mentioned in the past, we here at KEPLER’S BOOKS LOVE to read. So what better way to spend our days than helping to put great books into people’s hands?
llama_print_timings:        load time = 100207,50 ms
llama_print_timings:      sample time =    89,00 ms /   128 runs   (    0,70 ms per token)
llama_print_timings: prompt eval time =  1473,93 ms /     2 tokens (  736,96 ms per token)
llama_print_timings:        eval time = 103572,14 ms /   127 runs   (  815,53 ms per token)
llama_print_timings:       total time = 105181,02 ms
Energiz3r commented 1 year ago

Hmm, you have 16GB of RAM but only 12GB of VRAM, if my guess on those GPUs is accurate. Can you confirm whether RAM / VRAM usage aligns with what it should be for the number of layers offloaded?

JohannesGaessler commented 1 year ago

Yes, I can confirm that it works correctly on my machine.

Energiz3r commented 1 year ago

So does the RAM usage align or not? As I mentioned, it would appear to work correctly as long as your RAM capacity isn't the constraint. Any suggestions for how else I can test? I've tried a few different models on different machines and see the same thing in every case.

Energiz3r commented 1 year ago

Saw a new build come through, a09f919 - the issue persists. If I up my RAM to 64GB it runs fine, like you say. But surely when I have 48GB of VRAM and the model needs 38GB of memory I shouldn't be using any RAM, should I?

hmage commented 1 year ago

Agreed, it seems counter-intuitive: why would you need RAM if the layers are going to live in VRAM? Why buffer the entire model in RAM before passing it to the GPU in the first place?
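For illustration only, a minimal sketch of the alternative I'm imagining (hypothetical helper, not llama.cpp's actual loader): stream the file through a small pinned staging buffer, so peak host usage is bounded by the chunk size rather than the model size.

```c
// Hypothetical sketch: stream a weight file into an already-allocated device
// buffer through a 64 MiB pinned staging buffer, so host memory usage stays
// around the chunk size instead of the whole model.
#include <cuda_runtime.h>
#include <stdio.h>

static int stream_file_to_device(const char *path, void *dev_dst, size_t nbytes) {
    const size_t chunk = 64u * 1024 * 1024;    // staging buffer size
    void *staging = NULL;
    if (cudaHostAlloc(&staging, chunk, cudaHostAllocDefault) != cudaSuccess) {
        return -1;                              // pinned allocation failed
    }
    FILE *f = fopen(path, "rb");
    if (!f) { cudaFreeHost(staging); return -1; }

    size_t done = 0;
    while (done < nbytes) {
        size_t want = (nbytes - done < chunk) ? (nbytes - done) : chunk;
        if (fread(staging, 1, want, f) != want) break;
        // copy this chunk to the device at the matching offset
        cudaMemcpy((char *)dev_dst + done, staging, want, cudaMemcpyHostToDevice);
        done += want;
    }
    fclose(f);
    cudaFreeHost(staging);
    return done == nbytes ? 0 : -1;
}
```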

Energiz3r commented 1 year ago

@ggerganov any ideas on this one? I'd rather not have to buy RAM to get around a bug 👀. If @JohannesGaessler can't look into this, that's what I'll have to do to run any model that doesn't fit into RAM.

JohannesGaessler commented 1 year ago

I mean, I can't look into it until I know how to reproduce the issue. Right now I'm just waiting for other people to report the same problem to see if there is a pattern.

Energiz3r commented 1 year ago

Sorry @JohannesGaessler, all I meant was that your test approach isn't going to replicate the issue, because you're not in a situation where you have more VRAM than RAM.

e.g. if you can reduce your available system RAM to 8GB or less (perhaps by running a memory stress test that lets you set how many GB to reserve) and then load an approximately 10GB model fully offloaded into your 12GB of VRAM, you should be able to replicate it.
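Something like this is all I mean by reserving RAM (hypothetical stand-in; any memory stress tool that holds a fixed number of GB would do):

```c
// Hypothetical RAM-hog: allocates N GiB, touches every page so the memory is
// actually resident, then sleeps until interrupted. Run it before loading the
// model to simulate a machine with less free RAM than VRAM.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv) {
    size_t gib = argc > 1 ? strtoull(argv[1], NULL, 10) : 8;
    size_t nbytes = gib << 30;
    char *p = malloc(nbytes);
    if (!p) { fprintf(stderr, "allocation failed\n"); return 1; }
    memset(p, 1, nbytes);   // force the pages to become resident
    printf("holding %zu GiB, press Ctrl+C to release\n", gib);
    pause();                // keep the memory held until killed
    return 0;
}
```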

JohannesGaessler commented 1 year ago

I don't see why that would make a difference.

Energiz3r commented 1 year ago

???

Let's start from the top:

This makes no sense.

JohannesGaessler commented 1 year ago

The entire model is never loaded into RAM when offloading. When CUDA says it's out of memory, it's referring to VRAM. My guess is that for some reason the logic for splitting tensors across GPUs doesn't work correctly on your system, so everything gets put onto just one GPU and you run out of memory.
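Roughly speaking, the split is supposed to hand each device a contiguous slice of a tensor's rows proportional to its share. A simplified sketch of that kind of logic (illustration only, not the actual ggml-cuda code):

```c
// Simplified illustration of proportional row splitting across devices.
// If the computed fractions are wrong, one device ends up with nearly all of
// the rows and runs out of VRAM while the other stays almost empty.
#include <stdint.h>
#include <stdio.h>

static void split_rows(int64_t nrows, const float *split, int ndev,
                       int64_t *row_lo, int64_t *row_hi) {
    float total = 0.0f;
    for (int i = 0; i < ndev; ++i) total += split[i];

    float acc = 0.0f;
    for (int i = 0; i < ndev; ++i) {
        row_lo[i] = (int64_t)(nrows * (acc / total));
        acc += split[i];
        row_hi[i] = (int64_t)(nrows * (acc / total));
    }
}

int main(void) {
    const float split[2] = {0.5f, 0.5f};   // even split across two GPUs
    int64_t lo[2], hi[2];
    split_rows(8192, split, 2, lo, hi);
    for (int i = 0; i < 2; ++i)
        printf("device %d: rows [%lld, %lld)\n", i, (long long)lo[i], (long long)hi[i]);
    return 0;
}
```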

Energiz3r commented 1 year ago

I'm looking at the VRAM utilisation while loading and both cards are doing the same thing, getting filled up to around 18GB out of 24.

If you have the time, I'd be happy to let you TeamViewer in or something and take a look. I'm certain I'm not messing this up on my end, but I'm not sure how else I can rule out user error. Like you say, it doesn't make any sense that system RAM is seeing much use at all, let alone being completely filled.

JohannesGaessler commented 1 year ago

Sorry but to me fixing this issue simply isn't as urgent as it is to you. I'm perfectly happy with just waiting until more people provide information. I am willing to do remote debugging but not via Teamviewer or similar software. I only do it via SSH or equivalent.

Energiz3r commented 1 year ago

Okay... I didn't say it was urgent to me, and I'm not trying to rush you. Just trying to offer my help in solving this.

I'm on a different system now, this one with a 4080 16GB and 128GB of RAM. I can load a 65B model with no layers offloaded to GPU and llama.cpp will occupy 56GB of RAM. If I offload 20 layers to GPU (llama.cpp occupies 12GB of VRAM), it will also occupy... 56GB of RAM. That's pretty definitive.

If reports from other users are what you need in order to warrant looking into this, I'll see who else I can find to replicate the issue and refer them here 👍

JohannesGaessler commented 1 year ago

@LoganDark For something like this please make a separate issue rather than commenting on an existing, unrelated issue.

Mradr commented 1 year ago

I am also having a similar issue where it seems like the model is buffered in system RAM as well as filling up VRAM; the more GPU layers I offload, the more system RAM it takes. On a 3090, for example, VRAM doesn't fill up fully (around 10-15GB out of 24) while system RAM usage jumps to almost double. Lowering gpu_layers results in less VRAM usage and overall memory usage closer to the size of the model.

Windows 11, wizardLM-13B-Uncensored.ggmlv3.q4_0.bin, CUDA build, 32GB of RAM, 3090 video card with 24GB of VRAM

ex3ndr commented 9 months ago

I have a similar problem with 2x 4090, but I have 98GB of RAM and it still doesn't work.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

lee-b commented 3 months ago

I believe I'm seeing this too with the official server-cuda image, pulled today, although note the "failed to initialize CUDA" message, which no one above seemed to mention.

I'm running with 128GB RAM, 96GB VRAM (1x3090, 3xP40), loading Meta-Llama-3-70B-Instruct.Q5_K_M.gguf.

This is with:

$ sudo lsmod | grep nvidia_uvm
nvidia_uvm           1380352  0
nvidia             56410112  55 nvidia_uvm,nvidia_modeset

and:

llama-cpp:
  image: ghcr.io/ggerganov/llama.cpp:server-cuda
  deploy:
    resources:
      reservations:
        devices:

and I get:

llama-cpp  | ggml_cuda_init: failed to initialize CUDA: unknown error
...
llama-cpp  | llm_load_tensors: offloading 80 repeating layers to GPU
llama-cpp  | llm_load_tensors: offloading non-repeating layers to GPU
llama-cpp  | llm_load_tensors: offloaded 81/81 layers to GPU
llama-cpp  | llm_load_tensors: CPU buffer size = 47628.36 MiB
llama-cpp  | ...................................................................................................
llama-cpp  | llama_new_context_with_model: n_ctx      = 8192
llama-cpp  | llama_new_context_with_model: n_batch    = 2048
llama-cpp  | llama_new_context_with_model: n_ubatch   = 512
llama-cpp  | llama_new_context_with_model: flash_attn = 0
llama-cpp  | llama_new_context_with_model: freq_base  = 500000.0
llama-cpp  | llama_new_context_with_model: freq_scale = 1
llama-cpp  | ggml_cuda_host_malloc: failed to allocate 2560.00 MiB of pinned memory: unknown error
llama-cpp  | llama_kv_cache_init: CPU KV buffer size = 2560.00 MiB
llama-cpp  | llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama-cpp  | ggml_cuda_host_malloc: failed to allocate 0.98 MiB of pinned memory: unknown error
llama-cpp  | llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama-cpp  | ggml_cuda_host_malloc: failed to allocate 1104.01 MiB of pinned memory: unknown error
llama-cpp  | llama_new_context_with_model: CUDA_Host compute buffer size = 1104.01 MiB
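In case it is useful, a minimal standalone check (hypothetical diagnostic, not part of the image) run inside the same container should show whether the CUDA runtime can initialize and pin memory at all, independent of llama.cpp:

```c
// Minimal CUDA runtime check: if this also reports "unknown error", the
// problem is the container's CUDA setup rather than llama.cpp itself.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("found %d CUDA device(s)\n", n);

    // pinned host memory is what ggml_cuda_host_malloc is failing to get
    void *p = NULL;
    err = cudaHostAlloc(&p, 64u * 1024 * 1024, cudaHostAllocDefault);
    printf("cudaHostAlloc(64 MiB): %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFreeHost(p);
    return 0;
}
```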

But this works fine:

check-gpu:
  image: nvidia/cuda:11.4.3-runtime-ubuntu20.04
  command: nvidia-smi
  profiles: ["check-gpu"]
  deploy:
    resources:
      reservations:
        devices:

$ docker compose up check-gpu
[+] Running 1/0
 ✔ Container solution-check-gpu-1  Created   0.0s
Attaching to check-gpu-1
check-gpu-1  |
check-gpu-1  | ==========
check-gpu-1  | == CUDA ==
check-gpu-1  | ==========
check-gpu-1  |
check-gpu-1  | CUDA Version 11.4.3
check-gpu-1  |
check-gpu-1  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
check-gpu-1  |
check-gpu-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
check-gpu-1  | By pulling and using the container, you accept the terms and conditions of this license:
check-gpu-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
check-gpu-1  |
check-gpu-1  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
check-gpu-1  |
check-gpu-1  | Tue Jun 25 18:07:29 2024
check-gpu-1  | +-----------------------------------------------------------------------------+
check-gpu-1  | | NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
check-gpu-1  | |-------------------------------+----------------------+----------------------+
check-gpu-1  | | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
check-gpu-1  | | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
check-gpu-1  | |                               |                      |               MIG M. |
check-gpu-1  | |===============================+======================+======================|
check-gpu-1  | |   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
check-gpu-1  | |  0%   38C    P8     8W / 350W |      1MiB / 24576MiB |      0%      Default |
check-gpu-1  | |                               |                      |                  N/A |
check-gpu-1  | +-------------------------------+----------------------+----------------------+
check-gpu-1  | |   1  Tesla P40           Off  | 00000000:0E:00.0 Off |                  Off |
check-gpu-1  | | N/A   35C    P8     9W / 250W |      0MiB / 24576MiB |      0%      Default |
check-gpu-1  | |                               |                      |                  N/A |
check-gpu-1  | +-------------------------------+----------------------+----------------------+
check-gpu-1  | |   2  Tesla P40           Off  | 00000000:12:00.0 Off |                  Off |
check-gpu-1  | | N/A   36C    P8     9W / 250W |      0MiB / 24576MiB |      0%      Default |
check-gpu-1  | |                               |                      |                  N/A |
check-gpu-1  | +-------------------------------+----------------------+----------------------+
check-gpu-1  | |   3  Tesla P40           Off  | 00000000:17:00.0 Off |                  Off |
check-gpu-1  | | N/A   34C    P8    10W / 250W |      0MiB / 24576MiB |      0%      Default |
check-gpu-1  | |                               |                      |                  N/A |
check-gpu-1  | +-------------------------------+----------------------+----------------------+
check-gpu-1  |
check-gpu-1  | +-----------------------------------------------------------------------------+
check-gpu-1  | | Processes:                                                                  |
check-gpu-1  | |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
check-gpu-1  | |        ID   ID                                                   Usage      |
check-gpu-1  | |=============================================================================|
check-gpu-1  | |  No running processes found                                                 |
check-gpu-1  | +-----------------------------------------------------------------------------+
check-gpu-1 exited with code 0