ggerganov / llama.cpp

LLM inference in C/C++

Bug: [RPC] RPC apparently isn't honoring backend memory capacity et. al. #8112

Open ghchris2021 opened 3 days ago

ghchris2021 commented 3 days ago

What happened?

I'm experimenting with RPC using fresh builds from ~today, and I'm seeing some things that at first sight appear to be bugs, along with what may simply be missing features and documentation.

Test case: pick a large GGUF that works with llama-cli on a single host using CPU+RAM+swap-only inferencing, and try to use RPC to offload some of the work to other LAN host(s).

RPC client: Using llama-cli similar to this:

```
llama-cli --model DeepSeek-Coder-V2-Instruct.i1-Q4_K_S.gguf --prompt 'User: What is pi?\nAssistant: ' --n_predict -1 --ctx-size 8192 --temp 0.1 --repeat-penalty 1.0 --repeat-last-n 512 --top-k 0 --top-p 1.0 --min-p 0.0 --typical 1.0 --mirostat 0 --seed -1 --threads 12 --batch-size 32 --n-gpu-layers 99 --verbose-prompt --cache-type-k q8_0 --flash-attn --rpc localhost:1111,rpc_server_1:1111
```

Output (excerpted):

```
main: build = 9 (f702a90)
main: built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
llm_load_print_meta: model type   = 236B
llm_load_print_meta: model ftype  = Q4_K - Small
llm_load_print_meta: model params = 235.74 B
llm_load_print_meta: model size   = 124.68 GiB (4.54 BPW)
llm_load_print_meta: general.name = DeepSeek-Coder-V2-Instruct
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CPU buffer size = 470.10 MiB
llm_load_tensors: RPC[1.1.1.1:1111] buffer size = 127200.00 MiB
```

On the rpc-server side:

```
bash -c "CUDA_VISIBLE_DEVICES="" /llama.cpp/build-rpc-cuda/bin/rpc-server --mem 3072"
```

```
create_backend: using CUDA backend
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
ggml_backend_cuda_init: invalid device 0
create_backend: ggml_backend_cuda_init() failed
create_backend: using CPU backend
Starting RPC server on 0.0.0.0:1111, backend memory: 3072 MB
Accepted client connection, free_mem=3221225472, total_mem=3221225472
Client connection closed
```

Problems:

1: In the case exemplified above, `--mem 3072` is passed to the rpc-server, and the server appears to accept it, printing that it has roughly 3 GB of "free_mem" / "total_mem", great. But the llama-cli RPC client appears (from its log output) to be trying to allocate / use 125 GB of RAM on this RPC backend, more than 120 GB beyond the backend's presumably configured memory limit (`rpc-server --mem 3072`) for CPU inferencing. Watching it for a while, it is clear from the rpc-server process's RAM growth that the utilized RAM climbs steadily, well past the expected ~3 GB limit, with no end in sight.

2: EDIT: I was observing unexpectedly low RPC-client-to-rpc-server throughput; I have since determined that there is unexpectedly large TCP overhead associated with containerizing the RPC client, which originates the data transmission. So I'll now assume the sub-wire-speed transfers I was seeing between the two hosts during model transfer are attributable to that cause and do not point to a problem in llama.cpp's RPC networking.

3: As a tangent to point 2 above, a feature / enhancement suggestion: perhaps it could be possible to tell the RPC client/server that an rpc-server has a fast local "mirror" of the model file available, and that it would be preferable to load it from local disk rather than transfer it over the RPC network.

4: In another test case I did almost exactly as above, with the primary difference that the rpc-server host was run WITHOUT the CUDA_VISIBLE_DEVICES="" setting, so in that case it was able to "see" the two 8 GB VRAM GPUs present in the rpc-server host as well as that host's CPU/RAM resources. I expected the client to offload "as much as would fit" (e.g. NGL=99) of the layers to the CUDA GPUs on the rpc-server host, then spill up to the configured 32 GB into the rpc-server host's RAM/CPU, and then process the rest of the model (80 GB or whatever) on the rpc-client host, which by itself has roughly enough RAM for the whole model, plus swap. In this error case the rpc-client starts loading the model, connects to the rpc-server, and apparently tries to send 125 GB of model data to the rpc-server and malloc 125 GB of VRAM from the CUDA GPU(s), which of course fails spectacularly and results in very prompt, complete failure of the whole RPC client/server inference process. No attempt was apparently made either to portion the GPU offload request to the VRAM actually available on either GPU, or to offload to the RPC server's or RPC client's system RAM/CPU.

As an aside, for a totally different (non-RPC) use case, I have noticed that when I set NGL to 99 on the system with 2x 8 GB GPUs, llama-server seems to overestimate the available GPU VRAM and fail to run, instead of loading a layer count that actually fits in the GPUs' available VRAM and offloading the rest of the layers to CPU/RAM. When I manually set the NGL parameter sufficiently low in that non-RPC test case, however, I could find a value that allowed ~75% of VRAM to be allocated and offloaded the rest. So I'm guessing either I'm doing something consistently wrong or the code is overzealous in estimating VRAM availability. In these cases the GPUs also drive a GUI desktop, so somewhere between ~40 MB and ~300 MB of one GPU's VRAM is taken for "system use" (per nvidia-smi), and a smaller amount of the other's, but that's a tiny fraction of the VRAM available, so I'd have hoped the code could adapt to this "happy case where ~95% of VRAM is available" situation.
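For reference, the manual workaround in that non-RPC case looks roughly like the following; the layer count and 1:1 split are illustrative stand-ins for values found by trial and error, not the exact ones I used:

```
# Non-RPC, 2x 8 GB GPUs: lower -ngl by hand until the CUDA buffers fit,
# letting the remaining layers fall back to CPU/RAM.
# The layer count and split below are illustrative values only.
llama-server --model DeepSeek-Coder-V2-Instruct.i1-Q4_K_S.gguf \
  --n-gpu-layers 6 \
  --tensor-split 1,1 \
  --ctx-size 8192 --flash-attn
```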

I may well be using the llama.cpp tools naively wrt. the specific options that control resource allocation / distribution. I've found NGL and use it, and I found the rpc-server `--mem` option and assume I'm using it correctly. The other parameters relating to tensor splitting and per-GPU proportions I haven't used, and I don't assume they'd really solve the "don't ask for too much memory / VRAM" problems I'm seeing. So maybe I'm missing something entirely different / superior?

Name and Version

```
llama-cli --version
version: 9 (f702a90)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
```

What operating system are you seeing the problem on?

Linux

Relevant log output

see above
ghchris2021 commented 3 days ago

Here's log output from the case where the rpc-server does not prohibit the use of the GPUs and the process fails very quickly due to overallocation of VRAM:

rpc-server:

```
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
Starting RPC server on 0.0.0.0:1111, backend memory: 3072 MB
Accepted client connection, free_mem=3221225472, total_mem=3221225472
Client connection closed
Accepted client connection, free_mem=3221225472, total_mem=3221225472
Client connection closed
Accepted client connection, free_mem=3221225472, total_mem=3221225472
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 127200.00 MiB on device 0: cudaMalloc failed: out of memory
Client connection closed
```

rpc client:

```
llm_load_tensors: ggml ctx size = 0.80 MiB
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'DeepSeek-Coder-V2-Instruct.i1-Q4_K_S.gguf'
main: error: unable to load model
```

rgerganov commented 3 days ago
1. The available memory reported by the rpc-server is not enforced, but is used as a hint to the llama scheduler when splitting layers across devices. This works fine when all devices report their available memory and fails otherwise. For example, if you build llama-cli with the CPU backend and then offload to an RPC server started with `--mem 2000`, the scheduler will see the following:

   ```
   device 0 (CPU): available memory 1 byte
   device 1 (RPC): available memory 2000MB
   ```

   and then it will try to load everything into the RPC server. As a workaround you can use the `--tensor-split` option, which explicitly defines how tensors should be split across the available devices (a sketch of such a command follows this comment).

2. I am not able to reproduce this. When I offload to an RPC server over a gigabit link, I observe 800 Mb/s on average when transferring the model:

[image: rpc-speed (model transfer throughput screenshot)]
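For illustration, such a `--tensor-split` workaround for the CPU-build-plus-single-RPC-server case above might look something like this; the ratio is purely illustrative, and the device order should be verified against the `llm_load_tensors` buffer sizes in the log:

```
# Keep most of the model on the local CPU backend and give the remote RPC backend
# a small share. The 40:1 ratio is illustrative only; adjust it to the memory that
# is actually available, and swap the order if the log lists the RPC device first.
llama-cli --model DeepSeek-Coder-V2-Instruct.i1-Q4_K_S.gguf \
  --n-gpu-layers 99 \
  --rpc rpc_server_1:1111 \
  --tensor-split 40,1
```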

ghchris2021 commented 2 days ago

@rgerganov Thank you very much for the information you shared wrt. the `--mem` and `--tensor-split` options and the way the scheduler handles the cases mentioned above. I will try to set up a scenario where each of the two GPUs sits behind its own rpc-server, the CPU backend of the remote host also sits behind its own rpc-server, and all of them are configured to apportion their memory use via `--mem` and `--tensor-split`; then I'll see how that works wrt. client-to-server memory allocation.
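Concretely, I'm picturing something along these lines (port numbers, `--mem` limits, and split values are illustrative placeholders, I'm assuming rpc-server's `--port` option alongside `--mem`, and the count and order of the `--tensor-split` entries would need to match whatever device list the scheduler actually reports):

```
# On the remote host: one rpc-server per CUDA device, plus one CPU-only instance.
# Port numbers and --mem limits are illustrative.
CUDA_VISIBLE_DEVICES=0  /llama.cpp/build-rpc-cuda/bin/rpc-server --port 50052 --mem 7000 &
CUDA_VISIBLE_DEVICES=1  /llama.cpp/build-rpc-cuda/bin/rpc-server --port 50053 --mem 7000 &
CUDA_VISIBLE_DEVICES="" /llama.cpp/build-rpc-cuda/bin/rpc-server --port 50054 --mem 32000 &

# On the client host: list every endpoint and give each device an explicit share.
# Split values are illustrative; verify against the llm_load_tensors buffer sizes.
llama-cli --model DeepSeek-Coder-V2-Instruct.i1-Q4_K_S.gguf \
  --n-gpu-layers 99 \
  --rpc rpc_server_1:50052,rpc_server_1:50053,rpc_server_1:50054 \
  --tensor-split 7,7,32,80
```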

Thank you very much also for the independent feedback wrt. (2), the RPC client/server network performance given the software / protocols involved. I apologize for my error in suggesting that some software / protocol inefficiency might explain my 70 Mb/s measurements. Further investigation indicates that was probably attributable to network inefficiencies in how I had the rpc-client containerized during that test, as well as a suboptimal MTU choice on my NICs. I still need to investigate why the containerization setup had such an impact on the TCP connection, which in my test performed roughly 10x worse than some upstream benchmarks suggested. But I now think it is not a matter of llama.cpp's protocols, so I apologize for the mistake.
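As a first step in that investigation, I plan to measure the raw host-to-host TCP throughput with something like iperf3, both from inside and outside the container (the hostname is a placeholder):

```
# On the rpc-server host: run an iperf3 server.
iperf3 -s

# On the rpc-client host (bare metal, then again from inside the container):
iperf3 -c rpc_server_1
```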

That said, in my most compelling use case for RPC (running models locally that are much larger than the RAM / VRAM available on any single host, while making fuller use of the distributed resources available), models from 70 GB up to 200-400 GB are of interest. Even with a 1 Gb, 2.5 Gb, or 10 Gb LAN link between RPC participants, transferring that much model data over Ethernet would still be significantly slower than giving the user the option to let the RPC servers load the model data from (much faster) local SSD file systems. So I think that could be a compelling feature enhancement (which I filed as a separate request when the idea occurred to me yesterday).

Thank you very much once again for the response & information!