ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: rpc-server --mem Doesn't Match backend memory #8417

Open oldgithubman opened 3 weeks ago

oldgithubman commented 3 weeks ago

What happened?

$ CUDA_VISIBLE_DEVICES=0 build/bin/Release/rpc-server -p 50052 --mem 10000
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 1808 MB
$ CUDA_VISIBLE_DEVICES=0 build/bin/Release/rpc-server -p 50052 --mem 20000
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 3616 MB
$ CUDA_VISIBLE_DEVICES=0 build/bin/Release/rpc-server -p 50052 --mem 30000
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 1328 MB

I expected the server to report backend memory: $mem MB when I pass --mem $mem.
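
For what it's worth, the reported numbers are exactly what a 32-bit wraparound in the MB-to-bytes conversion would produce (unsigned long is only 32 bits under MSVC): 10000, 20000 and 30000 MB wrap to 1808, 3616 and 1328 MB respectively. Below is a minimal sketch of that arithmetic, assuming the parsed --mem value gets multiplied by 1024*1024 in 32-bit unsigned arithmetic; it is an illustration of the suspected overflow, not the actual rpc-server code.

// Sketch: converting the --mem value (MB) to bytes in 32-bit unsigned
// arithmetic reproduces the reported backend memory sizes.
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t mem_mb[] = {10000, 20000, 30000};
    for (uint32_t mb : mem_mb) {
        uint32_t bytes32 = mb * 1024u * 1024u;                // wraps modulo 2^32
        printf("--mem %u -> backend memory: %u MB\n",
               mb, bytes32 / (1024u * 1024u));
    }
    return 0;   // prints 1808, 3616 and 1328 MB
}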

Name and Version

$ ./build/bin/Release/llama-cli --version
version: 3368 (dd07a123)
built with MSVC 19.40.33812.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

(Same output as shown above.)
oldgithubman commented 3 weeks ago

This should actually be high severity

myan-o commented 3 weeks ago

Memory limits (rpc-server --mem) are not working!!

oldgithubman commented 3 weeks ago

Memory limits (rpc-server --mem) are not working!!

? I know? That's what I'm saying?

myan-o commented 3 weeks ago

There is a problem where all memory is used even if --mem is specified.

oldgithubman commented 3 weeks ago

There is a problem where all memory is used even if --mem is specified.

Awesome. /s Thanks for telling me though

myan-o commented 3 weeks ago

It loads only the number of layers set with --ngl, so it crashes due to a buffer overflow.

myan-o commented 3 weeks ago

Ideally, it would be better to change the specification so that -ngl can be set individually on the RPC server side.

oldgithubman commented 3 weeks ago

Ideally, it would be better to change the specification so that -ngl can be set individually on the RPC server side.

I think fixing --mem would be better. Remote servers should be as hands-off as possible, and -ngl should ideally become a --mem-style option as well. That would make far more sense than -ngl.
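
Purely as an illustration of that idea (a hypothetical sketch, not an existing llama.cpp option or API): given a per-server memory budget and a rough per-layer size, the client could derive the layer count itself instead of relying on a fixed -ngl per server. The function name and numbers below are illustrative assumptions.

// Hypothetical sketch: derive an offload layer count from a memory budget.
#include <algorithm>
#include <cstdint>
#include <cstdio>

static int layers_for_budget(uint64_t budget_bytes, uint64_t bytes_per_layer, int n_layers_total) {
    if (bytes_per_layer == 0) return 0;
    uint64_t fit = budget_bytes / bytes_per_layer;            // layers that fit in the budget
    return (int) std::min<uint64_t>(fit, (uint64_t) n_layers_total);
}

int main() {
    const uint64_t MiB = 1024ull * 1024ull;
    // e.g. a 10000 MiB budget for a hypothetical 80-layer model at ~350 MiB per layer
    printf("offload %d layers\n", layers_for_budget(10000 * MiB, 350 * MiB, 80));
    return 0;
}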

ghchris2021 commented 2 weeks ago

q.v.

I also found the way the RPC server and client deal with specifying / limiting memory on CPU / GPU resources confusing and limited, so I, too, would like to see a simple, clear means of limiting how much memory (RAM/VRAM) is used on each node. IMO it would also be nicer if the model data could be loaded locally rather than uploaded over the network to the RPC servers.

#8112: Bug: [RPC] RPC apparently isn't honoring backend memory capacity et. al.
#8113: Feature Request: Provide means to quantify the restriction of RAM/VRAM usage for each GPU and system RAM.
#8114: Feature Request: It would be convenient and faster if users could specify that the model data used for a RPC-server instance is already available by some fast(er) means (file system GGUF, whatever).