Closed: hafezmg48 closed this 1 month ago
I'm not sure if it's the same issue (I am not using RPC), but the inference speed has dramatically slowed compared to the older version of llama.cpp.
Even though I am running CPU-only inference, it is about 4-5 times slower than the older version that used the `main` binary (2-3 tokens/sec -> 0.4 tokens/sec). For reference, the old and current versions of llama.cpp were compared on the same server.
Is there a solution to this performance issue?
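A minimal sketch of how such a version comparison could be made with llama-bench, assuming both builds provide the llama-bench binary and that the build paths and model file shown here are placeholders:

```sh
# older build (placeholder path), CPU-only run
./llama.cpp-old/build/bin/llama-bench -m ./model.gguf -ngl 0

# current build, same model and settings for a like-for-like comparison
./llama.cpp/build/bin/llama-bench -m ./model.gguf -ngl 0
```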
@hafezmg48 I am not able to reproduce such a regression with the latest code (commit 6e02327e8b783):
Results without RPC:
➜ build-rpc-cuda git:(master) ✗ bin/llama-bench -m ../models/tinyllama-1.1b-f16.gguf -ngl 99 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | ---------------: |
| llama 1B F16 | 2.05 GiB | 1.10 B | CUDA | 99 | 1 | pp512 | 315.02 ± 0.34 |
| llama 1B F16 | 2.05 GiB | 1.10 B | CUDA | 99 | 1 | tg128 | 74.83 ± 0.05 |
build: 6e02327e (3565)
Results with RPC:
➜ build-rpc-cuda git:(master) ✗ bin/llama-bench -m ../models/tinyllama-1.1b-f16.gguf -ngl 99 -fa 1 --rpc localhost:50052
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | ---------------: |
| llama 1B F16 | 2.05 GiB | 1.10 B | CUDA+RPC | 99 | 1 | pp512 | 314.06 ± 0.30 |
| llama 1B F16 | 2.05 GiB | 1.10 B | CUDA+RPC | 99 | 1 | tg128 | 68.74 ± 0.01 |
build: 6e02327e (3565)
Can you try llama-bench and post the results with different models?
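For instance, something along the lines of the commands below, run against the model from the original report (the model path is taken from that report; adjust as needed):

```sh
# fully offloaded to the local GPU, no RPC
bin/llama-bench -m ./llama3.1-8B-F16.gguf -ngl 99 -fa 1

# same model and flags, but routed through the local rpc-server
bin/llama-bench -m ./llama3.1-8B-F16.gguf -ngl 99 -fa 1 --rpc localhost:50052
```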
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
I am trying to run inference with the RPC example. When running llama-cli with the RPC feature against a single rpc-server on localhost, inference throughput is only 1.9 tok/sec for llama3.1-8B on CUDA, while the same llama-cli on a local CUDA build without RPC generates 25 tok/sec.
So it is about 13x slower even though the server is on localhost, i.e. the same GPU is used locally, just through RPC.
Name and Version
followed exact steps in https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc
running cli with command: bin/llama-cli -m ./llama3.1-8B-F16.gguf -p "Hello, my name is" -n 64 --rpc localhost:50052 -ngl 99
running rpc-server: bin/rpc-server -p 50052
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100-PCIE-32GB, compute capability 7.0, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 28170 MB
Accepted client connection, free_mem=29538713600, total_mem=34079899648
Client connection closed
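For comparison, the 25 tok/sec local figure mentioned above came from the same llama-cli invocation without RPC; a sketch of that baseline (dropping only the --rpc flag):

```sh
# local CUDA build, no RPC: same model, prompt, and offload settings
bin/llama-cli -m ./llama3.1-8B-F16.gguf -p "Hello, my name is" -n 64 -ngl 99
```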
What operating system are you seeing the problem on?
Linux
Relevant log output