ggerganov / llama.cpp

LLM inference in C/C++

Bug: Load time on rpc server with multiple machines #9820

Open · angelosathanasiadis opened this issue 4 days ago

angelosathanasiadis commented 4 days ago

What happened?

I have managed to run llama.cpp with the RPC server across 2 different machines running Ubuntu (with different IPs), using the following commands:

1st machine:
bin/rpc-server -H MY_PUBLIC_IP -p 50052

2nd machine:
bin/llama-cli -m ../tinydolphin-2.8.2-1.1b-laser.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 6 --rpc MY_PUBLIC_IP:50052 -ngl 99

I have noticed that the load time is huge (compared to running the model locally with the RPC server, where it is only 600 ms):

llama_perf_sampler_print: sampling time = 0,14 ms / 12 runs (0,01 ms per token, 82758,62 tokens per second)
llama_perf_context_print: load time = 55658,27 ms
llama_perf_context_print: prompt eval time = 426,00 ms / 6 tokens (71,00 ms per token, 14,08 tokens per second)
llama_perf_context_print: eval time = 997,43 ms / 5 runs (199,49 ms per token, 5,01 tokens per second)
llama_perf_context_print: total time = 1424,04 ms / 11 tokens

My question is: what exactly happens during the load time? If I assume that the model exists on all machines, is there a way to load the model locally instead of loading it through the network?

Name and Version

version: 3789 (d39e2674) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

No response

Relevant log output

No response

rgerganov commented 4 days ago

My question is: what exactly happens during the load time?

Model layers are being transferred to the RPC server.
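
To make "transferred" concrete, here is a minimal, hypothetical sketch of the client side of that transfer: for every offloaded tensor, the raw weight bytes are written to a TCP socket and the rpc-server stores them in its own buffers. The framing and the helper names (push_tensor, send_all) are invented for illustration; the real backend (ggml-rpc.cpp) uses its own command protocol, but the cost profile is the same: load time grows with model size and shrinks with network bandwidth.

```cpp
// Hypothetical sketch (NOT the actual ggml-rpc wire protocol): push each
// weight tensor's raw bytes to the rpc-server over TCP.
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

// Send all bytes, retrying on short writes.
static bool send_all(int fd, const void * buf, size_t len) {
    const char * p = static_cast<const char *>(buf);
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) return false;
        p += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}

// Hypothetical helper: ship one named tensor's data to the server.
static bool push_tensor(int fd, const std::string & name, const std::vector<uint8_t> & data) {
    uint64_t name_len = name.size();
    uint64_t data_len = data.size();
    return send_all(fd, &name_len, sizeof(name_len)) &&
           send_all(fd, name.data(), name.size())    &&
           send_all(fd, &data_len, sizeof(data_len)) &&
           send_all(fd, data.data(), data.size());
}

int main() {
    // Connect to a (hypothetical) rpc-server listening on port 50052.
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(50052);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0) return 1;

    // For a real model this loop runs over every offloaded layer's tensors,
    // so hundreds of MB cross the network before inference can start.
    std::vector<uint8_t> fake_weights(4 * 1024 * 1024, 0);  // 4 MiB stand-in
    push_tensor(fd, "blk.0.attn_q.weight", fake_weights);

    close(fd);
    return 0;
}
```

Back-of-the-envelope, assuming a 1.1B model at Q4_K_M is on the order of 0.6-0.7 GB: pushing that over a ~100 Mbit/s link takes roughly 50-60 seconds, which lines up with the ~55 s load time reported above, whereas a localhost copy is effectively free, hence the 600 ms figure.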

If I assume that the model exists on all machines, is there a way to load the model locally instead of loading it through the network?

This has been requested several times; @slaren put some ideas here.
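
For illustration only, one shape such a feature could take (this is a guess, not necessarily what the linked proposal describes): if the same GGUF file is already present on the rpc-server machine, the client could send just a (file offset, size) descriptor and the server would read the bytes from its local copy instead of receiving them over the socket. The helper below is hypothetical; nothing with this name exists in llama.cpp.

```cpp
// Purely illustrative sketch of server-side local loading, assuming the
// model file is also stored on the rpc-server machine.
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical server-side helper: read `size` bytes at `offset` from the
// locally stored model file instead of receiving them over the network.
static std::vector<uint8_t> read_tensor_local(const std::string & gguf_path,
                                              uint64_t offset, uint64_t size) {
    std::vector<uint8_t> buf(size);
    FILE * f = std::fopen(gguf_path.c_str(), "rb");
    if (!f) throw std::runtime_error("cannot open " + gguf_path);
    // Note: a real implementation would use 64-bit seeks for multi-GB files.
    if (std::fseek(f, static_cast<long>(offset), SEEK_SET) != 0 ||
        std::fread(buf.data(), 1, size, f) != size) {
        std::fclose(f);
        throw std::runtime_error("short read from " + gguf_path);
    }
    std::fclose(f);
    return buf;
}

int main() {
    // The client would only have to transmit a tiny (offset, size) request,
    // turning a multi-GB network copy into a local disk read.
    const std::string path = "tinydolphin-2.8.2-1.1b-laser.Q4_K_M.gguf";
    std::vector<uint8_t> bytes = read_tensor_local(path, /*offset=*/0, /*size=*/1024);
    std::printf("read %zu bytes locally\n", bytes.size());
    return 0;
}
```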