angelosathanasiadis opened this issue 4 days ago
My question is: what exactly happens during the load time?
Model layers are being transferred to the RPC server
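For scale, a back-of-the-envelope check suggests that streaming the whole model over a typical LAN lands in the same ballpark as the reported load time. The numbers are assumptions, not from the issue: ~0.7 GB for a 1.1B Q4_K_M file and a 100 Mbit/s link.

```python
# Rough transfer-time estimate: if load time is dominated by pushing
# tensor data to the RPC server, it is roughly size / bandwidth.
model_bytes = 0.7e9        # assumed ~0.7 GB for a 1.1B Q4_K_M model
link_bytes_per_s = 100e6 / 8  # assumed 100 Mbit/s link, in bytes/s

seconds = model_bytes / link_bytes_per_s
print(f"{seconds:.0f} s")  # → 56 s, close to the ~55.7 s load time reported below
```

On these assumptions the measured ~55,658 ms load time is consistent with the model weights crossing the network once per run.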
If the model already exists on all machines, is it possible to load it locally instead of transferring it over the network?
This has been requested several times; @slaren put some ideas here.
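One loose sketch of how such local loading could work (the names `server_set_tensor` and `local_store` and the hash-based handshake are hypothetical illustrations, not llama.cpp's actual RPC protocol): the client first offers only a content hash of each tensor, and the server serves the bytes from its local copy of the model on a match, so the data never crosses the network.

```python
# Hypothetical hash-first tensor transfer: bytes are only sent on a miss.
import hashlib

local_store = {}  # server side: hash -> tensor bytes (could be seeded from a local model file)

def server_set_tensor(tensor_hash: bytes, fetch_bytes):
    """Return tensor data, fetching over the 'network' only on a cache miss."""
    if tensor_hash in local_store:   # hit: no transfer needed
        return local_store[tensor_hash]
    data = fetch_bytes()             # miss: fall back to a full transfer
    local_store[tensor_hash] = data
    return data

tensor = b"\x00" * 1024
h = hashlib.sha256(tensor).digest()

transfers = []
def fetch():
    transfers.append(len(tensor))    # count simulated network transfers
    return tensor

server_set_tensor(h, fetch)          # first load: bytes are transferred
server_set_tensor(h, fetch)          # second load: served locally
print(len(transfers))                # → 1, the bytes crossed the wire only once
```

With weights keyed by content hash, repeated loads of the same model would pay the transfer cost at most once per server.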
What happened?
I have managed to run the RPC server on two different machines running Ubuntu (with different IPs) with the following commands:

1st machine:

```
bin/rpc-server -H MY_PUPLICIP -p 50052
```

2nd machine:

```
bin/llama-cli -m ../tinydolphin-2.8.2-1.1b-laser.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 6 --rpc MY_PUPLICIP:50052 -ngl 99
```
I have noticed that the load time is huge (compared to running the model locally through the RPC server, where it is only 600 ms):

```
llama_perf_sampler_print: sampling time = 0,14 ms / 12 runs (0,01 ms per token, 82758,62 tokens per second)
llama_perf_context_print: load time = 55658,27 ms
llama_perf_context_print: prompt eval time = 426,00 ms / 6 tokens (71,00 ms per token, 14,08 tokens per second)
llama_perf_context_print: eval time = 997,43 ms / 5 runs (199,49 ms per token, 5,01 tokens per second)
llama_perf_context_print: total time = 1424,04 ms / 11 tokens
```
My question is: what exactly happens during the load time? If the model already exists on all machines, is it possible to load it locally instead of transferring it over the network?
Name and Version
version: 3789 (d39e2674) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
No response
Relevant log output
No response