RunningLeon closed this issue 4 months ago.
@RunningLeon I am using yesterday's release of llama-cli and it takes less than 3 s to load a Q8 model.
cpu_has_cuda: true
n_gpu_layers: 999
You're offloading to an NVIDIA GPU. The load time depends entirely on the GPU model and PCIe speed (neither of which you mentioned).
@dspasyuk @ngxson, hi all. I run llama.cpp in a Docker container with shm-size set to 16g. Not sure if that's the problem. After switching to the host, loading becomes fast.
###########
Timings
###########
mst_eval: 17.15 # ms / token during generation
mst_p_eval: 322.98 # ms / token during prompt processing
mst_sample: 0.07 # ms / token during sampling
n_eval: 205 # number of tokens generated (excluding the first one)
n_p_eval: 227 # number of tokens processed in batches at the beginning
n_sample: 206 # number of sampled tokens
t_eval_us: 3515782 # total microseconds spent generating tokens
t_load_us: 4173895 # total microseconds spent loading the model
t_p_eval_us: 73317136 # total microseconds spent prompt processing
t_sample_us: 14204 # total microseconds spent sampling
ts_eval: 58.31 # tokens / second during generation
ts_p_eval: 3.10 # tokens / second during prompt processing
ts_sample: 14502.96 # tokens / second during sampling
What happened?
It takes around 7.2 min to load a 7B model, which is extremely slow.
See the log below.
###########
Timings
###########
mst_eval: 19.77 # ms / token during generation
mst_p_eval: 1070.98 # ms / token during prompt processing
mst_sample: 0.07 # ms / token during sampling
n_eval: 72 # number of tokens generated (excluding the first one)
n_p_eval: 184 # number of tokens processed in batches at the beginning
n_sample: 74 # number of sampled tokens
t_eval_us: 1423443 # total microseconds spent generating tokens
t_load_us: 432872591 # total microseconds spent loading the model
t_p_eval_us: 197060713 # total microseconds spent prompt processing
t_sample_us: 5047 # total microseconds spent sampling
ts_eval: 50.58 # tokens / second during generation
ts_p_eval: 0.93 # tokens / second during prompt processing
ts_sample: 14662.18 # tokens / second during sampling
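As a sanity check, the derived fields in the dump above can be recomputed from the raw microsecond counters; this is a minimal sketch (values copied from the slow-load report, not from the llama.cpp source):

```python
# Raw counters from the slow-load timings report above.
t_load_us = 432_872_591    # total microseconds spent loading the model
t_eval_us = 1_423_443      # total microseconds spent generating tokens
t_p_eval_us = 197_060_713  # total microseconds spent prompt processing
n_eval = 72                # tokens generated
n_p_eval = 184             # tokens processed during prompt processing

load_min = t_load_us / 60e6                 # model load time in minutes
mst_eval = t_eval_us / n_eval / 1e3         # ms per generated token
ts_p_eval = n_p_eval / (t_p_eval_us / 1e6)  # prompt tokens per second

print(f"load: {load_min:.1f} min, "
      f"eval: {mst_eval:.2f} ms/token, "
      f"prompt: {ts_p_eval:.2f} tok/s")
```

This reproduces the reported figures: loading alone accounts for the ~7.2 min, while generation itself is a reasonable ~19.77 ms/token, so the slowdown is in model loading, not inference.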
Name and Version
version: 343 (148ec97) built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output