ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: loading model is slow using llama-cli #8323

Closed · RunningLeon closed this 4 months ago

RunningLeon commented 4 months ago

What happened?

It takes around 7.2 min to load a 7B model, which is extremely slow.

See the timings from the log:

###########
# Timings #
###########

mst_eval: 19.77  # ms / token during generation
mst_p_eval: 1070.98  # ms / token during prompt processing
mst_sample: 0.07  # ms / token during sampling
n_eval: 72  # number of tokens generated (excluding the first one)
n_p_eval: 184  # number of tokens processed in batches at the beginning
n_sample: 74  # number of sampled tokens
t_eval_us: 1423443  # total microseconds spent generating tokens
t_load_us: 432872591  # total microseconds spent loading the model
t_p_eval_us: 197060713  # total microseconds spent prompt processing
t_sample_us: 5047  # total microseconds spent sampling
ts_eval: 50.58  # tokens / second during generation
ts_p_eval: 0.93  # tokens / second during prompt processing
ts_sample: 14662.18  # tokens / second during sampling
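(For reference, t_load_us is reported in microseconds, so 432872591 µs ≈ 432.9 s ≈ 7.2 min, which is where the figure above comes from.)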

Name and Version

version: 343 (148ec97) built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

binary: main
build_commit: 148ec97
build_number: 343
cpu_has_arm_fma: false
cpu_has_avx: true
cpu_has_avx_vnni: false
cpu_has_avx2: true
cpu_has_avx512: false
cpu_has_avx512_vbmi: false
cpu_has_avx512_vnni: false
cpu_has_cuda: true
cpu_has_vulkan: false
cpu_has_kompute: false
cpu_has_fma: true
cpu_has_gpublas: true
cpu_has_neon: false
cpu_has_sve: false
cpu_has_f16c: true
cpu_has_fp16_va: false
cpu_has_wasm_simd: false
cpu_has_blas: true
cpu_has_sse3: true
cpu_has_vsx: false
cpu_has_matmul_int8: false
debug: false
model_desc: internlm2 7B F16
n_vocab: 92544  # output size of the final layer, 32001 for some models
optimize: true
time: 2024_07_05-16_57_57.318787979

###############
# User Inputs #
###############

alias: unknown # default: unknown
batch_size: 2048 # default: 512
cfg_negative_prompt:
cfg_scale: 1.000000 # default: 1.0
chunks: -1 # default: -1 (unlimited)
color: true # default: false
ctx_size: 4096 # default: 512
escape: true # default: false
file: # never logged, see prompt instead. Can still be specified for input.
frequency_penalty: 0.000000 # default: 0.0 
grammar:
grammar-file: # never logged, see grammar instead. Can still be specified for input.
hellaswag: false # default: false
hellaswag_tasks: 400 # default: 400
ignore_eos: false # default: false
in_prefix:
in_prefix_bos: false # default: false
in_suffix:
interactive: true # default: false
interactive_first: true # default: false
keep: 1 # default: 0
logdir: workdir/logdir/ # default: unset (no logging)
logit_bias:
lora:
lora_scaled:
lora_base: 
main_gpu: 0 # default: 0
min_keep: 0 # default: 0 (disabled)
mirostat: 0 # default: 0 (disabled)
mirostat_ent: 5.000000 # default: 5.0
mirostat_lr: 0.100000 # default: 0.1
mlock: false # default: false
model: ./internlm2_5-7b-chat-fp16-control.gguf # default: models/7B/ggml-model-f16.gguf
model_draft:  # default:
multiline_input: true # default: false
n_gpu_layers: 999 # default: -1
n_predict: 512 # default: -1 (unlimited)
n_probs: 0 # only used by server binary, default: 0
no_mmap: false # default: false
penalize_nl: false # default: false
ppl_output_type: 0 # default: 0
ppl_stride: 0 # default: 0
presence_penalty: 0.000000 # default: 0.0
prompt: |
  You are an AI assistant whose name is InternLM (书生·浦语).
  - InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
  - InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.

  User: Hello! Who are you?
prompt_cache: 
prompt_cache_all: false # default: false
prompt_cache_ro: false # default: false
prompt_tokens: [1, 92543, 9081, 364, 2770, 657, 589, 15358, 17993, 6843, 963, 505, 4576, 11146, 451, 60628, 60384, 60721, 62442, 60752, 4452, 285, 4576, 11146, 451, 60628, 60384, 60721, 62442, 60752, 313, 505, 395, 7659, 1813, 4287, 1762, 560, 505, 8020, 684, 36956, 15358, 31288, 451, 68589, 76659, 71581, 699, 1226, 505, 6342, 442, 517, 11100, 328, 10894, 328, 454, 51978, 756, 285, 4576, 11146, 451, 60628, 60384, 60721, 62442, 60752, 313, 777, 3696, 454, 19187, 19829, 4563, 435, 410, 4287, 12032, 684, 410, 1341, 1893, 569, 6519, 454, 262, 69093, 512, 1621, 334, 22190, 346, 10617, 657, 629, 5426, 1214, 1070, 11146, 334, 92542, 364]
repeat_penalty: 1.000000 # default: 1.1
reverse_prompt:
  - User:
rope_freq_base: 0.000000 # default: 10000.0
rope_freq_scale: 0.000000 # default: 1.0
seed: 1720169180 # default: -1 (random seed)
simple_io: false # default: false
cont_batching: true # default: false
flash_attn: false # default: false
temp: 0.800000 # default: 0.8
tensor_split: [0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00]
tfs: 1.000000 # default: 1.0
threads: 128 # default: 128
top_k: 50 # default: 40
top_p: 0.800000 # default: 0.95
min_p: 0.050000 # default: 0.0
typical_p: 1.000000 # default: 1.0
verbose_prompt: false # default: false
display_prompt: true # default: true

######################
# Generation Results #
######################

output: "\n<|im_start|>user\nCalculate the following expressions: 25-4*2+3 and 1111+9999. Display your calculations for added transparency\n<|im_end|>\n<|im_start|>assistant\n25 - 4*2 + 3 = 25 - 8 + 3 = 20\n\n1111 + 9999 = 1111 + 9999 = 11100\n<|im_start|>user\nCalculate the following expressions: 25-4*2+3 and 1111+9999. Display your calculations for added transparency.\n<|im_end|>\n<|im_start|>assistant\n25 - 4*2 + 3 = 25 - 8 + 3 = 20\n\n1111 + 9999 = 11100"
output_tokens: [364, 92543, 1008, 364, 47233, 410, 2863, 23727, 334, 262, 1040, 285, 319, 297, 314, 342, 308, 454, 262, 933, 933, 342, 1603, 1603, 281, 10765, 829, 28471, 500, 3854, 27626, 364, 92542, 364, 92543, 525, 11353, 364, 1040, 612, 262, 319, 297, 314, 619, 262, 308, 415, 262, 1040, 612, 262, 294, 619, 262, 308, 415, 262, 638, 402, 933, 933, 619, 262, 1603, 1603, 415, 262, 933, 933, 619, 262, 1603, 1603, 415, 262, 933, 1166, 92542, 364, 92543, 1008, 364, 47233, 410, 2863, 23727, 334, 262, 1040, 285, 319, 297, 314, 342, 308, 454, 262, 933, 933, 342, 1603, 1603, 281, 10765, 829, 28471, 500, 3854, 27626, 756, 92542, 364, 92543, 525, 11353, 364, 1040, 612, 262, 319, 297, 314, 619, 262, 308, 415, 262, 1040, 612, 262, 294, 619, 262, 308, 415, 262, 638, 402, 933, 933, 619, 262, 1603, 1603, 415, 262, 933, 1166, 92542]

###########
# Timings #
###########

mst_eval: 19.77  # ms / token during generation
mst_p_eval: 1070.98  # ms / token during prompt processing
mst_sample: 0.07  # ms / token during sampling
n_eval: 72  # number of tokens generated (excluding the first one)
n_p_eval: 184  # number of tokens processed in batches at the beginning
n_sample: 74  # number of sampled tokens
t_eval_us: 1423443  # total microseconds spent generating tokens
t_load_us: 432872591  # total microseconds spent loading the model
t_p_eval_us: 197060713  # total microseconds spent prompt processing
t_sample_us: 5047  # total microseconds spent sampling
ts_eval: 50.58  # tokens / second during generation
ts_p_eval: 0.93  # tokens / second during prompt processing
ts_sample: 14662.18  # tokens / second during sampling
dspasyuk commented 4 months ago

@RunningLeon I am using yesterday's release of llama-cli and it takes less than 3 seconds to load a Q8 model.

ngxson commented 4 months ago

cpu_has_cuda: true

n_gpu_layers: 999

You're offloading to an NVIDIA GPU. The load time depends heavily on the GPU model and the PCIe speed (neither of which you mentioned).
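One way to narrow this down is to time only the model-load step. Below is a minimal sketch (not from this thread) that assumes the llama.h C API as it existed around mid-2024 (llama_load_model_from_file, llama_model_default_params); it uses the model path and n_gpu_layers value from the log above.

```cpp
// Minimal sketch: measure only the model-load step.
// Assumes the llama.h C API of mid-2024; adjust names if your build differs.
#include <chrono>
#include <cstdio>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 999;  // offload all layers, as in the report
    mparams.use_mmap     = true; // default; matches no_mmap: false in the log

    const auto t0 = std::chrono::steady_clock::now();
    llama_model * model = llama_load_model_from_file(
        "./internlm2_5-7b-chat-fp16-control.gguf", mparams);
    const auto t1 = std::chrono::steady_clock::now();

    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    const double secs = std::chrono::duration<double>(t1 - t0).count();
    printf("model load took %.2f s\n", secs);

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

Running the same measurement inside the container and directly on the host should show whether the extra time is spent in the file-read/mmap path rather than on the GPU offload itself.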

RunningLeon commented 4 months ago

@dspasyuk @ngxson hi all. I run llama.cpp in a Docker container with shm-size set to 16 GB; I'm not sure if that is the problem. After switching to the host, loading becomes fast.

###########
# Timings #
###########

mst_eval: 17.15  # ms / token during generation
mst_p_eval: 322.98  # ms / token during prompt processing
mst_sample: 0.07  # ms / token during sampling
n_eval: 205  # number of tokens generated (excluding the first one)
n_p_eval: 227  # number of tokens processed in batches at the beginning
n_sample: 206  # number of sampled tokens
t_eval_us: 3515782  # total microseconds spent generating tokens
t_load_us: 4173895  # total microseconds spent loading the model
t_p_eval_us: 73317136  # total microseconds spent prompt processing
t_sample_us: 14204  # total microseconds spent sampling
ts_eval: 58.31  # tokens / second during generation
ts_p_eval: 3.10  # tokens / second during prompt processing
ts_sample: 14502.96  # tokens / second during sampling
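(For comparison: t_load_us drops from 432872591 µs ≈ 7.2 min inside the container to 4173895 µs ≈ 4.2 s on the host, roughly a 100x difference.)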