abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

[Question] Why does loading the model take 40 seconds longer compared with llama-cpp? #982

Open papipsycho opened 10 months ago

papipsycho commented 10 months ago

Hello Guys,

I'm wondering about performance, which is very strange. On the same server I ran the same model with the same query, and the load time is totally different between llama-cpp-python and llama.cpp, even though I passed the same options (like the number of threads, etc.) in both cases.

===== llama.cpp -->

llama_print_timings:        load time =    2062.63 ms
llama_print_timings:      sample time =      36.35 ms /    55 runs   (    0.66 ms per token,  1513.15 tokens per second)
llama_print_timings: prompt eval time =    7583.04 ms /    97 tokens (   78.18 ms per token,    12.79 tokens per second)
llama_print_timings:        eval time =    8272.12 ms /    54 runs   (  153.19 ms per token,     6.53 tokens per second)
llama_print_timings:       total time =   15922.60 ms

===== python -->

llama_print_timings:        load time =   42078.39 ms
llama_print_timings:      sample time =      92.67 ms /   142 runs   (    0.65 ms per token,  1532.30 tokens per second)
llama_print_timings: prompt eval time =  137100.44 ms /  1546 tokens (   88.68 ms per token,    11.28 tokens per second)
llama_print_timings:        eval time =   25390.80 ms /   141 runs   (  180.08 ms per token,     5.55 tokens per second)
llama_print_timings:       total time =  163280.32 ms

So I'm really wondering why there is such a huge difference. A minimal sketch of the kind of side-by-side comparison I mean is below.
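For reference, a minimal sketch of this kind of comparison from the Python side; the model path, prompt, context size, and thread count below are placeholders, not the exact values from the runs above:

# Sketch only: mirror the llama.cpp CLI run, e.g.
#   ./main -m ./model.gguf -t 8 -c 2048 -n 128 -p "..."
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # placeholder path
    n_threads=8,                # match the CLI -t value
    n_ctx=2048,                 # match the CLI -c value
    verbose=True,               # prints the same llama_print_timings block on stderr
)

out = llm("...", max_tokens=128)  # same prompt and -n as the CLI run
print(out["choices"][0]["text"])

With verbose=True both sides emit the llama_print_timings lines, so the load / prompt eval / eval numbers can be compared directly.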


cegorah commented 8 months ago

Hey @papipsycho, could you provide some more details: a code sample and the system you use? I also noticed a slight difference in the numbers on my setup, but it's about 2x rather than 20x like in your example.

beepbopbeepboop commented 5 months ago

When running Mixtral 8x7B, Q3, on a 32 GB Ubuntu box:

llama_print_timings:        load time =  107317.09 ms
llama_print_timings:      sample time =      17.25 ms /    32 runs   (    0.54 ms per token,  1854.64 tokens per second)
llama_print_timings: prompt eval time =  107297.40 ms /    11 tokens ( 9754.31 ms per token,     0.10 tokens per second)
llama_print_timings:        eval time =   15622.09 ms /    31 runs   (  503.94 ms per token,     1.98 tokens per second)
llama_print_timings:       total time =  123508.55 ms /    42 tokens

real    4m35.576s
user    3m19.488s
sys     0m20.319s

versus:

llama_print_timings:        load time =    1394.81 ms
llama_print_timings:      sample time =       1.13 ms /    34 runs   (    0.03 ms per token, 30035.34 tokens per second)
llama_print_timings: prompt eval time =    1089.77 ms /    11 tokens (   99.07 ms per token,    10.09 tokens per second)
llama_print_timings:        eval time =    6138.24 ms /    33 runs   (  186.01 ms per token,     5.38 tokens per second)
llama_print_timings:       total time =    7237.49 ms /    44 tokens
Log end

real    0m9.051s
user    1m0.074s
sys     0m1.514s

The slowness comes from llm_load_tensors: the fast run feels like a cached mmap, while the slow run feels like no cache and no mmap, or a pathetically slow recompute of state.
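A minimal sketch of how one might check this from the Python side, assuming a placeholder model path; use_mmap and use_mlock are Llama() constructor parameters, and verbose=True keeps the llm_load_tensors output visible on stderr:

from llama_cpp import Llama

# Sketch: force mmap on (and optionally pin pages) to see whether
# llm_load_tensors still dominates the load time.
llm = Llama(
    model_path="./mixtral-8x7b-instruct-q3.gguf",  # placeholder path
    n_threads=8,
    use_mmap=True,    # map the file instead of reading it into heap memory
    use_mlock=False,  # set True to pin pages so they cannot be swapped out
    verbose=True,     # llm_load_tensors timing is printed on stderr
)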

The differences in stderr are as follows. In the slow run:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |

Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '12', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'llama.expert_count': '8', 'llama.context_length': '32768', 'general.name': 'mistralai_mixtral-8x7b-instruct-v0.1', 'llama.expert_used_count': '2'}

Guessed chat format: mistral-instruct

and in the fast run:

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 17024, n_batch = 2048, n_predict = -1, n_keep = 1

and the reason why, from vmstat (the si/so columns show heavy swap-in/swap-out while the model is being loaded):

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd     free buff    cache    si    so     bi    bo    in    cs us sy id wa st
 0  0 2997820  9757100   40 21238692    18    26    269   363    38     7 46 15 38  1  0
 1  0 2997820  9757100   40 21238820     1     0      2     2  1515  2402  0  0 100  0  0
 2  1 2996284  7350288   40 21238836   558     0    558   826  1949  2876  0  1 96  2  0
 8  0 2995004  7259064   40 21238888   153     0    153    10  5134  4458 50  0 49  0  0
 1  0 2995004  9756612   40 21238976     0     0      0   867  4495  4128 37  1 62  0  0
 0  0 2995004  9756360   40 21239136     0     0      0    14  1522  2431  0  0 100  0  0
 0  0 2995004  9755352   40 21239608     0     0      0  1385  1694  2617  0  0 100  0  0
 1  1 2995004  9733292   40 21245844     4     0   1229     0  1923  2952  0  0 97  2  0
 2  0 2995004  1865488   40 21261644     0     0   3032   838  2539  3546  1  4 92  3  0
 0  1 3010260   281080    4 15364388     0  3052  88850  3063 10587  4853  1  6 87  6  0
 1  2 3219552   230232    4 14803076     5 41768 152455 42256  5661  6803  0  2 70 28  0
 1  0 3592280   253952    4 14628304     3 74520 107196 75331  5735  4681  0  2 87 11  0
 1  2 3945292   256152   16 14309028     3 70550 129255 70576  5708  4863  0  2 88  9  0
 1  0 4284452   290328   40 14122408    11 67846 103806 69184  5233  4467  0  2 88 10  0
 0  2 4699896   230648   40 14005196     0 83106 117317 83106  5300  4443  0  2 88  9  0
 0  1 5068500   262312   40 13621408     6 73736 145457 74595  6607  5243  0  3 89  8  0
 3  3 5387632   319596   40 13186312     9 63830 143133 63830  5397  5674  0  3 82 15  0
 1  1 5416304   229104   40 11026712  1778  6918  23504  7503  3556  4374  0  2 89  9  0
 1  9 5458960   255100   40 10956860 18941  9530  19665 10395  3720  5104  2  0 72 26  0
 1 10 5526128   236568   40 10900552 29580 13447  33224 14376  4470  5841  5  1 46 48  0
 1  9 5609748   260876   40 10793480 32328 16699  33903 16701  4813  5267  5  1 48 45  0
 6  5 5693940   238880   40 10744844 29015 16855  30929 17540 16338  5718 13  1 46 40  0
 0  8 5718828   252288   40 10575128 37498  4956  37606  5665  4501  5600  3  1 52 44  0
 0  9 5718828   276180   40 10340688 43263     1  43374     1  4367  5613  2  1 52 45  0
 1  8 5745204   279988   40 10173216 38936  5278  39125  6116  4486  5586  6  1 56 38  0
 1 12 5774900   254572   40 10042116 36288  5965  36742  6301  7143  6430  4  1 53 43  0
 1 13 5823764   231616   40  9968424 29737  9770  33574 10628  4372  5705  9  1 50 40  0
 8  7 5904492   292532   40  9853572 26110 16112  37657 16114 23345  6091 15  1 43 42  0
 1 10 5988376   242844   40  9835216 29684 16893  43583 17744  5791  8596 10  1 43 46  0
 0  9 6053340   267112   40  9727992 30500 12992  31040 12992  3851  5006  4  1 51 44  0
 0 12 6082012   230984   40  9679040 25314  5752  29622  6660  4045  5283  6  1 47 46  0
 9  1 6156928   231828   40  9624692 27004 15059  27551 15059  9753  5058 10  1 49 40  0
 1  8 6227064   253616   40  9519760 31408 14013  31418 15398  6337  4917  3  1 57 40  0
 2  7 6248136   257688   40  9326128 40658  4242  40669  4754  4128  5342  2  1 58 39  0
 8  0 6248136   234764   40  9173968 34870     0  34870     6  4682  5296 17  1 51 31  0
 1  7 6248144   233028   40  8996804 35709    62  35798   558  4520  5309  7  1 51 41  0
 8  0 6248912   264324   40  8835144 27969   216  30906   746  7197  5486 19  1 48 32  0
 9  0 6249424   248732   40  8777704 14820   108  16069  1018  9075  5175 35  1 51 14  0
 9  0 6249168   242524   40  8777116    54     0     64   543  5273  4506 49  0 50  0  0
 9  0 6248912   241916   40  8776616    44     0     62   670  5100  4299 49  0 50  0  0
 1  1 6234832   244024   40  8743916  3846    40   3988   885  4796  5318 29  0 68  3  0
 0  0 3021776 22348492   40  8754004  2230     0   4269     2  1998  3001  0  1 97  1  0
 1  0 3021776 22347488   40  8754084     0     0      0   862  1802  2782  0  0 100  0  0
 0  0 3021776 22346484   40  8754288     8     0     63     0  1569  2464  0  0 99  0  0
 1  0 3021520 22341952   40  8758388     2     0      2   891  1636  2547  0  0 99  0  0
 1  0 3021520 22341204   40  8758744     1     0      4   353  1434  2390  0  0 100  0  0
 1  0 3021520 22340952   40  8758864     0     0      0   853  1680  2640  0  0 100  0  0
 2  0 3021520 22340480   40  8758976     0     0      0     2  1505  2428  0  0 100  0  0
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd     free buff    cache    si    so     bi    bo    in    cs us sy id wa st
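A minimal, purely illustrative sketch of sanity-checking whether the model file even fits in currently available RAM before loading; the model path is a placeholder, and MemAvailable is read from /proc/meminfo (Linux only):

import os

def mem_available_bytes():
    """Read MemAvailable from /proc/meminfo (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024  # value is in kB
    raise RuntimeError("MemAvailable not found")

model_path = "./mixtral-8x7b-instruct-q3.gguf"  # placeholder path
model_size = os.path.getsize(model_path)
avail = mem_available_bytes()
print(f"model: {model_size / 2**30:.1f} GiB, available RAM: {avail / 2**30:.1f} GiB")
if model_size > avail:
    print("Model does not fit in free RAM; expect swapping and a very slow load.")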

The first is fast, the second is slow. The log in both cases should print the version number; then we can tell whether they are the same version. There could be speed improvements in the latest codebase if you are using old code, or maybe it's a different configuration. I'm doing CPU only and don't have the optional extra CUDA bits for either.
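A minimal sketch for printing the binding version and the compiled feature flags from Python, so the two setups can be compared; llama_print_system_info is assumed to be exposed by the installed llama_cpp module (it binds the underlying llama.cpp call and returns bytes):

import llama_cpp

# Version of the Python binding (the llama.cpp build it wraps is shown in the
# verbose load log).
print("llama-cpp-python version:", llama_cpp.__version__)

# Compiled feature flags (AVX, BLAS, ...) reported by the underlying llama.cpp.
print(llama_cpp.llama_print_system_info().decode())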

So, only 26x slower.