abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

[Question] Why does loading the model take 40 seconds longer compared with llama-cpp? #982

Open papipsycho opened 10 months ago

papipsycho commented 10 months ago

Hello Guys,

I'm wondering about performance, which is very strange. On the same server I ran the same model with the same query, and the load time is totally different between llama-cpp-python and llama.cpp, even though I passed the same options (like the number of threads, etc.) in both cases.

===== llama.cpp -->

llama_print_timings:        load time =    2062.63 ms
llama_print_timings:      sample time =      36.35 ms /    55 runs   (    0.66 ms per token,  1513.15 tokens per second)
llama_print_timings: prompt eval time =    7583.04 ms /    97 tokens (   78.18 ms per token,    12.79 tokens per second)
llama_print_timings:        eval time =    8272.12 ms /    54 runs   (  153.19 ms per token,     6.53 tokens per second)
llama_print_timings:       total time =   15922.60 ms

===== python -->

llama_print_timings:        load time =   42078.39 ms
llama_print_timings:      sample time =      92.67 ms /   142 runs   (    0.65 ms per token,  1532.30 tokens per second)
llama_print_timings: prompt eval time =  137100.44 ms /  1546 tokens (   88.68 ms per token,    11.28 tokens per second)
llama_print_timings:        eval time =   25390.80 ms /   141 runs   (  180.08 ms per token,     5.55 tokens per second)
llama_print_timings:       total time =  163280.32 ms

So I'm really wondering why there is such a huge difference. A minimal sketch of the kind of side-by-side comparison I mean is below.
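For reference, a minimal sketch of this kind of comparison from the Python side; the model path, prompt, context size, and thread count below are placeholders, not the exact values from the runs above:

# Sketch only: mirror the llama.cpp CLI run, e.g.
#   ./main -m ./model.gguf -t 8 -c 2048 -n 128 -p "..."
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # placeholder path
    n_threads=8,                # match the CLI -t value
    n_ctx=2048,                 # match the CLI -c value
    verbose=True,               # prints the same llama_print_timings block on stderr
)

out = llm("...", max_tokens=128)  # same prompt and -n as the CLI run
print(out["choices"][0]["text"])

With verbose=True both sides emit the llama_print_timings lines, so the load / prompt eval / eval numbers can be compared directly.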


cegorah commented 8 months ago

Hey @papipsycho, could you provide some more details: a code sample and the system you use? I also noticed a slight difference in the numbers on my setup, but it's about 2x rather than 20x like in your example.

beepbopbeepboop commented 5 months ago

When running Mixtral 8x7B, Q3, on a 32 GB Ubuntu box:

llama_print_timings:        load time =  107317.09 ms
llama_print_timings:      sample time =      17.25 ms /    32 runs   (    0.54 ms per token,  1854.64 tokens per second)
llama_print_timings: prompt eval time =  107297.40 ms /    11 tokens ( 9754.31 ms per token,     0.10 tokens per second)
llama_print_timings:        eval time =   15622.09 ms /    31 runs   (  503.94 ms per token,     1.98 tokens per second)
llama_print_timings:       total time =  123508.55 ms /    42 tokens

real    4m35.576s
user    3m19.488s
sys     0m20.319s

versus:

llama_print_timings:        load time =    1394.81 ms
llama_print_timings:      sample time =       1.13 ms /    34 runs   (    0.03 ms per token, 30035.34 tokens per second)
llama_print_timings: prompt eval time =    1089.77 ms /    11 tokens (   99.07 ms per token,    10.09 tokens per second)
llama_print_timings:        eval time =    6138.24 ms /    33 runs   (  186.01 ms per token,     5.38 tokens per second)
llama_print_timings:       total time =    7237.49 ms /    44 tokens
Log end

real    0m9.051s
user    1m0.074s
sys     0m1.514s

The slowness comes from llm_load_tensors: the fast run feels like a cached mmap, while the slow run feels like no cache and no mmap, or a pathetically slow recompute of state.
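A minimal sketch of how one might check this from the Python side, assuming a placeholder model path; use_mmap and use_mlock are Llama() constructor parameters, and verbose=True keeps the llm_load_tensors output visible on stderr:

from llama_cpp import Llama

# Sketch: force mmap on (and optionally pin pages) to see whether
# llm_load_tensors still dominates the load time.
llm = Llama(
    model_path="./mixtral-8x7b-instruct-q3.gguf",  # placeholder path
    n_threads=8,
    use_mmap=True,    # map the file instead of reading it into heap memory
    use_mlock=False,  # set True to pin pages so they cannot be swapped out
    verbose=True,     # llm_load_tensors timing is printed on stderr
)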

The differences in stderr are as follows. In the slow run:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |

Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '12', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'llama.expert_count': '8', 'llama.context_length': '32768', 'general.name': 'mistralai_mixtral-8x7b-instruct-v0.1', 'llama.expert_used_count': '2'}

Guessed chat format: mistral-instruct

and in the fast run:

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 17024, n_batch = 2048, n_predict = -1, n_keep = 1

and the reason why, from vmstat (the si/so columns show heavy swap-in/swap-out while the model is being loaded):

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd     free buff    cache    si    so     bi    bo    in    cs us sy id wa st
 0  0 2997820  9757100   40 21238692    18    26    269   363    38     7 46 15 38  1  0
 1  0 2997820  9757100   40 21238820     1     0      2     2  1515  2402  0  0 100  0  0
 2  1 2996284  7350288   40 21238836   558     0    558   826  1949  2876  0  1 96  2  0
 8  0 2995004  7259064   40 21238888   153     0    153    10  5134  4458 50  0 49  0  0
 1  0 2995004  9756612   40 21238976     0     0      0   867  4495  4128 37  1 62  0  0
 0  0 2995004  9756360   40 21239136     0     0      0    14  1522  2431  0  0 100  0  0
 0  0 2995004  9755352   40 21239608     0     0      0  1385  1694  2617  0  0 100  0  0
 1  1 2995004  9733292   40 21245844     4     0   1229     0  1923  2952  0  0 97  2  0
 2  0 2995004  1865488   40 21261644     0     0   3032   838  2539  3546  1  4 92  3  0
 0  1 3010260   281080    4 15364388     0  3052  88850  3063 10587  4853  1  6 87  6  0
 1  2 3219552   230232    4 14803076     5 41768 152455 42256  5661  6803  0  2 70 28  0
 1  0 3592280   253952    4 14628304     3 74520 107196 75331  5735  4681  0  2 87 11  0
 1  2 3945292   256152   16 14309028     3 70550 129255 70576  5708  4863  0  2 88  9  0
 1  0 4284452   290328   40 14122408    11 67846 103806 69184  5233  4467  0  2 88 10  0
 0  2 4699896   230648   40 14005196     0 83106 117317 83106  5300  4443  0  2 88  9  0
 0  1 5068500   262312   40 13621408     6 73736 145457 74595  6607  5243  0  3 89  8  0
 3  3 5387632   319596   40 13186312     9 63830 143133 63830  5397  5674  0  3 82 15  0
 1  1 5416304   229104   40 11026712  1778  6918  23504  7503  3556  4374  0  2 89  9  0
 1  9 5458960   255100   40 10956860 18941  9530  19665 10395  3720  5104  2  0 72 26  0
 1 10 5526128   236568   40 10900552 29580 13447  33224 14376  4470  5841  5  1 46 48  0
 1  9 5609748   260876   40 10793480 32328 16699  33903 16701  4813  5267  5  1 48 45  0
 6  5 5693940   238880   40 10744844 29015 16855  30929 17540 16338  5718 13  1 46 40  0
 0  8 5718828   252288   40 10575128 37498  4956  37606  5665  4501  5600  3  1 52 44  0
 0  9 5718828   276180   40 10340688 43263     1  43374     1  4367  5613  2  1 52 45  0
 1  8 5745204   279988   40 10173216 38936  5278  39125  6116  4486  5586  6  1 56 38  0
 1 12 5774900   254572   40 10042116 36288  5965  36742  6301  7143  6430  4  1 53 43  0
 1 13 5823764   231616   40  9968424 29737  9770  33574 10628  4372  5705  9  1 50 40  0
 8  7 5904492   292532   40  9853572 26110 16112  37657 16114 23345  6091 15  1 43 42  0
 1 10 5988376   242844   40  9835216 29684 16893  43583 17744  5791  8596 10  1 43 46  0
 0  9 6053340   267112   40  9727992 30500 12992  31040 12992  3851  5006  4  1 51 44  0
 0 12 6082012   230984   40  9679040 25314  5752  29622  6660  4045  5283  6  1 47 46  0
 9  1 6156928   231828   40  9624692 27004 15059  27551 15059  9753  5058 10  1 49 40  0
 1  8 6227064   253616   40  9519760 31408 14013  31418 15398  6337  4917  3  1 57 40  0
 2  7 6248136   257688   40  9326128 40658  4242  40669  4754  4128  5342  2  1 58 39  0
 8  0 6248136   234764   40  9173968 34870     0  34870     6  4682  5296 17  1 51 31  0
 1  7 6248144   233028   40  8996804 35709    62  35798   558  4520  5309  7  1 51 41  0
 8  0 6248912   264324   40  8835144 27969   216  30906   746  7197  5486 19  1 48 32  0
 9  0 6249424   248732   40  8777704 14820   108  16069  1018  9075  5175 35  1 51 14  0
 9  0 6249168   242524   40  8777116    54     0     64   543  5273  4506 49  0 50  0  0
 9  0 6248912   241916   40  8776616    44     0     62   670  5100  4299 49  0 50  0  0
 1  1 6234832   244024   40  8743916  3846    40   3988   885  4796  5318 29  0 68  3  0
 0  0 3021776 22348492   40  8754004  2230     0   4269     2  1998  3001  0  1 97  1  0
 1  0 3021776 22347488   40  8754084     0     0      0   862  1802  2782  0  0 100  0  0
 0  0 3021776 22346484   40  8754288     8     0     63     0  1569  2464  0  0 99  0  0
 1  0 3021520 22341952   40  8758388     2     0      2   891  1636  2547  0  0 99  0  0
 1  0 3021520 22341204   40  8758744     1     0      4   353  1434  2390  0  0 100  0  0
 1  0 3021520 22340952   40  8758864     0     0      0   853  1680  2640  0  0 100  0  0
 2  0 3021520 22340480   40  8758976     0     0      0     2  1505  2428  0  0 100  0  0
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd     free buff    cache    si    so     bi    bo    in    cs us sy id wa st
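A minimal, purely illustrative sketch of sanity-checking whether the model file even fits in currently available RAM before loading; the model path is a placeholder, and MemAvailable is read from /proc/meminfo (Linux only):

import os

def mem_available_bytes():
    """Read MemAvailable from /proc/meminfo (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024  # value is in kB
    raise RuntimeError("MemAvailable not found")

model_path = "./mixtral-8x7b-instruct-q3.gguf"  # placeholder path
model_size = os.path.getsize(model_path)
avail = mem_available_bytes()
print(f"model: {model_size / 2**30:.1f} GiB, available RAM: {avail / 2**30:.1f} GiB")
if model_size > avail:
    print("Model does not fit in free RAM; expect swapping and a very slow load.")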

The first is fast, the second is slow. The log in both cases should print the version number; then we can tell whether they are the same version. There could be speed improvements in the latest codebase if you are using old code, or maybe it's a different configuration. I'm doing CPU only and don't have the optional extra CUDA bits for either.
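A minimal sketch for printing the binding version and the compiled feature flags from Python, so the two setups can be compared; llama_print_system_info is assumed to be exposed by the installed llama_cpp module (it binds the underlying llama.cpp call and returns bytes):

import llama_cpp

# Version of the Python binding (the llama.cpp build it wraps is shown in the
# verbose load log).
print("llama-cpp-python version:", llama_cpp.__version__)

# Compiled feature flags (AVX, BLAS, ...) reported by the underlying llama.cpp.
print(llama_cpp.llama_print_system_info().decode())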

So, only 26x slower.