QwenLM / qwen.cpp

C++ implementation of Qwen-LM

feat: add more max_length constraint for resource limit machines #41

Open fann1993814 opened 7 months ago

fann1993814 commented 7 months ago

Hi @simonJJJ, I am glad to see your update for M1/M2 support, thanks. With that merged, I can close my previous PR #39.

This PR adds some features that help resource-limited machines run inference.

Here are my experiments for this PR.

Experiment Settings

- CPU (M1 CPU, master branch)

Hello! How can I help you today?

prompt time: 4385.79 ms / 20 tokens (219.289 ms/token)
output time: 67534 ms / 10 tokens (6753.4 ms/token)
total time: 71919.8 ms

- GPU (M1 GPU, master branch)

It cannot run because of an out-of-memory (OOM) error.

- CPU (M1 CPU, this PR)

./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello
system info: | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | METAL = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
inference config: | max_length = 128 | max_context_length = 512 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 0 |
loaded qwen model from qwen7b-ggml.bin within: 80.018 ms

Hello! How can I help you today? Is there something you would like to talk about or learn more about? I'm here to answer any questions you may have.

prompt time: 5553.58 ms / 20 tokens (277.679 ms/token)
output time: 3417.43 ms / 35 tokens (97.64 ms/token)
total time: 8971.01 ms

- GPU (M1 GPU with Metal, this PR)

./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello
system info: | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | METAL = 1 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
inference config: | max_length = 128 | max_context_length = 512 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 0 |
loaded qwen model from qwen7b-ggml.bin within: 122.671 ms

Hello! How can I help you today?

prompt time: 460.668 ms / 20 tokens (23.033 ms/token)
output time: 811.612 ms / 10 tokens (81.161 ms/token)
total time: 1272.28 ms
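For anyone who wants to double-check the per-token figures in the timing lines above, here is a minimal Python sketch. It is not part of qwen.cpp; the regex and the `parse_timing` helper are hypothetical and simply assume the `<name> time: <ms> ms / <n> tokens (<x> ms/token)` format shown in the logs.

```python
import re

# Hypothetical helper, not part of qwen.cpp. Matches one
# "<name> time: <ms> ms / <n> tokens (<x> ms/token)" entry as printed above.
ENTRY = re.compile(
    r"(\w+) time: ([\d.]+) ms / (\d+) tokens \(([\d.]+) ms/token\)"
)


def parse_timing(line):
    """Return {name: (total_ms, tokens, reported_ms_per_token)}."""
    return {
        name: (float(ms), int(tokens), float(per))
        for name, ms, tokens, per in ENTRY.findall(line)
    }


# Timing output of the Metal run, copied from the log above.
line = (
    "prompt time: 460.668 ms / 20 tokens (23.033 ms/token) "
    "output time: 811.612 ms / 10 tokens (81.161 ms/token) "
    "total time: 1272.28 ms"
)

timing = parse_timing(line)
for name, (ms, tokens, per) in timing.items():
    # Recompute the per-token latency and show it next to the reported value.
    print(f"{name}: {ms / tokens:.3f} ms/token (reported {per})")
```

The "total time" entry has no token count, so the pattern skips it on purpose; only the prompt and output phases are parsed.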



Time Spent (output time, lower is better)

| CPU (master) | GPU (master) | CPU (this PR) | GPU (this PR) |
| --- | --- | --- | --- |
| 6753.4 ms/token | OOM | 97.64 ms/token | 81.161 ms/token |
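To put the table in relative terms, the short Python sketch below converts each output-time figure to tokens per second and to a speedup over the CPU (master) run. The dictionary keys are just labels chosen for illustration. The GPU (master) run is left out because it hit OOM, and these are single runs that generated different numbers of tokens, so the ratios are only indicative.

```python
# Output-time figures (ms/token) copied from the table above.
# GPU (master) is omitted because that run failed with OOM.
output_ms_per_token = {
    "CPU (master)": 6753.4,
    "CPU (this PR)": 97.64,
    "GPU (this PR)": 81.161,
}

baseline = output_ms_per_token["CPU (master)"]
speedups = {}
for name, ms in output_ms_per_token.items():
    tokens_per_s = 1000.0 / ms      # throughput
    speedups[name] = baseline / ms  # speedup relative to CPU (master)
    print(f"{name}: {tokens_per_s:.2f} tokens/s, "
          f"{speedups[name]:.1f}x vs CPU (master)")
```

By this measure, output generation with this PR is roughly 69 times faster on the CPU and roughly 83 times faster on the GPU than the CPU (master) run.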