QwenLM / qwen.cpp

C++ implementation of Qwen-LM

feat: add metal support #39

Closed fann1993814 closed 7 months ago

fann1993814 commented 8 months ago
  1. Follow chatglm.cpp's implementation.
  2. Reduce max_length at initialization: MacBook Air M1's unified memory cannot hold the original setting (2048) and the GPU runs out of memory. A lower max_length also minimizes the compute space needed for the KV cache.
  3. Scale MEM_SIZE and SCRATCH_SIZE so they remain reasonable for the modified max_length.

My environment is a MacBook Air M1 with 8 GB of memory. I ran the command `./build/bin/main -m qwen7b-ggml.bin -l 512 -v --tiktoken qwen.tiktoken -p hello`. With only 8 GB, max_length cannot be set too large.

Experiments

- CPU (M1 CPU, master)

Hello! How can I help you today?

prompt time: 4385.79 ms / 20 tokens (219.289 ms/token)
output time: 67534 ms / 10 tokens (6753.4 ms/token)
total time: 71919.8 ms

- CPU (M1 CPU, this PR)

./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello

system info: | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | METAL = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
inference config: | max_length = 128 | max_context_length = 512 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 0 |
loaded qwen model from qwen7b-ggml.bin within: 80.018 ms

Hello! How can I help you today? Is there something you would like to talk about or learn more about? I'm here to answer any questions you may have.

prompt time: 5553.58 ms / 20 tokens (277.679 ms/token)
output time: 3417.43 ms / 35 tokens (97.64 ms/token)
total time: 8971.01 ms

- GPU (M1 GPU with Metal, this PR)

./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello

system info: | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | METAL = 1 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
inference config: | max_length = 128 | max_context_length = 512 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 0 |
loaded qwen model from qwen7b-ggml.bin within: 122.671 ms

Hello! How can I help you today?

prompt time: 460.668 ms / 20 tokens (23.033 ms/token)
output time: 811.612 ms / 10 tokens (81.161 ms/token)
total time: 1272.28 ms



Time Spent (output time; lower is better)

| CPU (master) | CPU (this PR) | GPU (this PR) |
| --- | --- | --- |
| 6753.4 ms/token | 97.64 ms/token | 81.161 ms/token |