Hi @simonJJJ, I am glad to see your update for M1/M2 support, thanks. With it merged, I can close my previous PR #39.
This PR adds some features that help resource-limited machines:

- Make `max_length` configurable at pipeline creation, because a MacBook Air M1's RAM cannot hold the original setting (the model's training context length is too long) and the GPU easily runs out of memory. A lower `max_length` also shrinks the compute space needed for the KV cache.
- Scale `MEM_SIZE` and `SCRATCH_SIZE` so they stay consistent with the modified `max_length`.
My experiments with this PR are shown below.
Experiment Settings
```
$ ./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello
Hello! How can I help you today?
prompt time: 4385.79 ms / 20 tokens (219.289 ms/token)
output time: 67534 ms / 10 tokens (6753.4 ms/token)
total time: 71919.8 ms
```
This run could not complete because it ran out of memory (OOM).
```
$ ./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello
system info: | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | METAL = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
inference config: | max_length = 128 | max_context_length = 512 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 0 |
loaded qwen model from qwen7b-ggml.bin within: 80.018 ms
Hello! How can I help you today? Is there something you would like to talk about or learn more about? I'm here to answer any questions you may have.
prompt time: 5553.58 ms / 20 tokens (277.679 ms/token)
output time: 3417.43 ms / 35 tokens (97.64 ms/token)
total time: 8971.01 ms
```
```
$ ./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello
system info: | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | METAL = 1 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
inference config: | max_length = 128 | max_context_length = 512 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 0 |
loaded qwen model from qwen7b-ggml.bin within: 122.671 ms
Hello! How can I help you today?
prompt time: 460.668 ms / 20 tokens (23.033 ms/token)
output time: 811.612 ms / 10 tokens (81.161 ms/token)
total time: 1272.28 ms
```