matthoffner opened this issue 1 year ago
Yeah not implemented in the examples - came here looking to run WizardCoder on MPS (metal) on mac, but no dice. WizardCoder has a different architecture than Llama, and I haven't found any MPS implementation yet - if you do let me know :)
The linked MR shows that we can now pass `-DGGML_METAL=on`. I'm assuming specific work is still required to support individual models.
I did end up getting WizardCoder and Metal to work with mlc-llm, but ggml has been plenty fast on cpu for me.
Yes, you can compile with `-DGGML_METAL=on` and it does link the files, but its use is not implemented in the "userland" part (the example).
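For anyone following along, a minimal sketch of the build described above (the `-DGGML_METAL=on` flag is from the linked MR; the model path and prompt here are just placeholders):

```shell
# Configure and build with Metal support enabled
mkdir -p build && cd build
cmake .. -DGGML_METAL=on
cmake --build . --config Release

# At runtime, offload layers to the GPU with -ngl (0 = CPU only);
# the model path below is illustrative
./bin/main -m ../models/your-model.ggmlv3.q4_0.bin -ngl 1 -p "hello"
```

Whether inference actually runs on the GPU still depends on the example wiring it up, per the comment above.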
Oh interesting, what results are you getting on mlc-llm vs on CPU?
Re "plenty fast": the speedup from Metal is significant (tested here on WizardLM with llama.cpp, because so far I haven't been able to run WizardCoder on Metal):
```
bin/main -m ../../models/wizardLM-7B.ggmlv3.q4_0.bin -n 128 -ngl 0 --ignore-eos --mlock -t 4 -s 42 -n 256 -p "Llama is faster when "
...
llama_print_timings: eval time = 14095.91 ms / 255 runs ( 55.28 ms per token, 18.09 tokens per second)
```
vs. on Metal:

```
bin/main -m ../../models/wizardLM-7B.ggmlv3.q4_0.bin -n 128 -ngl 1 --ignore-eos --mlock -t 4 -s 42 -n 256 -p "Llama is faster when "
...
llama_print_timings: eval time = 7464.27 ms / 255 runs ( 29.27 ms per token, 34.16 tokens per second)
```
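To make the comparison concrete, the tokens/second figures follow directly from the eval times printed above (values hard-coded from the two runs):

```shell
# Derive tokens/sec and speedup from the two llama_print_timings lines above
cpu_ms=14095.91    # eval time, CPU run (-ngl 0)
metal_ms=7464.27   # eval time, Metal run (-ngl 1)
runs=255           # tokens generated in both runs

awk -v c="$cpu_ms" -v m="$metal_ms" -v n="$runs" 'BEGIN {
  printf "cpu:    %.2f tok/s\n", n / (c / 1000)
  printf "metal:  %.2f tok/s\n", n / (m / 1000)
  printf "speedup: %.2fx\n",     c / m
}'
# cpu:    18.09 tok/s
# metal:  34.16 tok/s
# speedup: 1.89x
```

So a single offloaded layer (`-ngl 1`) already roughly doubles throughput on this 7B q4_0 model.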
Hello! I was curious whether anyone has gotten models like MPT and StarCoder to work with GGML on the M1 specifically, using Metal/GPU. Thanks.