ggerganov / ggml

Tensor library for machine learning
MIT License

Metal support #397

Open matthoffner opened 1 year ago

matthoffner commented 1 year ago

Hello! I was curious if anyone has gotten models like MPT and Starcoder to work with GGML on the M1 specifically, using Metal/GPU. Thanks.

mateusz commented 1 year ago

Yeah, it's not implemented in the examples. I came here looking to run WizardCoder on MPS (Metal) on a Mac, but no dice. WizardCoder has a different architecture than Llama, and I haven't found an MPS implementation for it yet; if you find one, let me know :)

matthoffner commented 1 year ago

The linked MR shows that we can now pass -DGGML_METAL=on. I'm assuming model-specific work is still required to support individual architectures.

I did end up getting WizardCoder and Metal to work with mlc-llm, but ggml has been plenty fast on cpu for me.
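For anyone landing here, a minimal build sketch with the flag mentioned above. Only -DGGML_METAL=on is from this thread; the clone URL and the rest of the CMake invocation are assumed standard usage:

```shell
# Sketch: enabling the Metal backend when building ggml with CMake
# (-DGGML_METAL=on is the flag discussed in this thread; everything
# else is ordinary CMake boilerplate and may differ on your checkout)
git clone https://github.com/ggerganov/ggml
cd ggml
cmake -B build -DGGML_METAL=on
cmake --build build --config Release
```

As noted below, enabling the flag only links the Metal code; the example programs still have to make use of it.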

mateusz commented 1 year ago

Yes, you can compile with -DGGML_METAL=on, and it does link the Metal files, but nothing in the "userland" part (the example) actually uses them yet.

Oh interesting, what results are you getting on mlc-llm vs. on CPU?

Re "plenty fast": the speedup from Metal is significant (tested here on WizardLM with llama.cpp, since so far I haven't been able to run WizardCoder on Metal):

bin/main -m ../../models/wizardLM-7B.ggmlv3.q4_0.bin -n 128 -ngl 0 --ignore-eos --mlock -t 4 -s 42 -n 256 -p "Llama is faster when "
...
llama_print_timings:        eval time = 14095.91 ms /   255 runs   (   55.28 ms per token,    18.09 tokens per second)

vs. on Metal (the only change is -ngl 0 → -ngl 1):

bin/main -m ../../models/wizardLM-7B.ggmlv3.q4_0.bin -n 128 -ngl 1 --ignore-eos --mlock -t 4 -s 42 -n 256 -p "Llama is faster when "
...
llama_print_timings:        eval time =  7464.27 ms /   255 runs   (   29.27 ms per token,    34.16 tokens per second)
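To put a number on the difference, the per-token timings above work out to roughly a 1.9x speedup. A quick sketch of the arithmetic, using the ms-per-token values copied from the two benchmark runs:

```shell
# Per-token latencies from the llama_print_timings lines above:
#   CPU (-ngl 0):   55.28 ms per token
#   Metal (-ngl 1): 29.27 ms per token
cpu_ms=55.28
metal_ms=29.27

# Speedup = CPU latency / Metal latency
speedup=$(awk -v a="$cpu_ms" -v b="$metal_ms" 'BEGIN { printf "%.2f", a / b }')
echo "Metal speedup: ${speedup}x"   # prints "Metal speedup: 1.89x"
```

The same ratio falls out of the tokens-per-second figures (34.16 / 18.09 ≈ 1.89).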