armbues / SiLLM

SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.

slowness of sillm.chat on M2 Air with 16GB Ram #5

Closed: kylewadegrove closed this issue 4 months ago

kylewadegrove commented 4 months ago

Any invocation of python -m sillm.chat model seems much slower on my machine than in the reference video: more than a minute to get to the prompt, and maybe 1-2 tokens per minute in the response.

I have tried sillm.chat with two different models downloaded from HF via the download.py scripts in the SiLLM-examples repo: Mistral-7B-Instruct-v0.2 and Qwen1.5-7B-Chat. A Llama-3 model that I downloaded directly from HF exhibited the same behavior.

Machine specs: MacBook Air M2 16GB memory on Sonoma 14.4.1, running Python 3.12.3 in a conda environment.

kylewadegrove commented 4 months ago

This might be an mlx-lm issue, as mlx_lm.generate is also very slow.
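
For reference, a comparable mlx-lm run would look something like the sketch below (the model path is a placeholder for the locally downloaded weights, and the prompt/token count are arbitrary):

```sh
# Generate a short completion with mlx-lm to compare raw generation speed
python -m mlx_lm.generate --model /path/to/Mistral-7B-Instruct-v0.2 \
    --prompt "Hello" --max-tokens 100
```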

magnusviri commented 4 months ago

I can run sillm.chat with Llama3 8b at 10.79 tok/sec on a MacBook Air M1 2020 w/ 16GB RAM. I'm using Python 3.11.

armbues commented 4 months ago

I suspect the inference is memory-constrained in this case if you're trying to run the full 7B and 8B models. Without quantization, Llama-3 8B takes 15,316 MB of memory on my Mac Studio before even starting the chat (roughly 8 billion parameters at 2 bytes each in 16-bit precision). This means your 16 GB MacBook Air starts swapping memory to disk and the speed drops significantly.

Try quantizing the model (argument -q4 or -q8) when running sillm.chat. On my MacBook Air M2 16GB (sounds like the same config) I'm getting 9.20 tok/sec with Llama-3-8b-instruct quantized to 8-bit with under 8 GB of memory used.
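
A minimal sketch of a quantized run, assuming the weights were downloaded to a local directory (the path below is a placeholder):

```sh
# Quantize the weights to 4-bit on the fly so an 8B model fits comfortably in 16 GB of RAM
python -m sillm.chat /path/to/Meta-Llama-3-8B-Instruct -q4
```

Using -q8 instead keeps more precision at the cost of roughly twice the weight memory.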

FYI, the reference video with the MacBook Air uses the Gemma-2B-it model, which is small and fast; larger models (7B and 8B) will run slower.

kylewadegrove commented 4 months ago

That seems to be it: quantized to 4-bit, it had reasonable performance.

magnusviri commented 4 months ago

Duh, I can't believe I forgot to mention my Llama3 8b was 4-bit quantized...