I'm on an M3 Max 36GB, and I had exactly the same problem when using mlx-community/Llama-3.2-11B-Vision-Instruct-4bit. It used a large amount of memory even for a small image (the demo image of the two cats from the README), and it became extremely slow as more tokens accumulated.
Qwen2_VL handled the same image just fine.
@Blaizzy, any chance you could look into this? It makes the model pretty much unusable. Thanks so much!
Hey @chigkim
Thanks for reporting this.
This is an issue with the VLM architecture design. It uses cross-attention instead of an MLP projector, which is why it's so heavy and slow.
To mitigate this, you can use `--resize-shape`.
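To give a rough sense of why resizing helps: every generated token attends over all of the image's vision tokens in each cross-attention layer, so the number of image tiles/patches drives both memory use and per-token speed. The sketch below is purely illustrative; the tile size, patch size, and class-token detail are assumptions made for the arithmetic, not values read from the model's config.

```python
import math

# Illustrative arithmetic only -- the tile/patch sizes below are assumptions,
# not values taken from the Llama 3.2 Vision config.
def approx_vision_tokens(width, height, tile=560, patch=14):
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    tokens_per_tile = (tile // patch) ** 2 + 1  # +1 for a class token (assumed)
    return tiles * tokens_per_tile

print(approx_vision_tokens(1120, 1120))  # larger image -> 4 tiles -> ~6.4k vision tokens
print(approx_vision_tokens(224, 224))    # after --resize-shape 224 224 -> 1 tile -> ~1.6k tokens
```

Fewer vision tokens means less cross-attention work per generated token and a smaller cache to keep around.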
CLI:

```bash
python -m mlx_vlm.generate --model mlx-community/Llama-3.2-11B-Vision-Instruct-4bit --max-tokens 1000 --temp 0.0 --image "$1" --prompt "Describe the image in a comprehensive manner." --resize-shape 224 224
```
Programmatically:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model and its config
model_path = "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare the input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply the chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Resize the image before it reaches the vision encoder
kwargs = {"resize_shape": (224, 224)}

# Generate output
output = generate(model, processor, image, formatted_prompt, verbose=False, **kwargs)
print(output)
```
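If it helps with the memory comparison, MLX can report its own peak memory after a run. This is just a sketch; the exact function name differs between mlx versions (older releases expose it under `mx.metal`, newer ones at the top level), so treat the lookup below as an assumption and check the version you have installed.

```python
import mlx.core as mx

# After generate(...) has finished, report MLX's peak memory in GB.
peak_fn = getattr(mx, "get_peak_memory", None) or mx.metal.get_peak_memory
print(f"Peak MLX memory: {peak_fn() / 1e9:.2f} GB")
```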
Thanks for your response. However, when I feed the same image to the q4 model on Ollama (which keeps the embedding layers in fp16), it uses significantly less memory and runs much faster.
Alright.
Please share the Ollama results and I will look into what we can optimise. 👌🏽
Sorry it took me a while to get around to this. Here are the numbers:

| Runtime | Model | Memory | Speed |
| --- | --- | --- | --- |
| Ollama | llama3.2-vision:11b-instruct-q4_K_M | 12 GB | 36.72 tokens/sec |
| mlx_vlm | mlx-community/Llama-3.2-11B-Vision-Instruct-4bit | 56 GB | 0.473 tokens/sec |

The mlx_vlm run used `--resize-shape 224 224`; Ollama doesn't have a resize option.
Here's the image I used.
If I run the following with the image below, Python uses around 56GB of memory, and generation slows down dramatically as more tokens are produced. Is this normal? It seems like very high usage for an 11B 4-bit model.

```bash
python -m mlx_vlm.generate --model mlx-community/Llama-3.2-11B-Vision-Instruct-4bit --max-tokens 1000 --temp 0.0 --image "$1" --prompt "Describe the image in a comprehensive manner."
```

I'm on an M3 Max 64GB with the GPU wired-memory limit raised to 56GB:

```bash
sudo sysctl iogpu.wired_limit_mb=57344
```
Thanks for your help!