Blaizzy / mlx-vlm

MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.

Very High Memory Usage for Llama-3.2-11B-Vision-Instruct-4bit #100

Closed. chigkim closed this issue 2 weeks ago

chigkim commented 1 month ago

If I run the following command with the image below, Python uses around 56GB of memory, and generation speed drops dramatically as more tokens are generated. Is this normal? That seems like very high usage for an 11B 4-bit model.

python -m mlx_vlm.generate --model mlx-community/Llama-3.2-11B-Vision-Instruct-4bit --max-tokens 1000 --temp 0.0 --image "$1" --prompt "Describe the image in a comprehensive manner."

[attached image: boat]

I'm on M3 Max 64GB with 56GB GPU limit.

sudo sysctl iogpu.wired_limit_mb=57344

Thanks for your help!

microflyer commented 1 month ago

I'm on an M3 Max 36GB, and I had exactly the same problem with mlx-community/Llama-3.2-11B-Vision-Instruct-4bit. It used a large amount of memory even for a small image (the demo image of two cats in the README), and it became extremely slow as more tokens accumulated.
Qwen2_VL handled the same image pretty well.
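
For reference, Qwen2-VL can be run through the same CLI for a side-by-side comparison. The checkpoint name below is an assumption; substitute whichever mlx-community Qwen2-VL repo you actually tested:

python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-7B-Instruct-4bit --max-tokens 1000 --temp 0.0 --image demo.jpg --prompt "Describe the image in a comprehensive manner."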

chigkim commented 2 weeks ago

@Blaizzy, any chance you could look into this? It makes the model pretty unusable. Thanks so much!

Blaizzy commented 2 weeks ago

Hey @chigkim

Thanks for reporting this.

This is an issue with the VLM architecture design.

It uses cross-attention instead of an MLP projector, which is why it's so heavy and slow.
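
For a rough sense of scale, here is a back-of-envelope sketch. All dimensions below are assumptions for illustration only, not the model's actual configuration: with cross-attention, every text position attends to every vision token in each cross-attention layer, so the working set grows with both image resolution and the number of generated tokens, whereas an MLP projector contributes a fixed number of image tokens to the prompt once.

# Back-of-envelope sketch -- all numbers are assumptions, not the real config.
num_cross_attn_layers = 8     # assumed count of cross-attention layers
num_heads = 32                # assumed attention heads
vision_tokens = 6400          # assumed vision tokens for a large, multi-tile image
text_tokens = 1000            # --max-tokens in the report above
bytes_per_score = 4           # fp32 attention scores

# If the text-to-image attention scores are materialized in one pass,
# the score tensors alone take roughly:
score_bytes = num_cross_attn_layers * num_heads * text_tokens * vision_tokens * bytes_per_score
print(f"~{score_bytes / 1e9:.1f} GB of attention scores")  # ~6.6 GB with these numbers

# Shrinking the image (e.g. --resize-shape 224 224) cuts vision_tokens,
# which is why it reduces memory use so much.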

To improve this, you can use --resize-shape.

CLI

python -m mlx_vlm.generate --model mlx-community/Llama-3.2-11B-Vision-Instruct-4bit --max-tokens 1000 --temp 0.0 --image "$1" --prompt "Describe the image in a comprehensive manner." --resize-shape 224 224

Programmatically

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Resize the input image to 224x224 before encoding to cut memory use
kwargs = {"resize_shape": (224, 224)}

# Generate output
output = generate(model, processor, image, formatted_prompt, verbose=False, **kwargs)
print(output)
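
If your installed version doesn't accept resize_shape, a minimal alternative sketch is to downscale the image yourself with Pillow before passing it in. The file names and the 224x224 target below are placeholders, and the generate call follows the same signature as the example above:

from PIL import Image

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Downscale a local copy of the image before it reaches the vision encoder.
img = Image.open("boat.jpg").convert("RGB")
img.thumbnail((224, 224))   # keeps aspect ratio, caps the longer side at 224
img.save("boat_small.jpg")

formatted_prompt = apply_chat_template(
    processor, config, "Describe this image.", num_images=1
)
output = generate(model, processor, ["boat_small.jpg"], formatted_prompt, verbose=False)
print(output)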

chigkim commented 2 weeks ago

Thanks for your response. However, feeding the same image to the q4 model (with fp16 embedding layers) on Ollama uses significantly less memory and is much faster.

Blaizzy commented 2 weeks ago

Alright.

Please share the Ollama results and I will look into what we can optimise. 👌🏽

chigkim commented 2 weeks ago

Sorry, it took a while for me to get around to this. Ollama with llama3.2-vision:11b-instruct-q4_K_M uses 12GB of memory and generates at 36.72 tokens/sec. mlx_vlm with mlx-community/Llama-3.2-11B-Vision-Instruct-4bit uses 56GB and generates at 0.473 tokens/sec, and that's with --resize-shape 224 224. Ollama doesn't have a resize option. Here's the image I used.

[attached image: boat]
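
For anyone who wants to reproduce the comparison on the MLX side, here is a rough measurement sketch. The peak-memory helpers and the max_tokens/temp keyword names are assumptions that may differ across MLX / mlx_vlm versions, and "boat.jpg" stands in for the attached image:

import time

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

prompt = apply_chat_template(
    processor, config, "Describe the image in a comprehensive manner.", num_images=1
)

mx.metal.reset_peak_memory()   # helper name may vary with MLX version
start = time.perf_counter()
# verbose=True should also print prompt/generation tokens-per-sec
output = generate(model, processor, ["boat.jpg"], prompt,
                  max_tokens=1000, temp=0.0, verbose=True,
                  resize_shape=(224, 224))
elapsed = time.perf_counter() - start

print(f"wall time: {elapsed:.1f}s")
print(f"peak GPU memory: {mx.metal.get_peak_memory() / 1e9:.1f} GB")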