MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
The easiest way to get started is to install the mlx-vlm
package using pip:
pip install mlx-vlm
Generate output from a model using the CLI:
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temp 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg
Launch a chat interface using Gradio:
python -m mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
Here's an example of how to use MLX-VLM in a Python script:
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."
# Apply chat template
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=len(image)
)
# Generate output
output = generate(model, processor, image, formatted_prompt, verbose=False)
print(output)
MLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.
The following models support multi-image chat:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=len(images)
)
output = generate(model, processor, images, formatted_prompt, verbose=False)
print(output)
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg
These examples demonstrate how to use multiple images with MLX-VLM for more complex visual reasoning tasks.
MLX-VLM supports fine-tuning models with LoRA and QLoRA.
To learn more about LoRA, please refer to the LoRA.md file.