Blaizzy / mlx-vlm

MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.
MIT License

Too sensitive to prompting #23

Closed cmgzy closed 1 month ago

cmgzy commented 1 month ago

I found that some VLMs are too sensitive to the prompt. For example, when I use mlx-community/llava-1.5-7b-4bit, the image is: image

python -m mlx_vlm.generate --model mlx-community/llava-1.5-7b-4bit --prompt "how many dogs in the image?" --image "/Users/xxx/Pictures/xx.jpg" --max-tokens 100 --temp 0.0

the response is correct: There are nine dogs in the image.

but if I change the prompt to "How many dogs in the image?":

python -m mlx_vlm.generate --model mlx-community/llava-1.5-7b-4bit --prompt "How many dogs in the image?" --image "/Users/xxx/Pictures/xx.jpg" --max-tokens 100 --temp 0.0

the response is wrong: There are seven dogs in the image. I also tried llava-llama-3-8b-v1_1-8bit, llava-phi-3-mini-8bit, and idefics2-8b-chatty-8bit with both "how..." and "How...", but the responses were wrong every time.
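
For reference, a minimal script to run both phrasings back to back against the CLI shown above (the image path is a placeholder for a local test image):

import subprocess

IMAGE = "/Users/xxx/Pictures/xx.jpg"  # placeholder: substitute your own test image

for prompt in ["how many dogs in the image?", "How many dogs in the image?"]:
    # Invoke the same CLI as above, once per prompt variant
    result = subprocess.run(
        [
            "python", "-m", "mlx_vlm.generate",
            "--model", "mlx-community/llava-1.5-7b-4bit",
            "--prompt", prompt,
            "--image", IMAGE,
            "--max-tokens", "100",
            "--temp", "0.0",
        ],
        capture_output=True,
        text=True,
    )
    print(prompt, "->", result.stdout.strip())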

Blaizzy commented 1 month ago

Thanks for sharing!

I will look into this 👌🏽

cmgzy commented 1 month ago

FYI, I found that https://huggingface.co/liuhaotian/llava-v1.6-34b does very well on the dog-count test. I tested it on this demo: https://llava.hliu.cc/ I made up several clips with varying numbers of dogs and prompted with "How many dogs are there in the image? Answer the question using a single word or phrase.", and it always answered correctly.

image
Blaizzy commented 1 month ago

Awesome!

That model is based on the llava-next architecture which we don't support at the moment.

Would you like to make a PR to add it?

Blaizzy commented 1 month ago

Did you test the transformers versions of the previous models you reported?

cmgzy commented 1 month ago

Did you test the transformers versions of the previous models you reported?

Well, it's kind of difficult to do that since my laptop only has 16 GB of RAM. The transformers versions run too slowly without heavy quantization...

Blaizzy commented 1 month ago

No worries. I will run those tests in a few

Blaizzy commented 1 month ago

@cmgzy I ran some tests.

And it turns out the models give accurate answers if you run them in full precision or in 8bit.

The problem is the mlx 4bit quantisation. The latest mlx release (v0.13.0) fixes this, and the new 4bit model answers correctly.

I'm uploading it and also adding 8bit.

Please give it a try and let me know if you find any other issues.
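
A quick way to confirm which versions are installed before retrying (standard library only):

from importlib.metadata import version

# Package names as used in pip: mlx and mlx-vlm
print(version("mlx"), version("mlx-vlm"))  # expect mlx >= 0.13.0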

cmgzy commented 1 month ago

@Blaizzy For mlx-community/llava-1.5-7b-8bit, the model files are not ready yet. For mlx-community/llava-1.5-7b-4bit, it cannot run with mlx==0.13.0 and mlx-vlm==0.0.4. Error info:

Traceback (most recent call last):
  File "/Users/chenmi/miniforge3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/chenmi/miniforge3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/generate.py", line 107, in <module>
    main()
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/generate.py", line 92, in main
    output = generate(
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/utils.py", line 758, in generate
    logits, cache = model(input_ids, pixel_values)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/llava.py", line 134, in __call__
    input_embddings = self.get_input_embeddings(input_ids, pixel_values)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/llava.py", line 77, in get_input_embeddings
    *_, hidden_states = self.vision_tower(
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/vision.py", line 220, in __call__
    return self.vision_model(x, output_hidden_states)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/vision.py", line 193, in __call__
    x = self.embeddings(x)
  File "/Users/chenmi/miniforge3/lib/python3.10/site-packages/mlx_vlm/models/llava/vision.py", line 176, in __call__
    embeddings += self.position_embedding.weight
ValueError: Shapes (1,577,1024) and (577,128) cannot be broadcast.
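
The error itself is plain broadcasting: the trailing dimensions (1024 vs 128) don't line up, so the position embedding can't be added to the patch embeddings. A toy illustration of the same shape rule, using NumPy just for the shapes (not the actual model weights):

import numpy as np

# Same shapes as in the traceback above
patch_embeddings = np.zeros((1, 577, 1024))   # (batch, tokens, hidden)
position_embedding = np.zeros((577, 128))     # trailing dim 128 != 1024, so it cannot broadcast

try:
    patch_embeddings + position_embedding
except ValueError as e:
    print(e)  # NumPy raises its own "could not be broadcast" error for the same reason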

Blaizzy commented 1 month ago

Could you install from source?

I'm working on this branch:

https://github.com/Blaizzy/mlx-vlm/tree/pc/quantise-irregular

Just clone it and:

pip install -e .

I haven't updated the pip package yet.

cmgzy commented 1 month ago

Could you install from source?

I'm working on this branch:

https://github.com/Blaizzy/mlx-vlm/tree/pc/quantise-irregular

Just clone it and:

pip install -e .

I haven't updated the pip package yet.

It works for mlx-community/llava-1.5-7b-4bit

Blaizzy commented 1 month ago

Awesome!

I will update all 4bit models with the latest mlx core 👌🏽

Blaizzy commented 1 month ago

And I will update the PyPI package today as well.

Is there anything else you want me to address?

cmgzy commented 1 month ago

And I will update the PyPI package today as well.

Is there anything else you want me to address?

There is.

For example:

python -m mlx_vlm.generate --model mlx-community/llava-llama-3-8b-v1_1-8bit \
  --prompt "How many dogs are there in the image? Answer the question using a single word or phrase." \
  --image "/Users/chenmi/Pictures/xx.jpg" \
  --max-tokens 512 --temp 0.2

The response is weird...

Image: /Users/chenmi/Pictures/xx.jpg

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

How many dogs are there in the image? Answer the question using a single word or phrase.<|eot_id|><|start_header_id|>assistant<|end_header_id|> 6<|eot_id|><|eot_id|>[... long run of repeated <|eot_id|> tokens ...] all the dogs are black and white<|eot_id|>[...] except for 2<|eot_id|>[... the rest of the output is hundreds of repeated <|eot_id|> tokens until max-tokens is reached ...]
Blaizzy commented 1 month ago

Just patched the tokenizer

Can you redownload it from the hub and let me know if the issue persists?
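
For context, the wall of <|eot_id|> tokens above is the classic symptom of the generation loop not treating <|eot_id|> as a stop token, which is the kind of thing the tokenizer patch is meant to address. A purely illustrative sketch of that stop check (hypothetical names, not mlx-vlm's actual loop):

from typing import Callable, Iterator, List

def collect_until_eot(
    next_token: Callable[[], Iterator[int]],  # hypothetical sampler, yields token ids one at a time
    eot_token_id: int,                        # id of "<|eot_id|>" in the model's tokenizer
    max_tokens: int = 512,
) -> List[int]:
    """Collect sampled tokens, stopping at the end-of-turn token instead of running to max_tokens."""
    out: List[int] = []
    for token in next_token():
        if token == eot_token_id or len(out) >= max_tokens:
            break
        out.append(token)
    return out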

Blaizzy commented 1 month ago

Regarding the answer the model gave:

I'm yet to update the quantisation. I will ping you once I update it.

cmgzy commented 1 month ago

Just patched the tokenizer

Can you redownload it from the hub and let me know if the issue persists?

It's fixed!

Blaizzy commented 1 month ago

Fantastic!

I'm uploading llava 8bit, then I will patch the other models 👌🏽

cmgzy commented 1 month ago

Did you test the transformers versions of the previous models you reported?

Could you share sample code for testing the transformers versions using Apple Silicon GPUs? That way I can help you test other models going forward. Thx!

Blaizzy commented 1 month ago

Here you go:

from transformers import AutoProcessor, AutoModelForPreTraining
from PIL import Image
import requests
import torch

model_id = "llava-hf/llava-1.5-7b-hf"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Run in float16 on the Apple Silicon GPU (falls back to CPU if MPS is unavailable)
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = AutoModelForPreTraining.from_pretrained(model_id, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)

# llava-1.5 uses the USER/ASSISTANT chat format with an <image> placeholder
prompt = "USER: <image>\nCaption this image ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch.float16)
input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
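
To reproduce the prompt-sensitivity check from the top of this issue with the same setup, you can loop over the two phrasings (this reuses model, processor and device from the snippet above; the image path is a placeholder):

# Reuses model, processor, device and torch from the snippet above
dog_image = Image.open("/Users/xxx/Pictures/xx.jpg")  # placeholder path to the dog test image

for question in ["how many dogs in the image?", "How many dogs in the image?"]:
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(text=prompt, images=dog_image, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    answer = processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    print(f"{question!r} -> {answer.strip()}")
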
cmgzy commented 1 month ago

Here you go:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "mlx-community/llava-1.5-7b-4bit"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Run in float16 on the Apple Silicon GPU (falls back to CPU if MPS is unavailable)
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)

# llava-1.5 uses the USER/ASSISTANT chat format with an <image> placeholder
prompt = "USER: <image>\nCaption this image ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch.float16)
input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Which model class should I use to load "mlx-community/llava-1.5-7b-4bit"? It probably isn't PaliGemmaForConditionalGeneration...

Blaizzy commented 1 month ago

Sorry, fixed it ✅

Use AutoModelForPreTraining