Blaizzy / mlx-vlm

MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.

unable to run paligemma #38

Closed nischalj10 closed 1 month ago

nischalj10 commented 3 months ago

I am trying to run the following code, but it gives an error. Please assist!

import mlx.core as mx
from mlx_vlm import load, generate

model_path = "google/paligemma-3b-mix-448"
model, processor = load(model_path)

print(processor)

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nWhat are these?"}],
    tokenize=False,
    add_generation_prompt=True,
)

output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
Traceback (most recent call last):
  File "/Users/namanjain/Desktop/repos/local-recall/models.py", line 15, in <module>
    output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/utils.py", line 809, in generate
    logits, cache = model(input_ids, pixel_values, mask)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 139, in __call__
    input_embeddings, final_attention_mask_4d = self.get_input_embeddings(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 82, in get_input_embeddings
    self._prepare_inputs_for_multimodal(
  File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 115, in _prepare_inputs_for_multimodal
    final_embedding[image_mask_expanded] = scaled_image_features.flatten()
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
ValueError: NumPy boolean array indexing assignment cannot assign 2097152 input values to the 2099200 output values where the mask is true
Blaizzy commented 3 months ago

Paligemma doesn't use chat_template.

Just pass the text string as is to generate.

https://github.com/Blaizzy/mlx-vlm/issues/33#issuecomment-2135392535
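
For reference, a minimal sketch of the suggested usage (no chat template; the image URL and prompt text here are just examples, reusing the same load/generate API shown above):

from mlx_vlm import load, generate

model, processor = load("google/paligemma-3b-mix-448")

# Pass a plain, single-turn instruction; no chat template is applied.
prompt = "Describe this image"

output = generate(
    model,
    processor,
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    prompt,
    verbose=False,
)
print(output)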

nischalj10 commented 3 months ago

Thanks for the quick reply. I updated the code as suggested.

import mlx.core as mx
from mlx_vlm import load, generate

model_path = "google/paligemma-3b-mix-448"
model, processor = load(model_path)

output = generate(model, processor, "/Users/namanjain/app-data/local-recall/screenshots/1717766288971.png", prompt="describe this screenshot")

print(output)

It takes forever to generate any output, which is not the case with much larger models on my M2 chip. Also, the max_tokens param doesn't seem to be configurable, and the model generates very few tokens.

Blaizzy commented 3 months ago

Could you share your setup specs?

also, the max_tokens param is not configurable and somehow the model generates very few tokens.

It is configurable. By default it's set to 100, but you can increase it by passing the max_tokens argument to the generate function.
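
For example (a hedged sketch reusing the model and processor loaded earlier; the image path and token count are placeholders):

output = generate(
    model,
    processor,
    "/path/to/image.png",          # hypothetical local image path
    prompt="Describe this image",
    max_tokens=500,                # raise the default of 100
)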

nischalj10 commented 3 months ago

Here's my call to generate:

output = generate(model, processor, "/Users/namanjain/app-data/local-recall/screenshots/1717766288971.png", prompt="elaborately describe this screenshot. what app or website url is this on?", max_tokens=500)

But the model's response is just one word.

My specs are an M2 Air with 8 GB RAM. However, the GPU isn't fully utilised during inference, and there's enough capacity to run the model.

(screenshot of GPU utilisation attached)
Blaizzy commented 3 months ago

A few things to note about Paligemma:

  1. Paligemma is not a chat model. It takes single-turn, simple instructions and commands (i.e., detect cat, segment cat, Describe this image, What does the image show). Read more here.
  2. You are running the full-precision model, which even on an M3 Max runs at about 5 tokens/s for input and 25 tokens/s for generation. For faster inference, you can use the 8-bit quant available in the MLX-community repo: https://huggingface.co/mlx-community/paligemma-3b-mix-448-8bit.
  3. I would recommend the 224x224 model for your machine instead of the 448x448, because the higher the resolution, the more memory it needs to run (see the short loading sketch below): https://huggingface.co/mlx-community/paligemma-3b-mix-224-8bit.

Recommended reading: https://huggingface.co/blog/paligemma
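
For example, a short loading sketch for the quantized 224x224 checkpoint linked above (the image path and prompt are placeholders):

from mlx_vlm import load, generate

# 8-bit, 224x224 variant from the MLX-community repo
model, processor = load("mlx-community/paligemma-3b-mix-224-8bit")

output = generate(
    model,
    processor,
    "/path/to/screenshot.png",      # hypothetical local image
    prompt="Describe this image",
    max_tokens=200,
)
print(output)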

pedrocolon93 commented 1 week ago

@Blaizzy were you able to get these models to output bounding boxes? If I prompt either of them with something like detect cat (on a sample image with 2 cats), it either gives or 2 (which is right, but not the bounding box). I saw that elsewhere you were having issues adding the model into the library.

Blaizzy commented 1 week ago

Yes, the model works well for captions and counting objects.

However, there is indeed still a bug when it comes to object detection and segmentation.

Sometimes it works and other times it doesn't. I haven't managed to pinpoint the issue.

(screenshot attached)
Blaizzy commented 1 week ago

Segmentation seems to work better with lower temperatures.
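
For anyone experimenting with the detection/segmentation prompts, a hedged sketch follows. It assumes generate accepts a temp keyword for the sampling temperature (as in mlx-lm; the exact parameter name may differ between versions), and the image path is hypothetical:

output = generate(
    model,
    processor,
    "/path/to/cats.jpg",   # hypothetical image with two cats
    prompt="detect cat",   # Paligemma detection task prompt
    temp=0.1,              # assumed keyword; lower temperature per the note above
    max_tokens=100,
)
# When detection works, the output typically contains <loc....> location tokens
# that encode the bounding box coordinates.
print(output)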