Blaizzy / mlx-vlm

MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.

Error running inference script in readme using paligemma-3b-mix-448-8bit #33

Closed · mobile-appz closed this 1 month ago

mobile-appz commented 1 month ago

mlx-vlm Version: 0.0.7
mlx Version: 0.14.0

Great work with this. It's working well, except when using PaliGemma with the supplied inference Python script. I get an error when running the script from the readme with the paligemma-3b-mix-448-8bit model, as in the code below:

import mlx.core as mx
from mlx_vlm import load, generate

model_path = "mlx-community/paligemma-3b-mix-448-8bit"

model, processor = load(model_path)

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nWhat are these?"}],
    tokenize=False,
    add_generation_prompt=True,
)

output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
print(output)

The "CLI" and "Chat UI with Gradio" inference steps in the readme are working correctly, with the model set as "mlx-community/paligemma-3b-mix-448-8bit". I'm using Conda and MLX and MLX-VLM has been installed using PIP.

The error is as follows:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/mlx/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "mlx-vlm-test.py", line 15, in <module>
    output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_vlm/utils.py", line 809, in generate
    logits, cache = model(input_ids, pixel_values, mask)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 139, in __call__
    input_embeddings, final_attention_mask_4d = self.get_input_embeddings(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 82, in get_input_embeddings
    self._prepare_inputs_for_multimodal(
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 115, in _prepare_inputs_for_multimodal
    final_embedding[image_mask_expanded] = scaled_image_features.flatten()
ValueError: NumPy boolean array indexing assignment cannot assign 2097152 input values to the 2099200 output values where the mask is true
Blaizzy commented 1 month ago

Thanks!

The issue you are facing is that PaliGemma doesn't use a chat template or a manual image token.

You can just pass the text directly.

prompt = "what are these?"

I will add examples for all models soon. Or if you want you can make a PR :)
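
For reference, here is a minimal sketch of the adjusted script: the snippet from the original report with the chat-template call removed and the prompt passed as plain text (same model path and image URL as above). As a side note, the counts in the traceback look consistent with this diagnosis, since the mask expects exactly 2048 more values than there are image features, i.e. one extra image-token position, which would come from the manually inserted <image> tag; that reading is an inference from the numbers rather than something stated in the thread.

from mlx_vlm import load, generate

model_path = "mlx-community/paligemma-3b-mix-448-8bit"
model, processor = load(model_path)

# PaliGemma takes the question as plain text: no chat template, no manual <image> token.
prompt = "what are these?"

output = generate(
    model,
    processor,
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    prompt,
    verbose=False,
)
print(output)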

mobile-appz commented 1 month ago

Thank you very much for your help and quick response. I can confirm that this works now perfectly.

Blaizzy commented 1 month ago

Most welcome! 🤗