Phi-3.5-vision-instruct-bf16 fails

jrp2014 commented 2 months ago

Running this script:

import mlx.core as mx
from mlx_vlm import load, generate

import os
from pathlib import Path

# model_path = "mlx-community/llava-1.5-7b-4bit"
#model_path = "mlx-community/llava-v1.6-mistral-7b-8bit"
#model_path = "mlx-community/llava-v1.6-34b-8bit"
model_path = "mlx-community/Phi-3.5-vision-instruct-bf16"
model, processor = load(model_path)

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nProvide a formal caption and keywords for this image"}],
    tokenize=False,
    add_generation_prompt=True,
)

picpath = "/Users/jrp/Pictures/Processed"
pics = sorted(Path(picpath).iterdir(), key=os.path.getmtime, reverse=True)
pic = str(pics[0])
print(pic)

#output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=True)
output = generate(model, processor, pic, prompt, max_tokens=200, verbose=True)

print(output)

... all's well with model_path = "mlx-community/llava-v1.6-mistral-7b-8bit", but with model_path = "mlx-community/Phi-3.5-vision-instruct-bf16" I get

python mytest.py
Fetching 13 files: 100%|████████████████████████| 13/13 [00:00<00:00, 173098.26it/s]
The repository for /Users/jrp/.cache/huggingface/hub/models--mlx-community--Phi-3.5-vision-instruct-bf16/snapshots/a2c307b346fcff186a6250ea4c0853688db7b633 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//Users/jrp/.cache/huggingface/hub/models--mlx-community--Phi-3.5-vision-instruct-bf16/snapshots/a2c307b346fcff186a6250ea4c0853688db7b633.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/Users/jrp/Pictures/Processed/20240826-153133_DSC02241.jpg
==========
Image: /Users/jrp/Pictures/Processed/20240826-153133_DSC02241.jpg 

Prompt: <|user|>
<image>
Provide a formal caption and keywords for this image<|end|>
<|assistant|>

Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/mytest.py", line 26, in <module>
    output = generate(model, processor, pic, prompt, max_tokens=200, verbose=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 908, in generate
    input_ids, pixel_values, mask = prepare_inputs(
                                    ^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 702, in prepare_inputs
    inputs = processor(prompt, image, return_tensors="np")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/.cache/huggingface/modules/transformers_modules/a2c307b346fcff186a6250ea4c0853688db7b633/processing_phi3_v.py", line 377, in __call__
    inputs = self._convert_images_texts_to_inputs(image_inputs, text, padding=padding, truncation=truncation, max_length=max_length, return_tensors=return_tensors)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/.cache/huggingface/modules/transformers_modules/a2c307b346fcff186a6250ea4c0853688db7b633/processing_phi3_v.py", line 435, in _convert_images_texts_to_inputs
    assert len(unique_image_ids) == len(images), f"total images must be the same as the number of image tags, got {len(unique_image_ids)} image tags and {len(images)} images"
AssertionError: total images must be the same as the number of image tags, got 0 image tags and 1 images