Blaizzy / mlx-vlm

MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.
MIT License

Some picture captioning results with various models, for reference #107

Open jrp2014 opened 3 weeks ago

jrp2014 commented 3 weeks ago

I have tried to get several different models to generate captions, keywords, etc., for this image: [image]

Some of the results are worse (cheesier, less accurate) than others. Some models have quirks / deprecations.
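
The deprecation that shows up in the LLaVA runs below asks for `patch_size` and `vision_feature_select_strategy` to be set on the processor. A minimal sketch of doing that defensively, assuming the processor accepts plain attribute assignment as the warning suggests. The helper name `fill_processor_attrs` is hypothetical, and the values 14 and "default" are assumptions that should be checked against the model's own config:

```python
from types import SimpleNamespace


def fill_processor_attrs(processor, patch_size, strategy):
    """Set the two attributes named in the warning, but only if they are unset.

    Returns the list of attribute names that were filled in.
    """
    filled = []
    wanted = {
        "patch_size": patch_size,
        "vision_feature_select_strategy": strategy,
    }
    for name, value in wanted.items():
        # Treat a missing attribute and an explicit None the same way.
        if getattr(processor, name, None) is None:
            setattr(processor, name, value)
            filled.append(name)
    return filled


# Stand-in object for illustration only; in the script this would be the
# `processor` returned by load(model_path). 14 and "default" are assumed values.
demo = SimpleNamespace(patch_size=None)
print(fill_processor_attrs(demo, patch_size=14, strategy="default"))
```

Existing values are left alone, so calling it on a processor that is already configured should be harmless. Whether this actually silences the warning under mlx-vlm has not been verified here.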

(mlx) ➜  vlm python mytest.py
Model:  mlx-community/llava-v1.6-34b-8bit
Fetching 17 files: 100%|██████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 37351.06it/s]
Fetching 17 files: 100%|███████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 8201.42it/s]
Image:  /Users/x/Pictures/Processed/20241026-174148_L1010280.jpg
==========
Image: /Users/x/Pictures/Processed/20241026-174148_L1010280.jpg 

Prompt: <|im_start|>user
<image>
Provide a formal caption, description and comma-separated keywords or tags for this image, suitable for microstock<|im_end|>
<|im_start|>assistant

Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Caption: "Historic town square with traditional architecture and a clock tower."

Description: The image captures a quaint town square at dusk. The square is dominated by a large, white building with a prominent clock tower. The architecture of the building is traditional, featuring a steep roof and a central window with a pointed arch. The clock tower stands tall, its face illuminated against the fading light of the day.

In the foreground, a circular fountain serves as a focal point, surrounded by a few cars parked along the street. The cobblestone street adds to the charm of the scene, leading the eye towards the white building and the clock tower. The sky above is a canvas of warm hues, suggesting the sun has recently set.

Keywords: town square, historic architecture, clock tower, cobblestone street, dusk, traditional building, fountain, cars, street, town, cityscape, European architecture, old town, heritage site, urban landscape, evening, sunset, travel destination, tourism, landmark, city life, public space, community, culture, history, preservation, urban design, architecture, city, town, European, heritage, tourism, travel, culture, history
==========
Prompt: 0.819 tokens-per-sec
Generation: 5.499 tokens-per-sec
Caption: "Historic town square with traditional architecture and a clock tower."

Description: The image captures a quaint town square at dusk. The square is dominated by a large, white building with a prominent clock tower. The architecture of the building is traditional, featuring a steep roof and a central window with a pointed arch. The clock tower stands tall, its face illuminated against the fading light of the day.

In the foreground, a circular fountain serves as a focal point, surrounded by a few cars parked along the street. The cobblestone street adds to the charm of the scene, leading the eye towards the white building and the clock tower. The sky above is a canvas of warm hues, suggesting the sun has recently set.

Keywords: town square, historic architecture, clock tower, cobblestone street, dusk, traditional building, fountain, cars, street, town, cityscape, European architecture, old town, heritage site, urban landscape, evening, sunset, travel destination, tourism, landmark, city life, public space, community, culture, history, preservation, urban design, architecture, city, town, European, heritage, tourism, travel, culture, history
(mlx) ➜  vlm python mytest.py
Model:  mlx-community/pixtral-12b-8bit
Fetching 11 files: 100%|██████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 51898.02it/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 15719.71it/s]
Image:  /Users/x/Pictures/Processed/20241026-174148_L1010280.jpg
==========
Image: /Users/x/Pictures/Processed/20241026-174148_L1010280.jpg 

Prompt: <s>[INST][IMG]Provide a formal caption, description and comma-separated keywords or tags for this image, suitable for microstock[/INST]
**Caption:**
A picturesque town square at dusk, featuring a historic building with a prominent statue and clock tower.

**Description:**
The image captures a charming town square during the evening hours. The central focus is a historic building with a white facade, adorned with a large arched window and a balcony. In front of the building stands a statue on a pedestal, surrounded by a circular garden. To the right of the building is a tall clock tower with a stone exterior. The square is paved with cobblestones and includes a roundabout with a few parked cars. The sky is painted with hues of blue and orange, indicating the setting sun.

**Keywords:**
town square, historic building, statue, clock tower, evening, sunset, cobblestone, roundabout, cars, pedestrians, architecture, urban, heritage, European, street, sky, dusk
==========
Prompt: 3.145 tokens-per-sec
Generation: 20.908 tokens-per-sec
**Caption:**
A picturesque town square at dusk, featuring a historic building with a prominent statue and clock tower.

**Description:**
The image captures a charming town square during the evening hours. The central focus is a historic building with a white facade, adorned with a large arched window and a balcony. In front of the building stands a statue on a pedestal, surrounded by a circular garden. To the right of the building is a tall clock tower with a stone exterior. The square is paved with cobblestones and includes a roundabout with a few parked cars. The sky is painted with hues of blue and orange, indicating the setting sun.

**Keywords:**
town square, historic building, statue, clock tower, evening, sunset, cobblestone, roundabout, cars, pedestrians, architecture, urban, heritage, European, street, sky, dusk
(mlx) ➜  vlm python mytest.py
Model:  mlx-community/llava-v1.6-mistral-7b-8bit
Fetching 12 files: 100%|██████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 61832.49it/s]
Fetching 12 files: 100%|██████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 18634.45it/s]
Image:  /Users/x/Pictures/Processed/20241026-174148_L1010280.jpg
==========
Image: /Users/x/Pictures/Processed/20241026-174148_L1010280.jpg 

Prompt: [INST] <image>
Provide a formal caption, description and comma-separated keywords or tags for this image, suitable for microstock [/INST]
Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Formal Caption: "Historic European Town Square at Dusk with Old Building and Statue."

Description: "This image captures the serene ambiance of a historic European town square at dusk. The square is dominated by a large, old building with a distinctive pointed roof and a clock tower. The building's facade is adorned with a statue, adding to its grandeur. The square is paved with cobblestones and features a fountain in the center, surrounded by a circular driveway. The sky is painted with hues of orange and pink, indicating the setting sun. The scene is further enhanced by the presence of a few cars parked around the square, and a few pedestrians can be seen walking around the area. The overall atmosphere is peaceful and picturesque, typical of a charming European town."

Keywords or Tags: "historic, europe, town square, dusk, old building, pointed roof, clock tower, statue, cobblestone, fountain, circular driveway, setting sun, cars, pedestrians, charming, picturesque." 
==========
Prompt: 30.116 tokens-per-sec
Generation: 36.198 tokens-per-sec
Formal Caption: "Historic European Town Square at Dusk with Old Building and Statue."

Description: "This image captures the serene ambiance of a historic European town square at dusk. The square is dominated by a large, old building with a distinctive pointed roof and a clock tower. The building's facade is adorned with a statue, adding to its grandeur. The square is paved with cobblestones and features a fountain in the center, surrounded by a circular driveway. The sky is painted with hues of orange and pink, indicating the setting sun. The scene is further enhanced by the presence of a few cars parked around the square, and a few pedestrians can be seen walking around the area. The overall atmosphere is peaceful and picturesque, typical of a charming European town."

Keywords or Tags: "historic, europe, town square, dusk, old building, pointed roof, clock tower, statue, cobblestone, fountain, circular driveway, setting sun, cars, pedestrians, charming, picturesque." 
(mlx) ➜  vlm python mytest.py
Model:  mlx-community/Phi-3.5-vision-instruct-bf16
Fetching 13 files: 100%|██████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 28135.17it/s]
The repository for /Users/x/.cache/huggingface/hub/models--mlx-community--Phi-3.5-vision-instruct-bf16/snapshots/a2c307b346fcff186a6250ea4c0853688db7b633 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//Users/x/.cache/huggingface/hub/models--mlx-community--Phi-3.5-vision-instruct-bf16/snapshots/a2c307b346fcff186a6250ea4c0853688db7b633.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:520: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 13 files: 100%|██████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 23645.25it/s]
Image:  /Users/jrp/Pictures/Processed/20241026-174148_L1010280.jpg
==========
Image: /Users/jrp/Pictures/Processed/20241026-174148_L1010280.jpg 

Prompt: <|user|>
<|image_1|>Provide a formal caption, description and comma-separated keywords or tags for this image, suitable for microstock<|end|>
<|assistant|>

A serene town square with a prominent clock tower, a statue, and a mix of traditional and modern buildings. The sky is painted with hues of sunset, casting a warm glow over the scene. Cars are parked along the street, and the atmosphere is calm and peaceful.<|end|>
==========
Prompt: 33.464 tokens-per-sec
Generation: 8.177 tokens-per-sec
A serene town square with a prominent clock tower, a statue, and a mix of traditional and modern buildings. The sky is painted with hues of sunset, casting a warm glow over the scene. Cars are parked along the street, and the atmosphere is calm and peaceful.<|end|>
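
The Phi-3.5 output above ends with a literal `<|end|>` marker. A small sketch for trimming such markers before a caption is stored; the marker list is taken from the prompts and outputs in this thread and is an assumption, not a complete list, and `strip_end_markers` is a hypothetical helper name:

```python
# End-of-turn markers seen in the prompts and outputs in this thread; the
# list is an assumption and is not exhaustive.
END_MARKERS = ("<|end|>", "<|im_end|>", "<|eot_id|>")


def strip_end_markers(text, markers=END_MARKERS):
    """Remove any trailing end-of-turn markers and surrounding whitespace."""
    cleaned = text.strip()
    changed = True
    while changed:
        changed = False
        for marker in markers:
            if cleaned.endswith(marker):
                cleaned = cleaned[: -len(marker)].rstrip()
                changed = True
    return cleaned


print(strip_end_markers("The atmosphere is calm and peaceful.<|end|>"))
```

Only trailing markers are removed, so a marker quoted in the middle of a description would be left as it is.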

The script that I used was:

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

import os
from pathlib import Path

# model_path = "mlx-community/llava-1.5-7b-4bit"
# model_path = "mlx-community/llava-v1.6-mistral-7b-8bit"
# model_path = "mistral-community/pixtral-12b"
# model_path = "mlx-community/pixtral-12b-8bit"
# model_path = "Qwen/Qwen2-VL-7B-Instruct"  # libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 269535412224 bytes which is greater than the maximum allowed buffer size of 28991029248 bytes. (about 251 GiB requested vs. a 27 GiB limit)
# model_path = "mlx-community/llava-v1.6-34b-8bit"
model_path = "mlx-community/Phi-3.5-vision-instruct-bf16" # OK
# model_path = "mistral-community/pixtral-12b"
# model_path = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # Memory

print("Model: ", model_path)

# Load the model
model, processor = load(model_path)
config = load_config(model_path)

prompt = "Provide a formal caption, description and comma-separated keywords or tags for this image, suitable for microstock"

picpath = "/Users/x/Pictures/Processed"
pics = sorted(Path(picpath).iterdir(), key=os.path.getmtime, reverse=True)
pic = str(pics[0])
print("Image: ", pic)

# Apply chat template
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)

# Generate output (verbose=True already prints the result, so print(output) below shows it a second time)
output = generate(model, processor, pic, formatted_prompt, max_tokens=250, verbose=True)
print(output)
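
One caveat with the script above: it takes the most recently modified entry in the folder, whatever it is, so a stray non-image file could be picked up. A sketch of a stricter selection, assuming only a few common image extensions matter; `newest_image` and `IMAGE_SUFFIXES` are hypothetical names, not part of mlx-vlm:

```python
from pathlib import Path

# Assumed set of extensions; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def newest_image(folder):
    """Return the most recently modified image file in `folder` as a string."""
    candidates = [
        p
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    if not candidates:
        raise FileNotFoundError(f"no image files found in {folder}")
    return str(max(candidates, key=lambda p: p.stat().st_mtime))
```

In the script this could replace the two `pics` / `pic` lines with `pic = newest_image(picpath)`.
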
jrp2014 commented 1 day ago

Same script (with a reworded prompt and max_tokens=500), different picture / laptop:

The image

I'll leave you to draw your own conclusions, but I see that meta-llama/Llama-3.2-11B-Vision-Instruct wants to produce nice results. It is, however, very, very slow, and I'm not sure why: it is not using all of my machine's memory, and the CPU usage, at least, seems very low. Anyway, I gave up and interrupted it.

(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model:  mlx-community/llava-1.5-7b-4bit
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 37663.14it/s]
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 13129.58it/s]
Image:  20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg 

Prompt: USER: <image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily ASSISTANT:
Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
A group of people walking down a street in front of a building. The street is lined with shops and a traffic light. The people are carrying handbags and a baby carriage.
==========
Prompt: 34.096 tokens-per-sec
Generation: 67.423 tokens-per-sec
A group of people walking down a street in front of a building. The street is lined with shops and a traffic light. The people are carrying handbags and a baby carriage.
python mytest.py  2.46s user 4.32s system 107% cpu 6.285 total
(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model:  mlx-community/llava-v1.6-mistral-7b-8bit
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 55067.45it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 39850.87it/s]
Image:  20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg 

Prompt: [INST] <image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily [/INST]
Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Factual Caption: A quiet street in a historic town at dusk.

Description: The image depicts a serene evening scene in a European-style town. The street is lined with historic buildings, painted in a traditional color palette of white and black, with some featuring green roofs. The architecture suggests a blend of medieval and Renaissance styles. The street is illuminated by streetlights, casting a warm glow on the cobblestone pavement. A few pedestrians are visible, walking along the sidewalk, and a few cars are parked or moving along the road. The sky is overcast, and the overall atmosphere is calm and peaceful.

Keywords or Tags: historic town, cobblestone street, European architecture, evening, streetlights, pedestrians, cars, overcast sky, traditional color palette, green roofs, medieval, Renaissance, quiet, serene, dusk. 
==========
Prompt: 33.712 tokens-per-sec
Generation: 47.399 tokens-per-sec
Factual Caption: A quiet street in a historic town at dusk.

Description: The image depicts a serene evening scene in a European-style town. The street is lined with historic buildings, painted in a traditional color palette of white and black, with some featuring green roofs. The architecture suggests a blend of medieval and Renaissance styles. The street is illuminated by streetlights, casting a warm glow on the cobblestone pavement. A few pedestrians are visible, walking along the sidewalk, and a few cars are parked or moving along the road. The sky is overcast, and the overall atmosphere is calm and peaceful.

Keywords or Tags: historic town, cobblestone street, European architecture, evening, streetlights, pedestrians, cars, overcast sky, traditional color palette, green roofs, medieval, Renaissance, quiet, serene, dusk. 
python mytest.py  3.29s user 5.87s system 102% cpu 8.919 total
(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model:  mlx-community/llava-v1.6-34b-8bit
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 25329.72it/s]
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 16268.12it/s]
Image:  20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg 

Prompt: <|im_start|>user
<image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily<|im_end|>
<|im_start|>assistant

Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Caption: A bustling city street with a mix of modern and traditional architecture.

Description: The image captures a lively city scene at dusk. The street is lined with a variety of buildings, including a prominent half-timbered structure that adds a touch of historical charm to the modern urban landscape. People are seen walking along the sidewalk, going about their daily routines. The traffic light is glowing red, indicating a pause in the flow of vehicles. The sky overhead is filled with clouds, suggesting an overcast day.

Keywords: city street, half-timbered building, modern architecture, traditional architecture, pedestrians, traffic light, overcast sky, dusk, urban landscape, city life, buildings, people, vehicles, street, sidewalk, traffic, cityscape, urban, architecture, historical, modern, cloudy, red light, pedestrians, cars, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light,
==========
Prompt: 8.449 tokens-per-sec
Generation: 10.181 tokens-per-sec
Caption: A bustling city street with a mix of modern and traditional architecture.

Description: The image captures a lively city scene at dusk. The street is lined with a variety of buildings, including a prominent half-timbered structure that adds a touch of historical charm to the modern urban landscape. People are seen walking along the sidewalk, going about their daily routines. The traffic light is glowing red, indicating a pause in the flow of vehicles. The sky overhead is filled with clouds, suggesting an overcast day.

Keywords: city street, half-timbered building, modern architecture, traditional architecture, pedestrians, traffic light, overcast sky, dusk, urban landscape, city life, buildings, people, vehicles, street, sidewalk, traffic, cityscape, urban, architecture, historical, modern, cloudy, red light, pedestrians, cars, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light,
python mytest.py  7.84s user 19.89s system 42% cpu 1:05.21 total
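
The llava-v1.6-34b keyword list above loops on the same words until it runs out of tokens. A post-processing sketch that drops repeats while keeping the original order; `dedupe_keywords` is a hypothetical helper, and it only tidies the text afterwards rather than fixing the generation itself:

```python
def dedupe_keywords(line):
    """Split a comma-separated keyword string and drop repeats, keeping order."""
    seen = set()
    unique = []
    for raw in line.split(","):
        keyword = raw.strip().strip(".")
        key = keyword.lower()
        # Skip empty fragments (e.g. after a trailing comma) and repeats.
        if keyword and key not in seen:
            seen.add(key)
            unique.append(keyword)
    return unique


sample = "city street, dusk, urban, landscape, city, life, dusk, urban, landscape, city, life,"
print(", ".join(dedupe_keywords(sample)))
```

Comparison is case-insensitive, so "Dusk" and "dusk" count as the same keyword and the first spelling wins.
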
(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model:  mlx-community/Phi-3.5-vision-instruct-bf16
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 52028.58it/s]
The repository for /Users/xxx/.cache/huggingface/hub/models--mlx-community--Phi-3.5-vision-instruct-bf16/snapshots/a2c307b346fcff186a6250ea4c0853688db7b633 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//Users/xxx/.cache/huggingface/hub/models--mlx-community--Phi-3.5-vision-instruct-bf16/snapshots/a2c307b346fcff186a6250ea4c0853688db7b633.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:520: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 39397.36it/s]
Image:  20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg 

Prompt: <|user|>
<|image_1|>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily<|end|>
<|assistant|>

Street view with traditional half-timbered houses, pedestrians, traffic lights, and vehicles. Keywords: "half-timbered houses", "pedestrians", "traffic lights", "vehicles", "street view".<|end|>
==========
Prompt: 39.173 tokens-per-sec
Generation: 10.209 tokens-per-sec
Street view with traditional half-timbered houses, pedestrians, traffic lights, and vehicles. Keywords: "half-timbered houses", "pedestrians", "traffic lights", "vehicles", "street view".<|end|>
python mytest.py  2.81s user 2.74s system 50% cpu 10.892 total
(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model:  meta-llama/Llama-3.2-11B-Vision-Instruct
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 19691.57it/s]
Fetching 15 files: 100%|██████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 8341.89it/s]
Image:  20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>

The image depicts a street scene in an English town, with a large white and black half-timbered building as the central focus. The building features a brown roof and a sign that reads "The Kings Head" in red letters, indicating it is a pub or restaurant. The street is lined with people walking and cars driving, and there are several other buildings visible in the background, including a blue bus and a red brick building.

The overall atmosphere of the image is one of a quiet, small-town street scene, with people going about their daily business. The image captures a moment in time, with the cloudy sky and the people's activities creating a sense of calm and tranquility. The image could be used to represent a typical English town or village, with its traditional architecture and laid-back atmosphere. 

Keywords: English town, street scene, half-timbered building, pub, restaurant, people, cars, bus, brick buildings, cloudy sky, calm, tranquility, traditional architecture, small-town, village, daily^C^C^C^CTraceback (most recent call last):
  File "/Users/xxx/Documents/AI/mlx/scripts/vlm/mytest.py", line 36, in <module>
    output = generate(model, processor, pic, formatted_prompt, max_tokens=500, verbose=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1076, in generate
    for (token, prob), n in zip(generator, range(max_tokens)):
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 932, in generate_step
    yield y.item(), logprobs
          ^^^^^^^^
KeyboardInterrupt
^C
python mytest.py  5.64s user 841.10s system 77% cpu 18:15.46 total
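
To compare the runs side by side, the "Model:" and "tokens-per-sec" lines can be pulled out of the saved terminal output. A sketch that assumes the log format shown in this thread; `summarize_runs` is a hypothetical helper name:

```python
import re

MODEL_RE = re.compile(r"^Model:\s+(\S+)")
SPEED_RE = re.compile(r"^(Prompt|Generation): ([\d.]+) tokens-per-sec")


def summarize_runs(log_text):
    """Map each model name to its reported prompt/generation tokens-per-sec."""
    runs = {}
    current = None
    for line in log_text.splitlines():
        model = MODEL_RE.match(line)
        if model:
            current = model.group(1)
            runs[current] = {}
            continue
        speed = SPEED_RE.match(line)
        if speed and current is not None:
            runs[current][speed.group(1).lower()] = float(speed.group(2))
    return runs


sample = """Model:  mlx-community/llava-1.5-7b-4bit
Prompt: USER: <image>
Prompt: 34.096 tokens-per-sec
Generation: 67.423 tokens-per-sec
Model:  meta-llama/Llama-3.2-11B-Vision-Instruct
Prompt: <|begin_of_text|>
"""
for name, speeds in summarize_runs(sample).items():
    print(name, speeds)
```

A run that was interrupted before the statistics were printed, like the Llama-3.2 one above, ends up with an empty entry, which makes it easy to spot.
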
Blaizzy commented 1 day ago

Thanks!

I will look into it.

There is another issue reporting the same problem.