jrp2014 opened this issue 3 weeks ago
Same script, different pic / laptop:
I'll leave you to draw your own conclusions, but meta-llama/Llama-3.2-11B-Vision-Instruct looks like it produces good results. It is, however, extremely slow, and I'm not sure why: it is not using all of my machine's memory, and the CPU usage, at least, seems very low. In the end I gave up and interrupted the run.
(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model: mlx-community/llava-1.5-7b-4bit
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 37663.14it/s]
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 13129.58it/s]
Image: 20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg
Prompt: USER: <image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily ASSISTANT:
Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
A group of people walking down a street in front of a building. The street is lined with shops and a traffic light. The people are carrying handbags and a baby carriage.
==========
Prompt: 34.096 tokens-per-sec
Generation: 67.423 tokens-per-sec
A group of people walking down a street in front of a building. The street is lined with shops and a traffic light. The people are carrying handbags and a baby carriage.
python mytest.py 2.46s user 4.32s system 107% cpu 6.285 total
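The deprecation warning in the run above asks for `patch_size` and `vision_feature_select_strategy` to be set on the processor. A minimal sketch of doing that defensively follows. The helper `ensure_processor_attrs` is hypothetical, the stand-in processor is a plain object rather than a real mlx-vlm or transformers processor, and the values `14` and `"default"` are assumptions that should be checked against the model's own config. Whether setting these attributes silences the warning in this particular setup is untested.

```python
from types import SimpleNamespace


def ensure_processor_attrs(processor, patch_size, vision_feature_select_strategy):
    """Set the two attributes named in the warning, but only if they are missing or None."""
    if getattr(processor, "patch_size", None) is None:
        processor.patch_size = patch_size
    if getattr(processor, "vision_feature_select_strategy", None) is None:
        processor.vision_feature_select_strategy = vision_feature_select_strategy
    return processor


# Stand-in object for illustration only; in mytest.py this would be the loaded processor.
processor = SimpleNamespace(patch_size=None)

# Assumed values: read the real ones from the model's configuration files.
ensure_processor_attrs(processor, patch_size=14, vision_feature_select_strategy="default")
print(processor.patch_size, processor.vision_feature_select_strategy)
```

Values that are already set are left untouched, so the helper should be safe to call on processors whose config already carries these fields.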
(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model: mlx-community/llava-v1.6-mistral-7b-8bit
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 55067.45it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 39850.87it/s]
Image: 20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg
Prompt: [INST] <image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily [/INST]
Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Factual Caption: A quiet street in a historic town at dusk.
Description: The image depicts a serene evening scene in a European-style town. The street is lined with historic buildings, painted in a traditional color palette of white and black, with some featuring green roofs. The architecture suggests a blend of medieval and Renaissance styles. The street is illuminated by streetlights, casting a warm glow on the cobblestone pavement. A few pedestrians are visible, walking along the sidewalk, and a few cars are parked or moving along the road. The sky is overcast, and the overall atmosphere is calm and peaceful.
Keywords or Tags: historic town, cobblestone street, European architecture, evening, streetlights, pedestrians, cars, overcast sky, traditional color palette, green roofs, medieval, Renaissance, quiet, serene, dusk.
==========
Prompt: 33.712 tokens-per-sec
Generation: 47.399 tokens-per-sec
Factual Caption: A quiet street in a historic town at dusk.
Description: The image depicts a serene evening scene in a European-style town. The street is lined with historic buildings, painted in a traditional color palette of white and black, with some featuring green roofs. The architecture suggests a blend of medieval and Renaissance styles. The street is illuminated by streetlights, casting a warm glow on the cobblestone pavement. A few pedestrians are visible, walking along the sidewalk, and a few cars are parked or moving along the road. The sky is overcast, and the overall atmosphere is calm and peaceful.
Keywords or Tags: historic town, cobblestone street, European architecture, evening, streetlights, pedestrians, cars, overcast sky, traditional color palette, green roofs, medieval, Renaissance, quiet, serene, dusk.
python mytest.py 3.29s user 5.87s system 102% cpu 8.919 total
(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model: mlx-community/llava-v1.6-34b-8bit
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 25329.72it/s]
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 16268.12it/s]
Image: 20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg
Prompt: <|im_start|>user
<image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily<|im_end|>
<|im_start|>assistant
Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Caption: A bustling city street with a mix of modern and traditional architecture.
Description: The image captures a lively city scene at dusk. The street is lined with a variety of buildings, including a prominent half-timbered structure that adds a touch of historical charm to the modern urban landscape. People are seen walking along the sidewalk, going about their daily routines. The traffic light is glowing red, indicating a pause in the flow of vehicles. The sky overhead is filled with clouds, suggesting an overcast day.
Keywords: city street, half-timbered building, modern architecture, traditional architecture, pedestrians, traffic light, overcast sky, dusk, urban landscape, city life, buildings, people, vehicles, street, sidewalk, traffic, cityscape, urban, architecture, historical, modern, cloudy, red light, pedestrians, cars, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light,
==========
Prompt: 8.449 tokens-per-sec
Generation: 10.181 tokens-per-sec
Caption: A bustling city street with a mix of modern and traditional architecture.
Description: The image captures a lively city scene at dusk. The street is lined with a variety of buildings, including a prominent half-timbered structure that adds a touch of historical charm to the modern urban landscape. People are seen walking along the sidewalk, going about their daily routines. The traffic light is glowing red, indicating a pause in the flow of vehicles. The sky overhead is filled with clouds, suggesting an overcast day.
Keywords: city street, half-timbered building, modern architecture, traditional architecture, pedestrians, traffic light, overcast sky, dusk, urban landscape, city life, buildings, people, vehicles, street, sidewalk, traffic, cityscape, urban, architecture, historical, modern, cloudy, red light, pedestrians, cars, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light, overcast, sky, dusk, urban, landscape, city, life, buildings, people, traffic, light,
python mytest.py 7.84s user 19.89s system 42% cpu 1:05.21 total
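In the llava-v1.6-34b-8bit run above, the keyword list falls into a repetition loop and appears to continue until the token limit is reached. A small post-processing sketch, assuming the keywords arrive as one comma-separated string, can at least collapse the repeats. It does not address the loop itself or the time spent generating it, and `dedupe_keywords` is a hypothetical helper, not part of mlx-vlm.

```python
def dedupe_keywords(raw: str) -> list[str]:
    """Split a comma-separated keyword string and drop repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for part in raw.split(","):
        keyword = part.strip()
        key = keyword.lower()  # compare case-insensitively
        if keyword and key not in seen:
            seen.add(key)
            unique.append(keyword)
    return unique


# Shortened sample in the style of the looping output above.
sample = "city street, pedestrians, traffic light, dusk, pedestrians, city, life, city, life, traffic, light,"
print(dedupe_keywords(sample))
# ['city street', 'pedestrians', 'traffic light', 'dusk', 'city', 'life', 'traffic', 'light']
```

Fragments such as "traffic" and "light" survive because they are distinct strings from "traffic light"; filtering those would need a further rule.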
(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model: mlx-community/Phi-3.5-vision-instruct-bf16
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 52028.58it/s]
The repository for /Users/xxx/.cache/huggingface/hub/models--mlx-community--Phi-3.5-vision-instruct-bf16/snapshots/a2c307b346fcff186a6250ea4c0853688db7b633 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//Users/xxx/.cache/huggingface/hub/models--mlx-community--Phi-3.5-vision-instruct-bf16/snapshots/a2c307b346fcff186a6250ea4c0853688db7b633.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Do you wish to run the custom code? [y/N] y
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:520: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 39397.36it/s]
Image: 20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg
Prompt: <|user|>
<|image_1|>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily<|end|>
<|assistant|>
Street view with traditional half-timbered houses, pedestrians, traffic lights, and vehicles. Keywords: "half-timbered houses", "pedestrians", "traffic lights", "vehicles", "street view".<|end|>
==========
Prompt: 39.173 tokens-per-sec
Generation: 10.209 tokens-per-sec
Street view with traditional half-timbered houses, pedestrians, traffic lights, and vehicles. Keywords: "half-timbered houses", "pedestrians", "traffic lights", "vehicles", "street view".<|end|>
python mytest.py 2.81s user 2.74s system 50% cpu 10.892 total
(mlx) xxx@xxxs-MacBook-Pro vlm % time python mytest.py
Model: meta-llama/Llama-3.2-11B-Vision-Instruct
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 19691.57it/s]
Fetching 15 files: 100%|██████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 8341.89it/s]
Image: 20241116-164633_L1010479_DxO.jpg
==========
Image: /Users/xxx/Pictures/Processed/20241116-164633_L1010479_DxO.jpg
Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>
The image depicts a street scene in an English town, with a large white and black half-timbered building as the central focus. The building features a brown roof and a sign that reads "The Kings Head" in red letters, indicating it is a pub or restaurant. The street is lined with people walking and cars driving, and there are several other buildings visible in the background, including a blue bus and a red brick building.
The overall atmosphere of the image is one of a quiet, small-town street scene, with people going about their daily business. The image captures a moment in time, with the cloudy sky and the people's activities creating a sense of calm and tranquility. The image could be used to represent a typical English town or village, with its traditional architecture and laid-back atmosphere.
Keywords: English town, street scene, half-timbered building, pub, restaurant, people, cars, bus, brick buildings, cloudy sky, calm, tranquility, traditional architecture, small-town, village, daily^C^C^C^C
Traceback (most recent call last):
  File "/Users/xxx/Documents/AI/mlx/scripts/vlm/mytest.py", line 36, in <module>
    output = generate(model, processor, pic, formatted_prompt, max_tokens=500, verbose=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1076, in generate
    for (token, prob), n in zip(generator, range(max_tokens)):
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 932, in generate_step
    yield y.item(), logprobs
          ^^^^^^^^
KeyboardInterrupt
^C
python mytest.py 5.64s user 841.10s system 77% cpu 18:15.46 total
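To compare the runs above side by side, the throughput lines can be pulled out of a saved copy of the terminal output. The sketch below assumes the log keeps the `Model:` and `... tokens-per-sec` line shapes shown in this thread; `summarize` is a hypothetical helper and the embedded log is an excerpt of the numbers reported above.

```python
import re

# Line shapes taken from the logs in this thread.
MODEL_RE = re.compile(r"^Model: (\S+)$")
RATE_RE = re.compile(r"^(Prompt|Generation): ([\d.]+) tokens-per-sec$")


def summarize(log_text: str) -> dict[str, dict[str, float]]:
    """Map each model name to the prompt/generation rates reported after it."""
    results: dict[str, dict[str, float]] = {}
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        model = MODEL_RE.match(line)
        if model:
            current = model.group(1)
            results[current] = {}
            continue
        rate = RATE_RE.match(line)
        if rate and current is not None:
            results[current][rate.group(1).lower()] = float(rate.group(2))
    return results


log = """\
Model: mlx-community/llava-1.5-7b-4bit
Prompt: USER: <image>
Prompt: 34.096 tokens-per-sec
Generation: 67.423 tokens-per-sec
Model: mlx-community/llava-v1.6-34b-8bit
Prompt: 8.449 tokens-per-sec
Generation: 10.181 tokens-per-sec
Model: meta-llama/Llama-3.2-11B-Vision-Instruct
"""

for name, rates in summarize(log).items():
    print(name, rates)
```

A run that was interrupted before the statistics were printed, like the Llama-3.2 one, shows up with an empty entry, which makes the missing measurement explicit.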
Thanks!
I will look into it.
There is another issue reporting the same problem.
I have tried to get several different models to generate captions, keywords, etc., for this image:
Some of the results are worse (cheesier, less accurate) than others. Some models have quirks / deprecations.
The script that I used was: