I tried it using the Python code from Hugging Face. It works OK, but for keywording / captioning it doesn't seem to like producing keywords; instead it produces a bunch of prose. This repo with LLaVA 1.6 + Mistral is slightly faster and produces keywords as asked.
Could you elaborate and show some examples?
Any updates?
Not yet. My plate is full with #41
Could you elaborate and show some examples?
Running:
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
import os
from pathlib import Path
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="mps",
)
processor = AutoProcessor.from_pretrained(model_id)
picpath = "/Users/xxx/Pictures/Processed"
pics = sorted(Path(picpath).iterdir(), key=os.path.getmtime, reverse=True)
pic = str(pics[0])
print(pic)
#url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
#image = Image.open(requests.get(url, stream=True).raw)
image = Image.open(pic)
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        # {"type": "text", "text": "Provide a caption and a list of keywords for this image, suitable for microstock."}
        # {"type": "text", "text": "Analyze this image and provide a title (max 70 characters), description (max 200 characters), and comma-separated keywords suitable for microstock photography websites."}
        {"type": "text", "text": "You are an AI assistant that helps people craft a clear and detailed sentence that describes the content depicted in an image. Then generate a list of descriptive, comma-separated tags for the following image. Analyze the image carefully and produce tags that accurately represent the image. Ensure the tags are relevant."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0]))
produces:
(mlx) ➜ mlx_vlm git:(main) ✗ python llama32.py
Downloading shards: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 7.68it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 5/5 [00:14<00:00, 2.85s/it]
/Users/xxx/Pictures/Processed/20240928-153019_L1000930-Pano-Edit.jpg
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/PIL/Image.py:3368: DecompressionBombWarning: Image size (116419640 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>
<|image|>You are an AI assistant that helps people craft a clear and detailed sentence that describes the content depicted in an image. Then generate a list of descriptive, comma-separated tags for the following image. Analyze the image carefully and produce tags that accurately represent the image. Ensure the tags are relevant.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
The image depicts a large stone building with a brown roof, situated on the banks of a river, with a church in the background. The purpose of the image is to showcase the architectural beauty and historical significance of the building and its surroundings.
* A large stone building with a brown roof:
+ The building is made of light-colored stone
+ It has a large number of windows
+ The roof is brown and has several chimneys
+ The building appears to be old and historic
* A river in front of the building:
+ The river is wide and calm
+ It reflects the light from the sky
+ There are trees and greenery along its banks
+ The river adds to the serene atmosphere of the scene
* A church in the background:
+ The church is made of stone and has a tall steeple
+ It has a large arched window on the front
+ The church appears to be older than the building in the foreground
+ It adds to the sense of history and tradition in the image
Overall, the image presents a picturesque scene of a historic building and church situated by a river, evoking a sense of tranquility and nostalgia. The contrast between the old and new structures adds depth and interest to the image.<|eot_id|>
which is OK, but not what I asked for.
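(As an aside, the doubled <|begin_of_text|> at the start of that transcript suggests the prompt picks up one BOS token from apply_chat_template and a second one from the processor call. A minimal tweak, assuming current transformers behaviour, is to pass add_special_tokens=False when building the inputs; the rest of the script stays the same.)

inputs = processor(
    image,
    input_text,
    add_special_tokens=False,  # apply_chat_template already inserted <|begin_of_text|>
    return_tensors="pt",
).to(model.device)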
If I run through your framework:
import mlx.core as mx
from mlx_vlm import load, generate
import os
from pathlib import Path
# model_path = "mlx-community/llava-1.5-7b-4bit"
model_path = "mlx-community/llava-v1.6-mistral-7b-8bit"
#model_path = "mlx-community/llava-v1.6-34b-8bit"
#model_path = "mlx-community/Phi-3.5-vision-instruct-bf16"
model, processor = load(model_path)
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nProvide a formal caption and keywords for this image, suitable for microstock"}],
    tokenize=False,
    add_generation_prompt=True,
)
picpath = "/Users/xxx/Pictures/Processed"
pics = sorted(Path(picpath).iterdir(), key=os.path.getmtime, reverse=True)
pic = str(pics[0])
print(pic)
#output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=True)
output = generate(model, processor, pic, prompt, max_tokens=200, verbose=True)
print(output)
I get
(mlx) ➜ mlx_vlm git:(main) ✗ python mytest.py
Fetching 10 files: 100%|███████████████████████████████████████████████████████| 10/10 [00:00<00:00, 29006.25it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
/Users/jrp/Pictures/Processed/20240928-153019_L1000930-Pano-Edit.jpg
==========
Image: /Users/jrp/Pictures/Processed/20240928-153019_L1000930-Pano-Edit.jpg
Prompt: <s>[INST] <image>
Provide a formal caption and keywords for this image, suitable for microstock [/INST]
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/PIL/Image.py:3368: DecompressionBombWarning: Image size (116419640 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Caption: "Historic European Castle Overlooking a River"
Keywords: "castle, architecture, river, old, stone, medieval, water, buildings, European, stone bridge, moat, green trees, sky, church, stone walls, historical, picturesque, serene, landscape, stone tower, fortress, ruins, stone roof, stone arch, stone windows, stone chimney, stone gate, stone balcony, stone courtyard, stone stairs, stone railing, stone fencing, stone pillars, stone statues, stone turret, stone archway, stone bridge arch, stone bridge railing, stone bridge pillars, stone bridge archway, stone bridge railing, stone bridge pillars, stone bridge archway, stone bridge railing, stone bridge pillars, stone bridge archway, stone bridge railing, stone bridge pillars, stone bridge archway, stone bridge railing, stone bridge pillars, stone bridge archway
==========
Prompt: 17.748 tokens-per-sec
Generation: 34.798 tokens-per-sec
Caption: "Historic European Castle Overlooking a River"
Keywords: "castle, architecture, river, old, stone, medieval, water, buildings, European, stone bridge, moat, green trees, sky, church, stone walls, historical, picturesque, serene, landscape, stone tower, fortress, ruins, stone roof, stone arch, stone windows, stone chimney, stone gate, stone balcony, stone courtyard, stone stairs, stone railing, stone fencing, stone pillars, stone statues, stone turret, stone archway, stone bridge arch, stone bridge railing, stone bridge pillars, stone bridge archway, stone bridge railing, stone bridge pillars, stone bridge archway, stone bridge railing, stone bridge pillars, stone bridge archway, stone bridge railing, stone bridge pillars, stone bridge archway, stone bridge railing, stone bridge pillars, stone bridge archway
which is slightly less accurate, but closer in format to what I wanted.
I have not run this for a week or more and observe that
Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
is a new warning. I have no idea what values I should set for these attributes.
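Reading the warning, it sounds like they can simply be set on the processor returned by load() before calling generate(). The values below are my guess for llava-v1.6-mistral-7b (it uses a CLIP ViT-L/14 vision tower), so treat them as an assumption to be checked against the model's config on the Hub rather than confirmed values:

# Assumed values for llava-v1.6-mistral-7b; verify against the model's
# preprocessor/config files before relying on them.
processor.patch_size = 14
processor.vision_feature_select_strategy = "default"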
Also, I don't remember getting the repeated output (below the ==========) in earlier iterations.
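(The DecompressionBombWarning in both runs, by contrast, is just PIL flagging that the ~116 MP panorama exceeds its default pixel limit. If the images are trusted, a minimal way to silence it is to raise the limit before opening them; the threshold below is arbitrary.)

from PIL import Image

# Raise PIL's decompression-bomb threshold for trusted, very large panoramas;
# setting it to None disables the check entirely.
Image.MAX_IMAGE_PIXELS = 250_000_000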
I'm working on porting it to MLX.
Where can I download the weights for MLX?
It's coming