Open jrp2014 opened 2 days ago
Yes, you just need to pick any of the llava-v1.6 models in the mlx-community repo.
I don't know the exact memory requirements because you also have to factor in image processing.
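As a rough guide to the memory question, the weights alone can be estimated from the parameter count and the bit width. The sketch below is a back-of-the-envelope lower bound, not a measurement. The function name `weight_memory_gb` is hypothetical, and the quantization layout (one 16-bit scale and one 16-bit bias per group of 64 weights) is an assumption made for illustration; the vision encoder, activations, KV cache, and image features all come on top.

```python
def weight_memory_gb(params_billion: float, bits: int, group_size: int = 64) -> float:
    """Rough lower bound on memory for quantized weights, in GB (10^9 bytes).

    Assumes every parameter is stored in `bits` bits, plus one 16-bit scale
    and one 16-bit bias per group of `group_size` weights. The grouping
    scheme and the default of 64 are assumptions, not confirmed values.
    """
    bits_per_weight = bits + 32 / group_size
    return params_billion * bits_per_weight / 8


for name, params in [("7B", 7.0), ("34B", 34.0)]:
    for bits in (4, 8):
        estimate = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: about {estimate:.1f} GB for weights alone")
```

Under these assumptions the 34B model at 8-bit needs roughly 36 GB for weights before any image is processed, and about 19 GB at 4-bit, while the 7B model stays under 8 GB at either width.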
The example, with model_path = "mlx-community/llava-v1.6-34b-8bit", results in
The image you've provided appears to show a pair of shoes. However, the image is quite blurry and it's difficult to make out specific details. If you have any specific questions about shoes or need information related to them, feel free to ask!
rather than the expected two-cat result, which is produced by the v1.5 model, or
These are two cats lying on a pink blanket. The cat on the left appears to be a kitten with a striped coat, while the cat on the right is a larger cat with a tabby pattern. They seem to be resting or sleeping, and there are remote controls nearby, suggesting that they might be in a living room or a similar space where people relax and watch television.
with model_path = "mlx-community/llava-v1.6-mistral-7b-8bit".
Also, the python example on https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/llava seems not to work out of the box as you get "ImportError: attempted relative import with no known parent package" diagnostics. There is no doubt a simple fix.
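One common cause of this message is running a file that lives inside a package directly (`python path/to/file.py`) while the file uses relative imports. Whether that is what happened here is not confirmed by the thread. The sketch below reproduces the error with a throwaway package (`demo_pkg` is a hypothetical name) and shows that running the same file as a module with `-m` avoids it.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny package: demo_pkg/main.py imports from its sibling helper.py.
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "helper.py").write_text("GREETING = 'hello'\n")
    (pkg / "main.py").write_text("from .helper import GREETING\nprint(GREETING)\n")

    # Run the file directly: Python does not know which package it belongs to.
    as_script = subprocess.run(
        [sys.executable, str(pkg / "main.py")],
        capture_output=True,
        text=True,
    )

    # Run it as a module from the directory that contains the package.
    as_module = subprocess.run(
        [sys.executable, "-m", "demo_pkg.main"],
        cwd=tmp,
        capture_output=True,
        text=True,
    )

print("script exit code:", as_script.returncode)
print("script error    :", as_script.stderr.strip().splitlines()[-1])
print("module output   :", as_module.stdout.strip())
```

If the failing example was launched as a plain script from inside the package directory, installing the package and importing from it, or using `python -m` from the repository root, would be the analogous fix.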
Hey @jrp2014
I ran some tests, and I can't replicate the issue you presented.
I downloaded fresh copies of llava-v1.6-34b-4bit and 8bit and both provided the correct answer.
llava-v1.6-34b-4bit
llava-v1.6-34b-8bit
Also, the python example on https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/llava seems not to work out of the box as you get "ImportError: attempted relative import with no known parent package" diagnostics. There is no doubt a simple fix.
What example in particular did you try that failed?
Same example, on a 48 GB RAM M3 Max MacBook Pro. How do you clear the cache and get a new copy of the model?
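On the cache question, a minimal sketch follows. It assumes the model was downloaded through the Hugging Face hub client into its default cache layout, which the thread does not confirm. The helper names are hypothetical, and only the `HF_HUB_CACHE` and `HF_HOME` environment variables are handled. Deleting the model's folder forces a fresh download on the next load.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def hub_cache_dir() -> Path:
    """Return the hub cache directory, honouring HF_HUB_CACHE and HF_HOME."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"]).expanduser()
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home).expanduser() / "hub"


def cached_model_dir(repo_id: str, cache_dir: Optional[Path] = None) -> Path:
    """Map 'org/name' to the folder 'models--org--name' inside the cache."""
    base = cache_dir if cache_dir is not None else hub_cache_dir()
    return base / ("models--" + repo_id.replace("/", "--"))


def remove_cached_model(repo_id: str, cache_dir: Optional[Path] = None) -> bool:
    """Delete the cached copy of a model. Returns True if something was removed."""
    target = cached_model_dir(repo_id, cache_dir)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Only prints the expected location; nothing is deleted here.
print(cached_model_dir("mlx-community/llava-v1.6-34b-8bit"))
```

Calling `remove_cached_model("mlx-community/llava-v1.6-34b-8bit")` and then re-running the script should trigger a new download, provided the assumptions about the cache location hold.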
How are you running it? Via CLI or a script?
The Python script from the front page. (The script on the Llava page doesn’t work for me; see above.)
Please run it again and share a screenshot of your terminal like the one I provided earlier :)
The script that I am using is:
import mlx.core as mx
from mlx_vlm import load, generate
# model_path = "mlx-community/llava-1.5-7b-4bit"
#model_path = "mlx-community/llava-v1.6-mistral-7b-8bit"
model_path = "mlx-community/llava-v1.6-34b-8bit"
model, processor = load(model_path)
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nProvide a caption and keywords for this image"}],
    tokenize=False,
    add_generation_prompt=True,
)
image = "/Users/jrp/Pictures/Aiarty Output/20240622-154844_DSC00820_DxO_photo_x1_8640x5760.jpeg"
output = generate(model, processor, image, prompt, verbose=True)
print(output)
The results that I get with the 7B model are:
(mlx) ➜ mlx_vlm git:(main) ✗ python mytest.py
Fetching 10 files: 100%|███████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 151418.92it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Image: /Users/jrp/Pictures/Aiarty Output/20240622-154844_DSC00820_DxO_photo_x1_8640x5760.jpeg
Prompt: <s>[INST] <image>
Provide a caption and keywords for this image [/INST]
Caption: "Exploring the ancient ruins of a castle, surrounded by nature and history."
Keywords: castle, ruins, history, architecture, nature, stone, brick, people, tourism, outdoor, green, grass, trees, path, walkway, medieval, heritage, visit, sightseeing, travel, landscape, stone wall, old, fortress, tourist attraction, historical site, group, visitation, leisure, outdoor activity, exploration, visitation
(which is pretty good). With the 34B model, I get:
Fetching 15 files: 100%|███████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 110376.42it/s]
==========
Image: /Users/jrp/Pictures/Aiarty Output/20240622-154844_DSC00820_DxO_photo_x1_8640x5760.jpeg
Prompt: <|im_start|>user
<image>
Provide a caption and keywords for this image<|im_end|>
<|im_start|>assistant
Caption: A group of people standing on a beach.
Keywords: beach, people, group, standing, ocean, sand, shore, water, waves, horizon, sky, clouds, sun, weather, day, outdoor, leisure, vacation, travel, tourism, landscape, scenery, nature, environment, coastal, shoreline, seascape, seaside, coastal, shore, shoreline, seascape, seaside, coastal, shore, shoreline, seascape,
(which is nothing like the image.)
It seems likely to be some sort of memory overflow, but it would be better to report "out of memory", or whatever, than to, err, hallucinate. The only other thing I can think of is that there were various diagnostics related to locks when I first downloaded the 34B model, but nothing since.
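To make the comparison between variants easier to reproduce, the script above can be wrapped in a loop that runs the same image and question through several models. This is a sketch only: `caption_with_models` is a hypothetical helper, and `load` and `generate` are passed in as arguments so that the calls mirror the script exactly (`load(model_path)` and `generate(model, processor, image, prompt, verbose=True)`) without assuming anything else about the library.

```python
from typing import Callable, Dict, Iterable


def caption_with_models(
    model_paths: Iterable[str],
    image: str,
    question: str,
    load: Callable,
    generate: Callable,
) -> Dict[str, str]:
    """Run one image and one question through each model and collect the outputs."""
    results: Dict[str, str] = {}
    for model_path in model_paths:
        # Each iteration rebinds model and processor, so only one model
        # is referenced by this function at a time.
        model, processor = load(model_path)
        prompt = processor.tokenizer.apply_chat_template(
            [{"role": "user", "content": f"<image>\n{question}"}],
            tokenize=False,
            add_generation_prompt=True,
        )
        results[model_path] = generate(model, processor, image, prompt, verbose=True)
    return results


# Intended use, with the same imports as the script above:
#   from mlx_vlm import load, generate
#   captions = caption_with_models(
#       ["mlx-community/llava-v1.6-mistral-7b-8bit", "mlx-community/llava-v1.6-34b-8bit"],
#       "path/to/image.jpeg",
#       "Provide a caption and keywords for this image",
#       load,
#       generate,
#   )
```

Keeping the image and question fixed means any difference in the captions comes from the model variant rather than from the prompt.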
Can you share the image you used? I will try and debug it.
As described above, it's a similar result with the image in the front page example (which seems to run without problem in your context, from the above).
I'm talking about this image:
Image: /Users/jrp/Pictures/Aiarty Output/20240622-154844_DSC00820_DxO_photo_x1_8640x5760.jpeg
Running the script on the front page, I get:
Is this OK??
I assume that to run the llava v1.6 models I just look at the versions available on Hugging Face. Are there any further details on how much memory is required to run the 7B vs. 34B variants, and how much better the 8-bit models are than the 4-bit ones, please?