dusty-nv / NanoLLM

Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
https://dusty-nv.github.io/NanoLLM/
MIT License

LLaVa 1.6 anyres support #8

Open ai-and-i opened 2 months ago

ai-and-i commented 2 months ago

Hi @dusty-nv, thanks for this amazing library! We're using it in a cool art project for Burning Man :-)

I tested the new LLaVA 1.6 (specifically https://huggingface.co/lmms-lab/llama3-llava-next-8b), and it seems to work. However, reading the code, it seems that the anyres feature (dividing images into multiple patches) is not implemented in NanoLLM yet. For now, the image is simply downscaled (and cropped or padded) to a 336x336 square. Is that accurate, or am I missing anything?
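
For reference, here's my rough understanding of the difference as a sketch (illustrative only, not NanoLLM's actual preprocessing; the tile size, grid choice, and overview tile follow the LLaVA-1.6 paper, and the helper names are made up):

```python
# Illustrative sketch, not NanoLLM code.
# Single tile: the whole image becomes one 336x336 square.
# Anyres (LLaVA-1.6 style): pick a grid of 336x336 tiles that fits the image
# aspect ratio, cut the resized image into tiles, and keep a downscaled
# overview tile as well -- so the vision encoder runs once per tile.
from PIL import Image

TILE = 336  # CLIP ViT-L/14-336 input resolution

def single_tile(img: Image.Image) -> Image.Image:
    return img.resize((TILE, TILE), Image.BICUBIC)

def anyres_tiles(img: Image.Image, grid=(2, 2)) -> list[Image.Image]:
    cols, rows = grid  # real implementations select the grid from a list of allowed shapes
    resized = img.resize((cols * TILE, rows * TILE), Image.BICUBIC)
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    overview = img.resize((TILE, TILE), Image.BICUBIC)
    return [overview] + tiles  # 1 + rows*cols crops instead of 1
```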

dusty-nv commented 2 months ago

Hi @ai-and-i, that's correct, I only do the single tile to keep latency to a minimum, although I'd like to support this now that I have CLIP/SigLIP running faster through TensorRT. The anyres tiling schemes also evolve pretty rapidly, which is a challenge to keep up with, and they add more image tokens. The latest VILA models have increased input resolution to 384/448 and use 196 image tokens.
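
Rough back-of-envelope on why the extra tiles matter for latency (illustrative numbers only; exact counts depend on the vision encoder and any pooling/downsampling in the multimodal projector):

```python
# Illustrative token math -- a ViT-style encoder emits one token per patch.
def vit_tokens(input_res: int, patch: int = 14) -> int:
    return (input_res // patch) ** 2

print(vit_tokens(336))       # 576 tokens for a single 336x336 CLIP tile
print(5 * vit_tokens(336))   # 2880 tokens for a 2x2 anyres grid plus the overview tile
```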

Would love to hear more about your Burning Man project, it sounds awesome! And it would be good to understand your latency/accuracy tradeoffs. I optimize for max FPS, but I understand others might have different priorities.


ai-and-i commented 2 months ago

Great, thanks for confirming! We're still exploring, and will likely end up preferring low latency too. Unfortunately Llama-3-VILA1.5-8B didn't produce great results with our prompts, but llava-1.6 seems promising so far — even without anyres.

In case you're interested, we recently put some photos of last year's version here: https://www.instagram.com/ai_and_i.art/. It was based on GPT-4, but unreliable internet and model latency were quite annoying. So this year it's all built around a Jetson AGX Orin and local LLMs, with a lot more fun costumes, animations, and interactions already in the works!

rsun-bdti commented 1 month ago

I also observed that: Llama-3-VILA1.5-8B didn't produce great results no matter how we tweaked our prompts.