ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Enhancement: We need CogVLM support - extremely good image and text analysis, feels like a multi-generational step forward. #4387

Closed · cmp-nct closed this 2 days ago

cmp-nct commented 7 months ago

Discussed in https://github.com/ggerganov/llama.cpp/discussions/4350

Originally posted by **cmp-nct** December 7, 2023

I've just seen CogVLM, which is a Vicuna 7B language model behind a 9B vision tower (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k), under an open-source license. I've compared it with llava-1.5 (not even comparable) and Qwen-VL, and it beats Qwen-VL by a margin in OCR ability, detection of details, and having no or almost no hallucinations. It understands handwritten as well as typed letters, context, fine details, and background graphics. It can also locate tiny visual targets with pixel coordinates. I'm quite blown away that I didn't know about it before.

I believe this is what we need. It has similarities to llava but adds an additional expert model, so it's not super quick to implement. In addition, the ViT needs K-type quantization support. Definitely worth a close look.

URL: https://github.com/THUDM/CogVLM
Webdemo: http://36.103.203.44:7861/
Paper: https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf

Look at this example. I asked for a JSON representation - not cherry-picked, it can actually extract all of the content with minimal errors:

![image](https://github.com/ggerganov/llama.cpp/assets/78893154/4834e936-4bd5-4997-bde5-c58283556772)

```
{
  "License_Number": "DL 11234568",
  "Expiration_Date": "08/31/2014",
  "Class": "C",
  "License_Holder": "Iva Cardholder",
  "Address": "2570 24th Street, Anytown, CA 95818",
  "Veteran": "YES",
  "Sex": "F",
  "Hair_Color": "Brown",
  "Eyes": "Brown",
  "Height": "125 lb",
  "Weight": "125 lb",
  "Issue_Date": "08/31/2009",
  "Birth_Date": "08/31/1977",
  "VISUALS": {
    "left": "Iva Cardholder",
    "right": "Iva Cardholder",
    "top": "California Bear",
    "bottom": "Gold Rush Miner"
  }
}
```

Here is what Qwen-VL does:

```
{
  "id": "123456",
  "issue_date": "08/31/2014",
  "expiration_date": "08/31/2024",
  "cardholder": {
    "name": "Ina Cordero",
    "gender": "F",
    "race": "A",
    "eye_color": "B",
    "hair_color": "R",
    "eyebrow_color": "N",
    "height": "5'6"",
    "weight": "126 lb"
  },
  "class": "C",
  "type": "DRIVER LICENSE",
  "state": "CA",
  "country": "USA"
}
```

Here is llava-1.5-13B:

```
{
  "image": "https://i.imgur.com/39vZv.jpg",
  "description": "A California driver's license with a woman's picture on it. The license is blue and white and has a picture of a bear on it. The license number is 11324567890."
}
```

I've not yet looked into architectural challenges, but this is literally a game changer. That's seriously good OCR, and its image detection abilities are beyond anything I've remotely seen from llava-1.5/ShareGPT4V.

@monatis @FSSRepo
mirek190 commented 7 months ago

I've known about that model for some time and also mentioned it here, but NO ONE CARED... why, I do not know.

ericruleman commented 7 months ago

+1 CogVLM is the best open source vision model currently available. Having a super powerful multi-modal LLM that's easy to run locally is a game changer.

I know that Ollama is looking to add CogVLM support, but they need llama.cpp to support it first.

truebit commented 6 months ago

+1 CogVLM/CogAgent is amazing at mobile UI detection and UI object detection.

mirek190 commented 6 months ago

It's better than GPT-4V.

darkacorn commented 6 months ago

+1, we need it ASAP!

Foul-Tarnished commented 6 months ago

MobileVLM might be even better

cmp-nct commented 6 months ago

> MobileVLM might be even better

There is no online demo of it. It uses the same vision encoder and a similar projection as llava, just with a tiny LLM instead of the 7B Vicuna. It is trained on llava-1.5 data (so it lacks a lot of ShareGPT4V knowledge), but claiming that tiny thing is on eye level with GPT-4V needs more evidence than five words. The benchmarks do not support it.

I didn't test it on llama.cpp, but my guess is that it requires minimal changes to get the language model supported - the projection has small changes as well (normalization). Regarding support: the authors actually already had it working in llama.cpp according to the paper (they mention using Q4 on the LLM) but didn't release the changes as a fork or PR, for some reason?

I'm not saying it is not what you claim - just that from what I've seen at first glance, I find it highly unlikely. It would be a huge development, showcasing what the small CLIP can do despite everyone else not being able to do the same.

I believe MobileVLM is worthy of support; it's tiny and appears to be a little worse than llava-1.5, but of course much faster. That should not distract from CogVLM being the best open-source one.

darkacorn commented 6 months ago

CogVLM is far better than llava - llava already works in most places - so please let's stick with CogVLM if anyone embarks on this, as it takes about 80 GB of VRAM here in fp16, and bnb (bitsandbytes) isn't cutting it.

darkacorn commented 6 months ago

Pong - how does this not get any traction?! The main example given is worthless, as that's simple OCR - but CogVLM is so much better than llava on every vertical.

dtiarks commented 6 months ago

I started looking into it, but have a lot on my schedule currently.

husnoo commented 6 months ago

@darkacorn, I'd like to test it, what's your branch called?

darkacorn commented 6 months ago

https://github.com/THUDM/CogVLM - no branch, and I'm not affiliated with them either.

darkacorn commented 6 months ago

> I started looking into it, but have a lot on my schedule currently.

Understandable - I talked to turboderp (exllama) and casper (AutoAWQ) too. Apparently it's quite a bit of work to get quantization/inference going outside of the regular transformer architecture.

dtiarks commented 6 months ago

Yep, that's also my feeling. To make the deep feature fusion work you have to provide an additional mask as input. That's quite different from the usual stuff.
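
For context, here is a minimal sketch of what such a mask-driven "visual expert" looks like, assuming the design described in the CogVLM paper (separate projection/FFN weights for image tokens, selected per token by a type mask); the names and shapes are purely illustrative and not taken from any implementation:

```python
import numpy as np

T, D = 8, 16                                     # toy sequence length and hidden size
hidden = np.random.randn(T, D)                   # fused text+image hidden states

# The extra input: 1 where the token comes from the image, 0 for text tokens.
token_type = np.array([0, 0, 1, 1, 1, 1, 0, 0])

W_text = np.random.randn(D, D)                   # language expert projection
W_image = np.random.randn(D, D)                  # visual expert projection

# Dense formulation: compute both projections and blend them with the mask,
# so every layer needs the token-type mask alongside the hidden states.
mask = token_type[:, None].astype(hidden.dtype)  # shape (T, 1)
out = (1.0 - mask) * (hidden @ W_text) + mask * (hidden @ W_image)
print(out.shape)                                 # (8, 16)
```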

chigkim commented 5 months ago

I'd love to see Cogvlm support as well.

glarsson commented 5 months ago

You are all (including me) welcome to contribute either code or money for a coffee for these hard-working individuals making this stuff work for us.

mirek190 commented 5 months ago

I'm also waiting for it ....

darkacorn commented 5 months ago

We're all waiting - we need more eyeballs on this "feature request". Sadly, most people don't seem to care enough about vision yet.

According to turboderp (exllama), getting a rough version done is about 50 hours of work initially, plus of course the upkeep afterwards - but given the little demand it has, it seems like a wasted effort.

We really just need one quantization working, and then we can adapt it pretty quickly to everything else.

glarsson commented 5 months ago

What do you mean it seems to be a wasted effort?

darkacorn commented 5 months ago

"given the litte demand is has - it seems to be a wasted effort" i dont know how i could be clearer in that statement - if more people would be interrested in vision that would turn faster .. but apparently most just focus on regular llm's / multimodality does sadly not have a huge demand

glarsson commented 5 months ago

Alright, I wasn't trying to diminish your point, but thanks for explaining it; I did not realize that.

darkacorn commented 5 months ago

Trust me, I would love for it to be quantized too - it would make my life easier. At 36 GB per fp16 model, and eventually wanting all three in VRAM, it just blocks up my resources. I would love to have it smaller and faster, but unless a few experts chip in and start, it's just not the most rewarding work for them, as very few people want it - even though CogVLM is the best vision model we've got.

https://github.com/THUDM/CogVLM/discussions/346

Let's see, maybe they'll chip in and get the ball rolling.

mirek190 commented 5 months ago

I also don't understand why there is so little interest in CogVLM, because it is far better than llava, which is still in development...

dtiarks commented 5 months ago

After working on it for a bit, I found that it is not trivial to convert it to llama.cpp. The implementation of EVA-CLIP is different from the OpenAI CLIP model. There are some subtleties I'm trying to wrap my head around. So progress is relatively slow, but interest is there...

darkacorn commented 5 months ago

@dtiarks If you are up for it, hop on Discord - we are all on TheBloke AI's Discord (the link should be on any Hugging Face repo he has; I don't want to spam it here).

Thanks for narrowing the problem set down at least a bit.

I'm sure turboderp / casper can help narrow those "subtleties" down even further.

longregen commented 5 months ago

This would be a game changer, since CogVLM is so much better than llava. Using llava after seeing what CogVLM can do feels like asking llama 7B for code after using GPT-4.

cmp-nct commented 5 months ago

I personally have changed my mind. CogVLM is a huge thing, but no one really wanted to invest the work to integrate it. Now we have xcomposer2, which is almost as fast as llava-1.5, has higher resolution than CogVLM, and is quite possibly better as well: https://github.com/ggerganov/llama.cpp/pull/5232

A good part of the work is done, though my available time is limited for a while and the LoRA integration is not done yet.

longregen commented 5 months ago

If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVa 1.6 working and merged. If interested, contact me at claude @ the domain in my profile.

husnoo commented 5 months ago

https://github.com/ggerganov/llama.cpp/issues/5266

chigkim commented 5 months ago

> If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVa 1.6 working and merged. If interested, contact me at claude @ the domain in my profile.

@cmp-nct is already working on Llava 1.6 in https://github.com/ggerganov/llama.cpp/pull/5267

longregen commented 5 months ago

> If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVa 1.6 working and merged. If interested, contact me at claude @ the domain in my profile.
>
> @cmp-nct is already working on Llava 1.6 in #5267

Yes, but it seems like the author can't work on it and/or has other priorities

cmp-nct commented 5 months ago

> If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVa 1.6 working and merged. If interested, contact me at claude @ the domain in my profile.
>
> @cmp-nct is already working on Llava 1.6 in #5267
>
> Yes, but it seems like the author can't work on it and/or has other priorities

For me it's a small side project; I have dozens of large (commercial) projects I am working on. I really want to see full llava-1.6 (as well as CogVLM and xcomposer2) support. llava-1.6 is the best candidate, followed by xcomposer2, then CogVLM.

A bounty is intriguing. Ironically, I once tried the same on Fiverr to advance this project, and not one of the "AI developers" there was actually able to contribute anything. Though the requirement to have it merged is maybe too much, as you have only limited influence over actually getting something merged, even if a PR is fully functional.

So I've not given up, I'm just making slow progress at the moment. Also happy to add a collaborator to my PR branch, of course.

cmp-nct commented 5 months ago

One pesky bug remains, but it's already working quite well, especially the large model: https://github.com/ggerganov/llama.cpp/pull/5267

You'll need to re-create the projector GGUF files; you can keep the LLM GGUF files. For the projector you need to add the variables to config.json (as described), otherwise it will be detected as llava-1.5. I've uploaded a config.json to my HF and am uploading the projectors as well.

You'll notice llava-1.6 is working if it needs a ton of embedding tokens: llava-1.5 uses 576 image tokens, llava-1.6 up to 5 times that.
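
For reference, a rough back-of-the-envelope on where those token counts come from, assuming the standard ViT-L/14 encoder at 336 px that llava uses and llava-1.6's tiling of up to four high-resolution tiles plus the downscaled base image:

```python
# Illustrative arithmetic only; the exact tiling depends on the image aspect ratio.
patch_size = 14
resolution = 336
tokens_per_image = (resolution // patch_size) ** 2   # 24 * 24 = 576 (llava-1.5)

max_tiles = 4                                        # llava-1.6 high-res tiles
max_tokens_16 = (max_tiles + 1) * tokens_per_image   # + base image = 2880 tokens
print(tokens_per_image, max_tokens_16)               # 576 2880
```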

**Update:** The PR is ready for use and the tensor bug is handled, but it's not merged into main llama.cpp yet, so you need to manually check out the branch/PR.

mjspeck commented 4 months ago

Now that LLaVA 1.6 has been added, is there no longer much interest in adding CogVLM?

dtiarks commented 4 months ago

@mjspeck I started implementing it and got pretty far. However, I got stuck at a point where I need some input from experts like @ggerganov.

There is a branch at https://github.com/dtiarks/llama.cpp/tree/cog-vlm; the code is under examples/cog-vlm. The problem is that the language model's ("deep feature fusion") graph seems to be broken when selecting the correct expert. This is somewhat similar to the MoE implementation. Maybe @ggerganov or someone else can help.
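
For readers wondering where the MoE similarity comes from: the token-type mask partitions the sequence, and each partition has to go through its own weights, which in gather/scatter form looks roughly like the sketch below (pure numpy illustration, not code from the branch above):

```python
import numpy as np

T, D = 8, 16
hidden = np.random.randn(T, D)
token_type = np.array([0, 0, 1, 1, 1, 1, 0, 0])   # 1 = image token, 0 = text token

W_text = np.random.randn(D, D)
W_image = np.random.randn(D, D)

# Gather the rows belonging to each "expert", project them separately,
# then scatter the results back into their original positions.
out = np.empty_like(hidden)
out[token_type == 0] = hidden[token_type == 0] @ W_text
out[token_type == 1] = hidden[token_type == 1] @ W_image
```

The awkward part in a static ggml graph is presumably that this gather/scatter is data-dependent (how many tokens go to each expert changes with every prompt), which is the same plumbing problem the MoE implementation has to solve.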

cmp-nct commented 4 months ago

I had put my attention on the dynamic-LoRA-expert approach that internlm implemented (xcomposer2), which shows very similar results and spatial awareness to CogVLM but is probably an order of magnitude faster. However, I got stuck on tensor differences, potentially reaching into the current CLIP implementation, or just an error in how I am managing the attention LoRA (the attention calculations and permutations in PyTorch differ significantly from llama.cpp). Debugging those differences is super time-intensive. So I got stuck there and am currently looking into other areas.

CogVLM is still interesting IMHO, I'm just doubting its long-term potential given that much smaller networks show similar capabilities. Maybe I'm mistaken, though; I have a limited view of the differences in their output.

mjspeck commented 4 months ago

The performance of CogAgent is what's most interesting. Not sure if LLaVA 1.6 has been tested on similar problems, or if xcomposer2 has either.

mjspeck commented 2 months ago

Just want to say that we would still have a lot of interest in using CogVLM with llama.cpp.

mirek190 commented 2 months ago

+1

daaain commented 2 months ago

There's v2 now: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B

mirek190 commented 2 months ago

wow

xiaoyuediandao commented 2 months ago

```
NotImplementedError: Architecture 'CogVLMForCausalLM' not supported!
```

marksalpeter commented 1 month ago

+1

cmp-nct commented 1 month ago

I think that llava-1.6 is the better one. It is heavyweight compared to 1.5 but lighter than Cog, and with batching optimization it could be almost as fast as llava-1.5. Batching would not be difficult to add to clip.cpp! It's basically ready for it, it just needs some tuning.

One big step missing from our llava-1.6 implementation is the line-based tensor manipulation. The llama.cpp llava-1.6 implementation uses the simpler variant of llava-1.6: because of the lack of 5D tensors I was not able to get that properly implemented, so I had to take a shortcut. That shortcut is noticeable when it comes to OCR, for example.

Someone who is very good with ggml tensors (better than me) could add the line-based manipulation to llava-1.6. Then we could add batching to CLIP to run all llava-1.6 image batches at once instead of sequentially, and we'd have a very high-quality result - surpassing CogVLM IMHO, at much less work than implementing the whole Cog architecture.
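
To illustrate why 5D tensors come up, here is a rough numpy sketch of the kind of line-based tile merge meant above (stitching per-tile patch grids back together row by row so the flattened tokens follow image rows); the shapes are illustrative and this is not code from clip.cpp:

```python
import numpy as np

# Per-tile encoder output: a 2x2 grid of tiles, each a 24x24 patch grid with D features.
th, tw, h, w, D = 2, 2, 24, 24, 16
tiles = np.random.randn(th, tw, h, w, D)        # already a 5D tensor

# "Line-based" merge: interleave patch rows across tiles so the flattened tokens
# read like one large image instead of one tile after another. Needs a 5D permute.
merged = tiles.transpose(0, 2, 1, 3, 4)         # (th, h, tw, w, D)
merged = merged.reshape(th * h, tw * w, D)      # (48, 48, D) image-wide patch grid
tokens = merged.reshape(-1, D)                  # (2304, D) embedding tokens
print(tokens.shape)
```

The shortcut mentioned above presumably avoids this interleaving step, which would explain why fine-grained tasks like OCR suffer most.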

chaoqunxie commented 1 month ago

> I think that llava-1.6 is the better one. It is heavyweight compared to 1.5 but lighter than Cog, and with batching optimization it could be almost as fast as llava-1.5. Batching would not be difficult to add to clip.cpp! It's basically ready for it, it just needs some tuning.
>
> One big step missing from our llava-1.6 implementation is the line-based tensor manipulation. The llama.cpp llava-1.6 implementation uses the simpler variant of llava-1.6: because of the lack of 5D tensors I was not able to get that properly implemented, so I had to take a shortcut. That shortcut is noticeable when it comes to OCR, for example.
>
> Someone who is very good with ggml tensors (better than me) could add the line-based manipulation to llava-1.6. Then we could add batching to CLIP to run all llava-1.6 image batches at once instead of sequentially, and we'd have a very high-quality result - surpassing CogVLM IMHO, at much less work than implementing the whole Cog architecture.

In fact, ollama supports it - see https://ollama.com/library/llava:13b-v1.6-vicuna-q5_K_M

husnoo commented 1 month ago

doesn't ollama use llama.cpp?

geroldmeisinger commented 1 month ago

int4 version: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B-int4

github-actions[bot] commented 2 days ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

chigkim commented 1 day ago

Still not supported.