Closed by cmp-nct 4 months ago
I've known about that model for some time and also mentioned it here, but NO ONE CARES... why, I don't know...
+1 CogVLM is the best open source vision model currently available. Having a super powerful multi-modal LLM that's easy to run locally is a game changer.
I know that Ollama is looking to add CogVLM support, but they need llama.cpp to support it first.
+1 CogVLM/CogAgent is amazing at mobile UI detection and UI object detection.
It's better than GPT-4V.
+1 we need it ! asap
MobileVLM might be even better
There is no online demo of it. It uses the same vision encoder and a similar projection as llava, just a tiny LLM instead of the 7B Vicuna. It is trained on llava-1.5 data (so it lacks a lot of the ShareGPT4V knowledge), but claiming that tiny thing is at eye level with GPT-4V needs more evidence than five words. The benchmarks do not support it.
I didn't test it on llama.cpp, but my guess is that it requires minimal changes to get the language model supported - the projection has small changes as well (normalization). Regarding support: the authors actually already had it working in llama.cpp according to the paper (they mention using Q4 on the LLM) but didn't release the changes as a fork or PR, for some reason?
I'm not saying it isn't what you claim - just from what I've seen at first glance I find it highly unlikely. It would be a huge development, showcasing what the small CLIP can do despite everyone else not being able to do the same.
I believe MobileVLM is worthy of support; it's tiny and appears to be a little bit worse than llava-1.5 but of course much faster. That shouldn't distract from CogVLM being the best open source one.
cogvlm is far better than llava - llava already works in most places - so please let's stick with cogvlm if anyone embarks on that. It takes about 80 GB of VRAM here in fp16, and bnb isn't cutting it.
pong - how does this not get any traction?! The main example given is worthless as that's simple OCR - but cogvlm is so much better than llava on every vertical.
I started looking into it, but I have a lot on my schedule currently.
@darkacorn, I'd like to test it, what's your branch called?
https://github.com/THUDM/CogVLM - no branch, and I'm not affiliated with them either.
I started looking into it, but I have a lot on my schedule currently.
Understandable - I talked to turboderp (exllama) and casper (autoawq) too... apparently it's quite a bit of work to get a quant / inference going outside of the regular transformer arch.
Yep, that's also my feeling. To make the deep feature fusion work you have to provide an additional mask as input. That's quite different from the usual stuff.
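To make the mask idea concrete, here's a minimal, hypothetical PyTorch-style sketch (module name, shapes, and the exact weights are illustrative, not CogVLM's actual code): each token carries a type flag, and image-token positions get routed through a separate "visual expert" weight set while text tokens keep the original language-model weights. The mask is an extra model input, which is what makes the graph different from a plain llama-style decoder.

```python
import torch
import torch.nn as nn

class DualExpertLinear(nn.Module):
    """Illustrative only: one projection with separate weights for text
    tokens and image tokens, selected by a per-token type mask."""
    def __init__(self, d_model: int):
        super().__init__()
        self.text_proj = nn.Linear(d_model, d_model)    # original LM weights
        self.vision_proj = nn.Linear(d_model, d_model)  # "visual expert" weights

    def forward(self, x: torch.Tensor, token_type: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); token_type: (batch, seq), 1 = image token
        text_out = self.text_proj(x)
        vision_out = self.vision_proj(x)
        mask = token_type.unsqueeze(-1).bool()           # (batch, seq, 1)
        return torch.where(mask, vision_out, text_out)

# Toy usage: 4 image tokens followed by 4 text tokens.
x = torch.randn(1, 8, 64)
token_type = torch.tensor([[1, 1, 1, 1, 0, 0, 0, 0]])
layer = DualExpertLinear(64)
print(layer(x, token_type).shape)  # torch.Size([1, 8, 64])
```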
I'd love to see Cogvlm support as well.
You are all (including me) welcome to contribute either code or money for a coffee for these hard-working individuals making this stuff work for us.
I'm also waiting for it ....
We all wait - we need more eyeballs on this "feature request"... sadly most people don't seem to care enough about vision yet.
Information from turbo (exllama): getting a rough version done is about 50h of work initially, plus of course the upkeep of it - but given the little demand it has, it seems to be a wasted effort.
we really just need 1 quant .. and then we can adapt it pretty quickly to everything else
What do you mean it seems to be a wasted effort?
"given the litte demand is has - it seems to be a wasted effort" i dont know how i could be clearer in that statement - if more people would be interrested in vision that would turn faster .. but apparently most just focus on regular llm's / multimodality does sadly not have a huge demand
Alright, I wasn't trying to diminish your text, but thanks for explaining it; I did not realize that.
Trust me, I would love for it to be quantized too - it would make my life easier... 36 GB per fp16 model, and you eventually want all 3 in VRAM, which just blocks up my resources. I would love to have it smaller and faster - but unless a few experts chip in and start... it's just not the most rewarding work for them, as very few people want it even though cogvlm is the best vision model we've got.
https://github.com/THUDM/CogVLM/discussions/346
Let's see, maybe they chip in and get the ball rolling.
I also don't understand why there is so little interest in Cogvlm, because it is far better than llava, which is still in development....
After working on it for a bit I found that it is not trivial to convert it to llama.cpp. The implementation of EVA-CLIP is different from the OpenAI CLIP model. There are some subtleties I'm trying to wrap my head around, so progress is relatively slow, but interest is there...
@dtiarks if you are up for it, hop on Discord - we are all on TheBloke AI's Discord (the link should be on any Hugging Face repo he has; I don't want to spam it here).
thanks for narrowing the problem set down at least a bit
I'm sure turboderp / casper can help narrow those "subtleties" down even further.
This would be a game changer, since CogVLM is so much better than llava. Using llava after seeing what CogVLM can do feels like asking llama 7B for code after using gpt 4.
I personally have changed my mind. CogVLM is a huge thing, but no one really wanted to invest the work to integrate it. Now we have xcomposer2, which is almost as fast as llava-1.5, higher resolution than CogVLM, and quite possibly better as well. https://github.com/ggerganov/llama.cpp/pull/5232
A good part of the work is done, though my time is limited for a while and the lora integration is not done yet.
If someone is up for it, I'd like to propose a bounty for getting CogVLM and/or LLaVa 1.6 working and merged. If interested, contact me at claude @ the domain in my profile.
https://github.com/ggerganov/llama.cpp/issues/5266
@cmp-nct is already working on Llava 1.6 in https://github.com/ggerganov/llama.cpp/pull/5267
Yes, but it seems like the author can't work on it and/or has other priorities
For me it's a small side project; I have dozens of large (commercial) projects I am working on. I really want to see full llava-1.6 (as well as cogvlm and xcomposer2) support. llava-1.6 is the best candidate, followed by xcomposer2, then cogvlm.
A bounty is intriguing; ironically I once tried the same on Fiverr to advance this project and not one of the "AI developers" there was actually able to contribute anything. Though the requirement to have it merged is maybe too much, as you only have limited influence over actually getting something merged, even if a PR is fully functional.
So I've not given it up, I just have slow progress atm. Also happy to add a collaborator into my PR-branch of course.
One pesky bug is remaining but it's working quite great already, especially the large model. https://github.com/ggerganov/llama.cpp/pull/5267
You'll need to re-create the projector gguf files, you can keep the llm gguf files. For the projector you need to add the variables into the config.json (as described), otherwise it will be detected as llava-1.5. I've uploaded a config.json to my HF, uploading the projectors as well.
You'll notice llava-1.6 is working when it needs a ton of embedding tokens: llava-1.5 has 576 tokens, llava-1.6 up to 5 times that.
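For reference, a rough back-of-the-envelope of where those numbers come from (assuming the usual 336 px CLIP input with 14 px patches and an up-to-2x2 tile grid; the exact tiling depends on the image):

```python
# llava-1.5: one 336x336 image through a ViT with 14px patches
patches_per_side = 336 // 14           # 24
llava15_tokens = patches_per_side ** 2 # 576 embedding tokens

# llava-1.6 (anyres): a base image plus up to a 2x2 grid of tiles,
# each encoded like a full 336x336 image -> up to 5x the tokens
max_tiles = 1 + 2 * 2
llava16_tokens = max_tiles * llava15_tokens  # 2880

print(llava15_tokens, llava16_tokens)  # 576 2880
```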
Update: the PR is ready for use and the tensor bug is handled, but it's not merged into main llama.cpp yet, so you need to manually check out the branch/PR.
Now that LLaVA 1.6 has been added, is there no longer much interest in adding CogVLM?
@mjspeck I started implementing it and got pretty far. However I got stuck at a point where I need some input from experts like @ggerganov
There is a branch at https://github.com/dtiarks/llama.cpp/tree/cog-vlm The code is under examples/cog-vlm. The problem is that the language model's ("deep feature fusion") graph seems to be broken when selecting the correct expert. This is somewhat similar to the MoE implementation. Maybe @ggerganov or someone else can help.
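To illustrate what "selecting the correct expert" means for a static graph like the one llama.cpp builds, here is only a sketch of the idea in plain Python/NumPy with toy shapes, not the actual ggml code: instead of branching per token, both expert projections can be computed and blended with a 0/1 mask, which maps onto plain mat-muls and elementwise mul/add in the graph.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4)).astype(np.float32)          # 6 tokens, hidden size 4

w_text = rng.standard_normal((4, 4)).astype(np.float32)     # LM expert weights
w_vision = rng.standard_normal((4, 4)).astype(np.float32)   # visual expert weights

# 1 where the token is an image token, 0 where it is text.
vision_mask = np.array([1, 1, 1, 0, 0, 0], dtype=np.float32)[:, None]

# Compute both experts for every token, then blend with the mask.
# This avoids any data-dependent branching inside the graph, at the
# cost of running both projections.
y = vision_mask * (x @ w_vision) + (1.0 - vision_mask) * (x @ w_text)
print(y.shape)  # (6, 4)
```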
I had put my attention on the dynamic-lora-expert model internlm implemented (xcomposer2), which shows very similar results and spatial awareness to cogvlm but is probably an order of magnitude faster. However, I got stuck on tensor differences, potentially reaching into the current CLIP implementation, or it is just an error in how I am managing the attention lora (the attention calculations and permutations in PyTorch differ significantly from llama.cpp). Debugging those differences is super time-intensive, so I got stuck there and am currently looking into other areas.
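One generic way to make that kind of debugging less painful (just a workflow sketch, not what anyone in this thread is actually using; module and function names are made up) is to hook the PyTorch reference and record every sub-module's output, then diff those against tensors dumped from the llama.cpp side at the same points:

```python
import torch

def capture_activations(model: torch.nn.Module) -> dict:
    """Register forward hooks that record each sub-module's output.
    In practice `model` would be the reference vision tower or LLM block."""
    acts = {}
    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor):
                acts[name] = output.detach().float().cpu()
        return hook
    for name, sub in model.named_modules():
        if name:  # skip the root module itself
            sub.register_forward_hook(make_hook(name))
    return acts

def report_diff(name: str, reference: torch.Tensor, ported: torch.Tensor, atol: float = 1e-3):
    """Compare one reference activation with the same tensor dumped from the port."""
    diff = (reference.flatten() - ported.flatten()).abs().max().item()
    print(f"{'OK  ' if diff <= atol else 'DIFF'} {name}: max abs diff {diff:.6f}")

# Toy demonstration with a stand-in module; the "ported" tensors would
# really be loaded from files written by the C++ side.
ref = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.GELU())
acts = capture_activations(ref)
_ = ref(torch.randn(2, 8))
for name, t in acts.items():
    report_diff(name, t, t + 1e-5 * torch.randn_like(t))
```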
cogVLM is still interesting imho, I'm just doubting the long-term potential given that much smaller networks show similar capabilities. Maybe I'm mistaken though; I have a limited view of the difference in their output.
The performance of CogAgent is what's most interesting. Not sure if LLaVA 1.6 has been tested on similar problems, or if xcomposer2 has either.
Just wanna say we still would have a lot of interest in using CogVLM on llama.cpp
+1
There's v2 now: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B
wow
NotImplementedError: Architecture 'CogVLMForCausalLM' not supported!
+1
I think that llava-1.6 is the better one; it is heavyweight compared to 1.5 but lighter than cog, and with batching optimization it could be almost as fast as llava-1.5. Batching would not be difficult to add into clip.cpp! It's basically ready for it, it just needs some tuning.
One big step missing from our llava-1.6 implementation is the line-based tensor manipulation. The llama.cpp llava-1.6 implementation uses the simpler variant of llava-1.6: because of the lack of 5d tensors I was not able to get that properly implemented, so I had to take a shortcut. That shortcut is noticeable when it comes to OCR, for example.
Someone who is very good with ggml tensors (better than me) could add the line-based manipulation into llava-1.6. Then we could add batching into CLIP to run all llava-1.6 image batches at once instead of sequentially, and we'd have a very high quality result, surpassing cogvlm imho. At much less work than implementing the whole cog architecture.
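To make both ideas concrete, here is a rough sketch only, with assumed shapes (336 px tiles, 24x24 patch grids, a 2x2 tile grid) and a made-up `encode` stand-in for the CLIP tower; it is not the clip.cpp code. The tiles of one image are stacked into a single batch for the encoder, and the "line based" part is reassembling the per-tile patch features into one big spatial grid and appending a newline embedding at the end of every row before flattening back to a token sequence.

```python
import torch

def encode(tiles: torch.Tensor) -> torch.Tensor:
    """Stand-in for the CLIP vision tower: (n_tiles, 3, 336, 336) ->
    (n_tiles, 24, 24, d). A real encoder would run all tiles as one batch."""
    return torch.randn(tiles.shape[0], 24, 24, 64)

def assemble_grid(features: torch.Tensor, rows: int, cols: int,
                  newline: torch.Tensor) -> torch.Tensor:
    """Rearrange per-tile features into one (rows*24) x (cols*24) grid and
    append a newline embedding after each grid row, then flatten to tokens."""
    n, ph, pw, d = features.shape
    assert n == rows * cols
    grid = features.reshape(rows, cols, ph, pw, d)        # the "5d tensor" step
    grid = grid.permute(0, 2, 1, 3, 4).reshape(rows * ph, cols * pw, d)
    nl = newline.expand(rows * ph, 1, d)                  # one newline per line
    return torch.cat([grid, nl], dim=1).reshape(-1, d)

# Example: a 2x2 tile grid (the base overview image would be handled separately).
tiles = torch.randn(4, 3, 336, 336)
feats = encode(tiles)                                      # batched, not sequential
tokens = assemble_grid(feats, rows=2, cols=2, newline=torch.randn(1, 1, 64))
print(tokens.shape)  # torch.Size([2352, 64]) = 48 * (48 + 1) tokens
```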
In fact, ollama supports it, see https://ollama.com/library/llava:13b-v1.6-vicuna-q5_K_M
doesn't ollama use llama.cpp?
This issue was closed because it has been inactive for 14 days since being marked as stale.
Still not supported.
Discussed in https://github.com/ggerganov/llama.cpp/discussions/4350