I have the latest build of the main branch. LLaVA is working (pretty amazing), but it doesn't seem to be using CUDA (while the release is built with BLAS support and works really well with llama.cpp):
PS F:\ai3\llama.cpp> .\build\bin\Release\llava.exe -m ..\models\llava\ggml-model-q5_k.gguf --mmproj ..\models\llava\mmproj-model-f16.gguf --image 'C:\Users\ab\Pictures\mystuff\1664779479174 - Copy.jpg' --temp 0.1
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Laptop GPU, compute capability 8.6
...
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 4560.96 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
But it's not that it doesn't use the GPU at all: the GPU shows full activity while it seems to be processing the image, and then goes idle while the text inference is being streamed. I wish the text inference also ran on the GPU (like normal llama). Yes, I have tried -ngl and -ngld, but no changes.
@hexbinoct Does #3621 fix your issue?
Edit: I merged it, so that fix should be in the next release, or you can compile it from master yourself to offload immediately.
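For example, once you're on a build with that fix, an invocation like this should offload all 35 layers (paths reused from the log above; the -ngl value is just an example, lower it if it doesn't fit in your VRAM):

```
PS F:\ai3\llama.cpp> .\build\bin\Release\llava.exe -m ..\models\llava\ggml-model-q5_k.gguf --mmproj ..\models\llava\mmproj-model-f16.gguf --image 'C:\Users\ab\Pictures\mystuff\1664779479174 - Copy.jpg' --temp 0.1 -ngl 35
```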
Oh dear, token generation just went from ~2/sec to 40/sec on a laptop NVIDIA GPU. Thanks a lot! I got the updated source from master.
> Oh dear, token generation just went from ~2/sec to 40/sec on a laptop NVIDIA GPU. Thanks a lot! I got the updated source from master.
@hexbinoct What about image encoding speed? Did it increase as well?
> What about image encoding speed? Did it increase as well?
That was already offloaded as far as I know. The performance stayed the same when I tested it.
> Oh dear, token generation just went from ~2/sec to 40/sec on a laptop NVIDIA GPU. Thanks a lot! I got the updated source from master.
>
> What about image encoding speed? Did it increase as well?
It's almost instant, like a second I think. As soon as the last loading message ("total VRAM used ...") is printed, the text inference starts writing itself, meaning the image has already been processed. That's my understanding of it.
Is there an interactive mode? -i and -ins are not working. I'm thinking of something like keeping the model loaded and then, through some command format, giving it images one by one along with the -p prompt. Right now the app closes itself once it has described the image.
I am curious why the q4_k model gives faster evaluation time but slower prompt evaluation time, while the f16 model gives me slower evaluation time but faster prompt evaluation time.
Can we reduce the prompt evaluation time?
f16:
q4_k:
You can try disabling the MMQ feature: just add -nommq to the command line.
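For example, appended to a command like the ones above (the paths and the -ngl value here are placeholders; -nommq routes prompt processing through cuBLAS instead of the custom mul_mat_q kernels, which may help prompt evaluation speed for quantized models on some GPUs, at the cost of a bit more VRAM):

```
PS F:\ai3\llama.cpp> .\build\bin\Release\llava.exe -m path\to\ggml-model-q4_k.gguf --mmproj path\to\mmproj-model-f16.gguf --image path\to\image.jpg -ngl 35 -nommq
```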
Today I played a little bit with the 13B model to understand why it performs worse than 7B even in f16. I extracted image features from the original Python implementation and saved them in a .bin file, then read that file in C++ instead of encoding an image. The quality was still bad. I'm not sure yet, but it seems like the LLaMA part might have something to do with it. I'll investigate this further, but I might be busy for some time with the BERT implementation. Until then, any feedback would be helpful for when I return to this.
Where is the source of the 13B LLaMA model? Want to take a look at the config / hparams to see if we have converted everything correctly.
What is the scope of work needed to support new multimodal model releases, like Fuyu-8B? Weights are available, but it seems they will need conversion to GGUF. (I can also move this to a new ticket, since I know this was meant to be focused on LLaVA.)
> What is the scope of work needed to support new multimodal model releases, like Fuyu-8B?
It's probably going to depend a lot on how closely the new model's architecture and handling match what already exists. Fuyu looks really interesting, but it also takes a much different approach from the existing LLaVA stuff as far as I can see: you slice the image into chunks and feed them to it like tokens, with rows separated by a special token.
> It's probably going to depend a lot on how closely the new model's architecture and handling match what already exists.
Agreed, BakLLaVA can be readily supported with minor modifications to the surgery script, for example (see #3682).
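For reference, the LLaVA-to-GGUF flow looks roughly like this (a sketch following the llava example's instructions; the paths are illustrative and the exact steps and flags may need tweaks for BakLLaVA, so double-check against the llava README):

```
# split the multimodal projector out of the original checkpoint
python ./examples/llava/llava-surgery.py -m ../llava-v1.5-13b
# convert the CLIP encoder plus projector to a GGUF mmproj file
python ./examples/llava/convert-image-encoder-to-gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-13b/llava.projector --output-dir ../llava-v1.5-13b
# convert the LLaMA part as usual
python ./convert.py ../llava-v1.5-13b
```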
> Fuyu looks really interesting but also takes a much different approach
I'm really suspicious of this approach. There are already several decoder-only multimodal models out there, but it's hard to beat the image encoder + decoder approach in terms of performance (speed), accuracy, and training / finetuning efficiency. I believe this approach still has a way to go.
What they wrote about it makes it seem like their focus is on simplicity, ease of training, and speed when deployed, more than necessarily outperforming existing approaches in raw ability. https://www.adept.ai/blog/fuyu-8b
The tests they have make it seem like it's basically on par with the typical approach, though.
Is image resizing with linear sampling a significant source of error here? Would using stb_image_resize.h help?
I'd like to mix and match -mmproj files (the CLIP) with different -m LLM models, to experiment:
- Manticore-13B.ggmlv3.q4_1.gguf
- Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.gguf
- WizardLM-13B-Uncensored.Q4_0.gguf
The models above, for example, are finetuned for writing stories and giving very long answers. If the mmproj model / CLIP can read in the image and these models can then be used for prompting, that's what I'd like to experiment with.
Unfortunately a lot of these models have different embedding dimensions:
main: embedding dim of the multimodal projector (4096) is not equal to that of LLaMA (5120). Make sure that you use the correct mmproj file.
> Unfortunately a lot of these models have different embedding dimensions:
Well, you can always go and train a projection matrix yourself for the model of your choice.
But if the same base model was used for the finetunes, they should have the same embedding dimensions...
@z3ugma Try the mmproj file of the 13B variant of LLaVA from here.
I'm not sure about the accuracy / performance of this method, although some community members have reported good results from mixing the mmproj with regular Vicuna models.
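Something like this, for example (a sketch; the LLM filename is one of the ones you listed, and the path/filename for the 13B mmproj file is illustrative):

```
PS F:\ai3\llama.cpp> .\build\bin\Release\llava.exe -m ..\models\WizardLM-13B-Uncensored.Q4_0.gguf --mmproj ..\models\llava-13b\mmproj-model-f16.gguf --image path\to\image.jpg --temp 0.1
```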
@monatis that worked. The 13B models had the 5120-dimensional embeddings, matching the 13B CLIP .mmproj file you linked.
Great! I'd be interested in hearing your impressions of the quality of generation you get with the models you cited.
This issue was closed because it has been inactive for 14 days since being marked as stale.
With #3436, llama.cpp has support for LLaVA, a state-of-the-art large multimodal model. It appears that there is still room for improvement in its performance and accuracy, so I'm opening this issue to track progress and get feedback from the community. I'll continue to work on this, so any feedback is much appreciated.