ggerganov / llama.cpp

LLM inference in C/C++

multimodal - Improve LLaVA model accuracy and performance #3602

Closed: monatis closed this issue 7 months ago

monatis commented 1 year ago

With #3436, llama.cpp has support for LLaVA, a state-of-the-art large multimodal model. It appears that there is still room for improvement in its performance and accuracy, so I'm opening this issue to track progress and gather feedback from the community. I'll continue to work on this, so any feedback is much appreciated.

hexbinoct commented 1 year ago

I have the latest build of the main branch and llava is working (pretty amazing), but it doesn't seem to be using CUDA (even though the release is built with BLAS support and works really well with regular llama.cpp):

    PS F:\ai3\llama.cpp> .\build\bin\Release\llava.exe -m ..\models\llava\ggml-model-q5_k.gguf --mmproj ..\models\llava\mmproj-model-f16.gguf --image 'C:\Users\ab\Pictures\mystuff\1664779479174 - Copy.jpg' --temp 0.1
    ggml_init_cublas: found 1 CUDA devices:
      Device 0: NVIDIA GeForce RTX 3070 Laptop GPU, compute capability 8.6
    ...
    llm_load_tensors: using CUDA for GPU acceleration
    llm_load_tensors: mem required = 4560.96 MB
    llm_load_tensors: offloading 0 repeating layers to GPU
    llm_load_tensors: offloaded 0/35 layers to GPU
    llm_load_tensors: VRAM used: 0.00 MB

That said, it's not like it doesn't use the GPU at all: the GPU shows full activity during what seems to be the image processing, and then goes idle while the text inference is being streamed. I wish the text inference were also on the GPU (like regular llama). Yes, I have tried -ngl and -ngld, but no change.

KerfuffleV2 commented 1 year ago

@hexbinoct Does #3621 fix your issue?

edit: I merged it, so that fix should be in the next release or you can compile it from master yourself to be able to offload immediately.
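With that fix in place, an invocation that offloads all layers might look something like this (a sketch reusing the paths from the log above; the image path is a placeholder, and 35 matches the layer count reported there, so adjust -ngl for your model):

    .\build\bin\Release\llava.exe -m ..\models\llava\ggml-model-q5_k.gguf --mmproj ..\models\llava\mmproj-model-f16.gguf --image photo.jpg --temp 0.1 -ngl 35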

hexbinoct commented 1 year ago

Oh dear, token generation just went from ~2/sec to 40/sec on a laptop NVIDIA GPU. Thanks a lot! I got the updated source from master.

aiaicode commented 1 year ago

Oh dear, token generation just went from ~2/sec to 40/sec on a laptop NVIDIA GPU. Thanks a lot! I got the updated source from master.

@hexbinoct What about image encoding speed? Did it increase as well?

KerfuffleV2 commented 1 year ago

What about image encoding speed? Did it increase as well?

That was already offloaded as far as I know. The performance stayed the same when I tested it.

hexbinoct commented 1 year ago

Oh dear, token generation just went from ~2/sec to 40/sec on a laptop NVIDIA GPU. Thanks a lot! I got the updated source from master.

@hexbinoct What about image encoding speed? Did it increase as well?

It's almost instant, like a second I think. As soon as the last loading message is printed ("total VRAM used ..."), the text inference starts writing itself, meaning the image was already processed. That's how I understand it.

hexbinoct commented 1 year ago

Is there an interactive mode? -i and -ins are not working. I'm thinking of something like keeping the model loaded and then, through some command format, giving it images one by one along with the -p prompt. Right now the app closes itself once it has described the image.

aiaicode commented 1 year ago

Is there an interactive mode? -i and -ins are not working. I'm thinking of something like keeping the model loaded and then, through some command format, giving it images one by one along with the -p prompt. Right now the app closes itself once it has described the image.

See issue #3593.

y10ab1 commented 1 year ago

I'm curious why the q4_k model gives a faster evaluation time but a slower prompt evaluation time, while the f16 model gives me a slower evaluation time but a faster prompt evaluation time.

Can we reduce the prompt evaluation time?

f16:

[Screenshot 2023-10-16 11:50 (f16 timings)]

q4_k:

[Screenshot 2023-10-16 11:51 (q4_k timings)]

ggerganov commented 1 year ago

You can try disabling the MMQ feature: just add -nommq to the command line
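For example, appended to an invocation like the one earlier in the thread (a sketch; the model file names and image path are placeholders, and -nommq is the only relevant addition):

    .\build\bin\Release\llava.exe -m ggml-model-q4_k.gguf --mmproj mmproj-model-f16.gguf --image test.jpg -ngl 35 -nommq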

monatis commented 1 year ago

Today I played a little bit with the 13B model to understand why it performs worse than 7B even in f16. I extracted image features from the original Python implementation and saved them in a .bin file, then read that file in C++ instead of encoding an image. The quality was still bad. I'm not sure yet, but it seems like the LLaMA part might have something to do with it. I'll investigate this further, but I might be busy for some time with the BERT implementation. Until then, any feedback will be helpful for when I return to this.
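For anyone wanting to reproduce this kind of A/B test, a minimal sketch of the C++ side, assuming the features were dumped as raw float32 (the helper name and sizes are illustrative, not the exact ones used):

    // Minimal sketch: load image features dumped from the Python side as raw float32,
    // so they can be injected in place of the output of the C++ image encoder.
    // The sizes are placeholders (e.g. 576 image tokens x the LLM's embedding size).
    #include <cstdio>
    #include <vector>

    static std::vector<float> load_image_features(const char * path, size_t n_pos, size_t n_embd) {
        std::vector<float> feats(n_pos * n_embd);
        FILE * f = std::fopen(path, "rb");
        if (!f) {
            std::fprintf(stderr, "failed to open %s\n", path);
            return {};
        }
        const size_t n_read = std::fread(feats.data(), sizeof(float), feats.size(), f);
        std::fclose(f);
        if (n_read != feats.size()) {
            std::fprintf(stderr, "short read: %zu of %zu floats\n", n_read, feats.size());
            return {};
        }
        return feats;
    }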

ggerganov commented 1 year ago

Where is the source of the 13B LLaMA model? Want to take a look at the config / hparams to see if we have converted everything correctly.

monatis commented 1 year ago

Here it is: https://huggingface.co/liuhaotian/llava-v1.5-13b/blob/main/config.json

rlancemartin commented 1 year ago

What is the scope of work needed to support new multimodal model releases, like Fuyu-8B? Weights are available, but it seems they will need conversion to GGUF. (I can also move this to a new ticket, since I know this one was meant to be focused on LLaVA.)

https://huggingface.co/adept/fuyu-8b

KerfuffleV2 commented 1 year ago

What is the scope of work needed to support new multimodal model releases, like Fuyu-8B?

It's probably going to depend a lot on how closely the new model's architecture and handling match what already exists. Fuyu looks really interesting but also takes a much different approach from the existing LLaVA stuff as far as I can see: you slice the image into patches and feed them to the model like tokens, with rows separated by a special token.
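To make that concrete, a rough illustration of the layout being described (this is not Fuyu's or llama.cpp's actual code; the patch size and separator id are placeholders):

    // Illustrative only: cut an image into P x P patches, emit one entry per patch,
    // and append a special separator after each row of patches (Fuyu feeds patch
    // embeddings rather than ids; ids are used here just to show the layout).
    #include <vector>

    static std::vector<int> layout_image_as_patch_tokens(int img_w, int img_h, int P, int newline_id) {
        std::vector<int> seq;
        int patch_id = 0;
        for (int y = 0; y + P <= img_h; y += P) {
            for (int x = 0; x + P <= img_w; x += P) {
                seq.push_back(patch_id++); // one "token" per image patch in this row
            }
            seq.push_back(newline_id);     // row separator, analogous to a newline token
        }
        return seq;
    }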

monatis commented 1 year ago

It's probably going to depend a lot on how closely the new model's architecture and handling match what already exists.

Agreed. BakLLaVA, for example, can be readily supported with minor modifications to the surgery script (see #3682).

Fuyu looks really interesting but also takes a much different approach

I'm really skeptical of this approach. Several decoder-only multimodal models already exist, but it's hard to beat the image encoder + decoder approach in terms of performance (speed), accuracy, and training / finetuning efficiency. I believe this approach still has a way to go.

KerfuffleV2 commented 1 year ago

What they wrote about it makes it seem like their focus is on simplicity, ease of training, and speed when deployed, more than necessarily outperforming existing approaches in raw ability: https://www.adept.ai/blog/fuyu-8b

The tests they show make it seem like it's basically on par with the typical approach, though.

jxy commented 1 year ago

Is image resizing with linear sampling a significant source of error here? Would using stb_image_resize.h help?
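For reference, a minimal sketch of what that could look like, assuming the v1 stb_image_resize.h header and its simple uint8 API (the function name is illustrative; 336x336 would be the CLIP input resolution used by LLaVA-1.5):

    // Sketch only: resize a tightly packed RGB image with stb_image_resize.h
    // instead of a hand-rolled bilinear resize.
    #define STB_IMAGE_RESIZE_IMPLEMENTATION
    #include "stb_image_resize.h"

    #include <vector>

    static std::vector<unsigned char> resize_rgb(const unsigned char * rgb, int w, int h, int target) {
        std::vector<unsigned char> out((size_t) target * target * 3);
        // stride 0 = tightly packed rows, 3 = RGB channels, default filter
        stbir_resize_uint8(rgb, w, h, 0, out.data(), target, target, 0, 3);
        return out;
    }
    // e.g. resize_rgb(pixels, w, h, 336) for a 336x336 CLIP input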

z3ugma commented 1 year ago

I'd like to mix and match --mmproj files (the CLIP part) with different -m LLM models, to experiment:

Manticore-13B.ggmlv3.q4_1.gguf
Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.gguf
WizardLM-13B-Uncensored.Q4_0.gguf

^^ For example, these are models for writing stories that are finetuned to give very long answers. If the mmproj / CLIP model can read in the image and these models can then be used for the prompting, that's what I'd like to experiment with.

Unfortunately a lot of these models have different embedding dimensions:

    main: embedding dim of the multimodal projector (4096) is not equal to that of LLaMA (5120). Make sure that you use the correct mmproj file.
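
Roughly what that check enforces (a sketch, not the actual llava code): the projector's output dimension has to match the LLM's embedding size, e.g. 4096 for a 7B LLaMA versus 5120 for a 13B.

    // Sketch of the constraint: projected image features are consumed directly as
    // token embeddings, so the projector output size must equal the LLM's n_embd.
    static bool mmproj_compatible(int n_mmproj_embd, int n_llm_embd) {
        return n_mmproj_embd == n_llm_embd; // 4096 (7B) vs 5120 (13B) fails here
    }
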
Green-Sky commented 1 year ago

Unfortunately a lot of these models have different embedding dimensions:

Well, you can always go and train a projection matrix yourself, for the model of choice.

But if the same base model was used for the finetunes, they should have the same embedding dimensions...

monatis commented 1 year ago

@z3ugma Try the mmproj file of the 13B variant of LLaVA from here.

I'm not sure about the accuracy / performance of this method, although some community members have reported good results from mixing the mmproj with regular Vicuna models.

z3ugma commented 1 year ago

@monatis That worked: the 13B models have 5120-dimensional embeddings, matching the 13B CLIP .mmproj file you linked.

monatis commented 1 year ago

Great! I'd be interested in hearing your impressions of the generation quality you get with the models you cited.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.