Vision-CAIR / MiniGPT-4

Open-sourced codes for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
BSD 3-Clause "New" or "Revised" License

difficulties involved with inference on a consumer GPU #4

Open 152334H opened 1 year ago

152334H commented 1 year ago

Here are the problems I've found today:

  1. Vicuna is loaded in fp16. This is a problem for obvious reasons: 13B parameters × 2 bytes ≈ 26 GB, which is more VRAM than any consumer GPU has.
  2. The beam search default of 5 beams consumes a lot of VRAM during generation.

To address these problems, I have created a fork where:
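(As a rough illustration of the kind of change involved -- not the fork's actual code -- here is a hedged sketch of loading Vicuna in 8-bit and generating with a single beam, assuming a HuggingFace Transformers checkpoint; the model path is a placeholder.)

```python
# Hedged sketch: load Vicuna in 8-bit (bitsandbytes) and generate with num_beams=1.
# The model path is a placeholder for your locally converted Vicuna-13B-v0 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/vicuna-13b-v0"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,        # int8 weights: roughly 13 GB instead of ~26 GB in fp16
    device_map="auto",        # requires accelerate; keeps layers on GPU, spills if needed
    torch_dtype=torch.float16,
)

inputs = tokenizer("Describe this image:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, num_beams=1)  # beam width 1 to limit VRAM
print(tokenizer.decode(out[0], skip_special_tokens=True))
```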

152334H commented 1 year ago

Note: I have discovered that it is still possible to OOM on a 3090 after repeated inference. I may be hitting a memory leak somewhere.
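(A hedged workaround for the repeated-inference OOM, assuming lingering references to generation tensors are the cause; this is a general PyTorch pattern, not a confirmed fix for this code base.)

```python
# Hedged sketch: release GPU memory between generations to limit growth over time.
import gc
import torch

def generate_once(model, tokenizer, prompt, **gen_kwargs):
    """Run one generation and release cached GPU memory afterwards."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():                 # avoid keeping autograd graphs alive
        out = model.generate(**inputs, **gen_kwargs)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    del inputs, out                       # drop references to GPU tensors
    gc.collect()
    torch.cuda.empty_cache()              # return cached blocks to the allocator
    return text
```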

TsuTikgiau commented 1 year ago

Thanks for your help on this! We have now included your updated code. For running on a consumer GPU, we plan to train a Vicuna-7B version of our method within two days. I think this should reduce the GPU memory pressure dramatically.

Snowad14 commented 1 year ago

I have not been able to try it on the demo site or without the optimizations, but I am getting very... peculiar results.

(Screenshots attached: Capture d’écran 2023-04-17 180810, Capture d’écran 2023-04-17 181121)

152334H commented 1 year ago

@Snowad14 Complete guess -- you might have loaded the wrong LLaMA weights. Are you sure you are using Vicuna-13B-v0?

Snowad14 commented 1 year ago

I was too lazy to download the LLaMA weights and run the conversion myself, so I downloaded what seems to be an already-converted copy of the old version (https://huggingface.co/Ejafa/vicuna_13B_vanilla), but those may be the wrong weights. I'll try to convert them myself.

cibernicola commented 1 year ago

So, is there a way to load Vicuna in 8-bit?

I mean, will it work on a 3090 under Windows? :D

yhyu13 commented 1 year ago

@TsuTikgiau It would be even better to use llama.cpp with the GGML format, which compresses a 7B model to roughly 4 GB and a 13B model to roughly 7 GB. Plus, all of them run on the CPU with vectorized instructions, with performance comparable to running on a 3090 (given its nerfed AI inference compute capacity).

https://huggingface.co/eachadea/legacy-ggml-vicuna-7b-4bit

There is already a popular project built on CPU inference that works like a charm: https://github.com/nomic-ai/gpt4all-ui
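(For reference, a minimal sketch of CPU inference with a 4-bit GGML Vicuna checkpoint via the llama-cpp-python bindings; the model filename is illustrative, and this path is separate from MiniGPT-4's own vision pipeline.)

```python
# Hedged sketch: CPU-only inference on a 4-bit GGML checkpoint via llama-cpp-python.
# The filename below is a placeholder for whichever quantized Vicuna file you downloaded.
from llama_cpp import Llama

llm = Llama(model_path="ggml-vicuna-7b-4bit.bin", n_ctx=2048, n_threads=8)
result = llm("### Human: What is MiniGPT-4?\n### Assistant:", max_tokens=128)
print(result["choices"][0]["text"])
```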

TsuTikgiau commented 1 year ago

@cibernicola The default setting of the demo now loads Vicuna in 8-bit and uses less than 24 GB of GPU memory if you keep the beam width at 1.
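(A quick back-of-the-envelope check of why this fits on a 24 GB card, counting weights only; activations, the KV cache, and the vision side add a few more GB on top.)

```python
# Weights-only memory estimate for Vicuna-13B at different precisions.
params = 13e9
print(f"fp16: {params * 2 / 1e9:.0f} GB")  # ~26 GB, over a 24 GB RTX 3090
print(f"int8: {params * 1 / 1e9:.0f} GB")  # ~13 GB, leaves headroom for generation
```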

TsuTikgiau commented 1 year ago

@yhyu13 Thank you for referring this to us! We will consider it when we have time in the coming days :D

zxcvbn114514 commented 1 year ago

How do I put ViT-L back on the GPU? The model is running extremely slowly now, while VRAM usage is only 17 GB out of 24 GB on my 3090.

152334H commented 1 year ago

That cannot be the problem. ViT-L only needs a single forward pass per image; 99% of the work should happen in Vicuna's autoregressive decoding loop.
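(A hedged way to check where each component actually ended up; the attribute names visual_encoder, Qformer, and llama_model are assumed from the MiniGPT-4 code base and may differ in your checkout.)

```python
# Hedged sketch: report the device of each major MiniGPT-4 component.
# Attribute names are assumptions; verify them against your local code.
def report_devices(model):
    for name in ("visual_encoder", "Qformer", "llama_model"):
        module = getattr(model, name, None)
        if module is None:
            print(f"{name}: not found on this model")
            continue
        device = next(module.parameters()).device
        print(f"{name}: {device}")

# If the vision tower was offloaded to save memory, moving it back is one line
# (again, assuming the attribute name):
# model.visual_encoder = model.visual_encoder.to("cuda").half()
```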

zxcvbn114514 commented 1 year ago

> That cannot be the problem. ViT-L only needs a single forward pass per image; 99% of the work should happen in Vicuna's autoregressive decoding loop.

(screenshot attached) So is it normal for the 3090 to draw only about 80 watts while the model is running?