Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Fix SPHINX inference memory with image input #116

Closed linziyi96 closed 7 months ago

linziyi96 commented 7 months ago

In the previous implementation, the `encode_image` methods of the SPHINX-series models returned FP32 features. This promotes the concatenated (image + text) embeddings to FP32 and adds a few GB of memory overhead, which in turn makes the 4-bit quantized SPHINX OOM on 24 GB GPUs.

With the dtype explicitly cast back, the models now run fine on 24 GB GPUs (NF4-quantized) at the longest sequence length (4k).
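For context, a minimal sketch of the kind of fix described above: casting the vision encoder's FP32 output back to the LLM's working dtype before it is concatenated with the text embeddings. The function and tensor names here are illustrative, not the repo's actual API.

```python
import torch

def cast_image_features(image_features: torch.Tensor,
                        llm_dtype: torch.dtype) -> torch.Tensor:
    """Cast FP32 vision features down to the LLM's dtype (e.g. FP16)."""
    return image_features.to(llm_dtype)

# Illustrative shapes: FP16 text embeddings, FP32 vision features.
text_emb = torch.zeros(1, 8, 16, dtype=torch.half)
img_feat = torch.zeros(1, 4, 16, dtype=torch.float32)

# Without the cast, type promotion makes the whole sequence FP32,
# so every downstream activation pays the doubled memory cost:
promoted = torch.cat([img_feat, text_emb], dim=1)
print(promoted.dtype)  # torch.float32

# With the cast, the concatenated embeddings stay in the LLM's dtype:
fixed = torch.cat([cast_image_features(img_feat, text_emb.dtype), text_emb],
                  dim=1)
print(fixed.dtype)  # torch.float16
```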


Also pinning gradio to 3.x for now, as gradio 4.x seems to introduce compatibility issues. Will bump the version in the future after some testing.
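A sketch of what such a pin might look like in a requirements file (the exact file and version bounds used by this repo may differ):

```
gradio>=3.0,<4.0
```

An upper bound like this keeps 3.x patch releases flowing while blocking the 4.x API changes until they have been tested.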