In the previous implementation, the `encode_image` methods of the SPHINX series models returned FP32 features. This promoted the concatenated (image+text) embeddings to FP32, adding a few GBs of memory overhead and making the 4-bit quantized SPHINX OOM on 24GB GPUs.
With the dtype explicitly cast back, the models now run fine on 24GB GPUs (NF4 quantized) at the longest sequence length (4k).
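A minimal sketch of the promotion and the fix (tensor shapes and dtypes here are illustrative, not SPHINX's actual dimensions): `torch.cat` type-promotes mixed inputs to the wider dtype, so FP32 image features drag the whole concatenated sequence up to FP32 unless they are cast back first.

```python
import torch

# FP32 image features (as the old encode_image returned) alongside
# half-precision text embeddings: concatenation promotes the result to FP32.
img = torch.randn(1, 2, 4, dtype=torch.float32)
txt = torch.randn(1, 2, 4, dtype=torch.bfloat16)

promoted = torch.cat([img, txt], dim=1)
assert promoted.dtype == torch.float32  # FP32 overhead grows with sequence length

# Casting the image features back first keeps the embeddings in half precision.
fixed = torch.cat([img.to(txt.dtype), txt], dim=1)
assert fixed.dtype == torch.bfloat16
```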
Also pinning gradio to 3.x for now, as gradio 4.x seems to introduce compatibility issues. Will bump the version in the future after some testing.