In the previous implementation, the `encode_image` methods of the SPHINX series models returned FP32 features. This promoted the concatenated (image+text) embeddings to FP32, adding a few GBs of memory overhead and making the 4-bit quantized SPHINX OOM on 24GB GPUs.
With the dtype explicitly cast back, the models now run fine on 24GB GPUs (NF4 quantized) at the longest sequence length (4k).
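A minimal sketch of the promotion and the fix (tensor shapes and dtypes here are illustrative, not SPHINX's actual dimensions): `torch.cat` type-promotes mixed inputs to the wider dtype, so FP32 image features drag the whole concatenated sequence up to FP32 unless they are cast back first.

```python
import torch

# FP32 image features (as the old encode_image returned) alongside
# half-precision text embeddings: concatenation promotes the result to FP32.
img = torch.randn(1, 2, 4, dtype=torch.float32)
txt = torch.randn(1, 2, 4, dtype=torch.bfloat16)

promoted = torch.cat([img, txt], dim=1)
assert promoted.dtype == torch.float32  # FP32 overhead grows with sequence length

# Casting the image features back first keeps the embeddings in half precision.
fixed = torch.cat([img.to(txt.dtype), txt], dim=1)
assert fixed.dtype == torch.bfloat16
```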
Also pinning gradio to 3.x for now, as gradio 4.x seems to introduce compatibility issues. Will bump the version in the future after some testing.