AI on GKE is a collection of examples, best practices, and prebuilt solutions to help build, deploy, and scale AI platforms on Google Kubernetes Engine.
This change adds quantization support so that techniques such as eetq and bitsandbytes can be used when deploying models with TGI. It also makes it possible to benchmark models like Llama 3 405B using FP8.
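For context, TGI exposes quantization through the `--quantize` flag on `text-generation-launcher`. A minimal sketch of what enabling it at deploy time looks like (the model id, port, and chosen method here are illustrative placeholders, not values taken from this change):

```shell
# Illustrative TGI launch with quantization enabled.
# --quantize accepts methods such as eetq, bitsandbytes, and fp8
# (availability depends on the TGI version and GPU hardware).
text-generation-launcher \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --quantize eetq \
  --port 8080
```

In a Kubernetes deployment, the same flag would typically be passed through the container's args or an environment-driven launch script, so the quantization method can be switched per benchmark run without rebuilding the image.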