Closed: philschmid closed this issue 2 months ago
New models: Gemma 2
Multi-LoRA adapters: you can now run multiple LoRAs on the same TGI deployment. https://github.com/huggingface/text-generation-inference/pull/2010
Faster GPTQ inference and Marlin support (up to 2x speedup).
Reworked the entire scheduling logic (better block allocation, enabling further speedups in upcoming releases).
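For reference, a minimal sketch of how multi-LoRA serving is typically used with TGI. The adapter IDs and base model below are hypothetical placeholders, and the exact flag names should be checked against the PR linked above:

```shell
# Launch TGI with several LoRA adapters preloaded.
# The adapter IDs here (predibase/...) are placeholder examples, not required values.
docker run --gpus all -p 8080:80 \
  -e LORA_ADAPTERS=predibase/customer_support,predibase/magicoder \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-v0.1

# Route a single request to one specific adapter via the adapter_id parameter.
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello", "parameters": {"adapter_id": "predibase/customer_support", "max_new_tokens": 32}}'
```

Requests that omit `adapter_id` are served by the base model, so one deployment can serve both the base model and all loaded adapters.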
Will you update TEI to v1.3.0 as well?
Please rebase on the latest main branch.