Under the hood, this takes the fine-tuned model, converts it to the FasterTransformer format, and deploys it with Triton Inference Server instead of DeepSpeed.
In benchmarks, FasterTransformer runs roughly 9x faster than DeepSpeed and the Hugging Face Transformers library.
Limitations:
Currently supports only T5-based models.
Models using the T5-v1_1 architecture, including Flan-T5, produce incorrect outputs during inference.
Adds a flag for using FasterTransformer.
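As a minimal sketch of the flag's dispatch behavior, including the T5-only limitation noted above: when the flag is set, a supported T5 model goes down the FasterTransformer/Triton path, and anything else falls back to DeepSpeed. All names here (`select_backend`, the backend strings, the flag parameter) are hypothetical illustrations, not the PR's actual API.

```python
# Hypothetical sketch of the backend-selection logic described in this PR.
# Architectures the FasterTransformer path is known to handle correctly;
# T5-v1_1 (including Flan-T5) is excluded because it produces incorrect outputs.
SUPPORTED_FT_ARCHES = {"t5"}


def select_backend(model_arch: str, use_faster_transformer: bool) -> str:
    """Pick the inference stack based on the (hypothetical) flag.

    With the flag off, everything continues to use DeepSpeed as before.
    With the flag on, only supported T5 architectures are converted and
    served via Triton Inference Server with FasterTransformer.
    """
    if not use_faster_transformer:
        return "deepspeed"
    if model_arch not in SUPPORTED_FT_ARCHES:
        raise ValueError(
            f"FasterTransformer path does not support architecture {model_arch!r}"
        )
    return "triton+fastertransformer"
```

For example, `select_backend("t5", use_faster_transformer=True)` would route to the Triton/FasterTransformer path, while passing `"t5_v1_1"` would raise rather than silently serve incorrect outputs.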