NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

How to use FP8 of TransformerEngine in inference #922

Open Godlovecui opened 3 weeks ago

Godlovecui commented 3 weeks ago

ENV: 8× RTX 4090

I want to test TransformerEngine's FP8 support for inference with Llama 3 (from Hugging Face). I cannot find any instructions for inference. Can you share some code? Thank you~

timmoon10 commented 3 weeks ago

We are working on a tutorial for inference with Gemma: https://github.com/NVIDIA/TransformerEngine/blob/5cb8ed4d129245357363361947e5b1d31c543783/docs/examples/te_gemma/tutorial_generation_gemma_with_te.ipynb. We're still tweaking it, so we'd appreciate any feedback at https://github.com/NVIDIA/TransformerEngine/pull/829.
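In the meantime, here is a minimal sketch of what FP8 inference with the PyTorch API looks like. The layer, tensor shapes, and recipe settings below are illustrative placeholders, not the tutorial's code; `te.Linear`, `fp8_autocast`, and `DelayedScaling` are the relevant entry points.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 recipe: HYBRID uses E4M3 for forward-pass tensors and E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# Placeholder model: a single TE layer standing in for a full Transformer.
# Note: FP8 GEMMs require dimensions divisible by 16, hence 768/3072 here.
model = te.Linear(768, 3072, bias=True).cuda().eval()
inp = torch.randn(16, 768, device="cuda", dtype=torch.float32)

# Inference: run the forward pass inside fp8_autocast with gradients disabled.
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

print(out.shape)  # torch.Size([16, 3072])
```

The RTX 4090 is an Ada GPU, so FP8 execution is supported per the project description above. For actual text generation, the Gemma tutorial linked above shows how TE layers are wired into a Hugging Face model.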

Godlovecui commented 2 weeks ago

Hi, can TransformerEngine be compiled into a pip package if I want to use TransformerEngine in vLLM? Thank you~ @timmoon10