NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

Plans to implement HF's int8 inference? #283

Open JOHW85 opened 2 years ago

JOHW85 commented 2 years ago

It would be great if someone could look into implementing this particular version of int8 quantization (LLM.int8()) for serving LLMs: https://huggingface.co/blog/hf-bitsandbytes-integration
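For context, the linked blog post describes the LLM.int8() scheme: quantize activations and weights to int8 with per-row absmax scaling, but route the few "outlier" feature dimensions through a floating-point matmul so they don't blow up the quantization range. A minimal NumPy sketch of that idea follows; function names and the threshold value are illustrative only, not FasterTransformer or bitsandbytes APIs:

```python
import numpy as np

def quantize_absmax_int8(x):
    """Row-wise absmax quantization: scale each row so its largest
    magnitude maps to 127, then round to int8."""
    scale = np.max(np.abs(x), axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul_with_outliers(a, b, threshold=6.0):
    """LLM.int8()-style mixed-precision matmul sketch.

    Columns of `a` (feature dimensions) whose max magnitude exceeds
    `threshold` stay in float; the rest go through
    int8 quantize -> int32 accumulate -> dequantize.
    """
    outlier_cols = np.max(np.abs(a), axis=0) > threshold
    regular = ~outlier_cols

    # int8 path: row-wise scales for a, column-wise scales for b
    qa, sa = quantize_absmax_int8(a[:, regular])
    qb, sb = quantize_absmax_int8(b[regular, :].T)
    acc = qa.astype(np.int32) @ qb.astype(np.int32).T
    out = acc * sa * sb.T  # dequantize with the outer product of scales

    # float path for the outlier feature dimensions
    out += a[:, outlier_cols] @ b[outlier_cols, :]
    return out
```

The key design point is that the integer accumulation happens in int32, and the result is dequantized with the outer product of the row scales of `a` and the column scales of `b`, so quantization error stays bounded per output element even when a handful of features have large magnitudes.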

byshiue commented 2 years ago

Thank you for the suggestion. We will consider this optimization.