NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Add Transformers logits manipulators #241

Open 0xymoro opened 10 months ago

0xymoro commented 10 months ago

Hi - really interesting work. We're currently using HF TGI in production and exploring using this instead. Are there plans to add samplers like typical_p that transformers supports? That would greatly ease the transition. Thanks!

0xymoro commented 10 months ago

In particular, typical_p has proved, in our production environment (300k users), to produce significantly more natural sequences. The Python code is at line 456 of https://github.com/huggingface/transformers/blob/main/src/transformers/generation/logits_process.py. It is a fairly simple entropy calculation that filters out both high-entropy tokens (unpredictable, off the rails) and low-entropy tokens (boring, contributing nothing new).
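For reference, the filtering described above can be sketched in NumPy roughly as follows. This is a hedged, illustrative sketch modeled on the transformers `TypicalLogitsWarper` behavior, not TensorRT-LLM code; the function name `typical_p_filter` and its parameters are made up for this example.

```python
import numpy as np

def typical_p_filter(logits: np.ndarray, mass: float = 0.9,
                     min_tokens_to_keep: int = 1) -> np.ndarray:
    """Mask logits whose surprisal deviates most from the distribution's entropy.

    Illustrative sketch of typical_p (locally typical sampling); `mass` plays
    the role of the typical_p parameter.
    """
    # Softmax over the vocabulary (shifted for numerical stability).
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()

    # Entropy of the predictive distribution.
    log_probs = np.log(probs + 1e-12)
    entropy = -(probs * log_probs).sum()

    # Distance between each token's surprisal (-log p) and the entropy.
    # Small distance = "typical"; large distance = too surprising or too boring.
    deviation = np.abs(-log_probs - entropy)

    # Keep the most typical tokens until their cumulative probability >= mass.
    order = np.argsort(deviation)
    cumulative = np.cumsum(probs[order])
    cutoff = max(np.searchsorted(cumulative, mass) + 1, min_tokens_to_keep)

    # Mask everything outside the typical set with -inf.
    keep = order[:cutoff]
    filtered = np.full_like(logits, -np.inf)
    filtered[keep] = logits[keep]
    return filtered
```

On a sharply peaked distribution this keeps only the dominant token; on a flatter one it retains a larger typical set, which matches the intent of pruning both the unpredictable and the uninformative tails.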

I see that sampling is done at a much lower level here and it's pretty different, but please let me know if I can help with a PR. I'm not as familiar with CUDA programming as I am with Python, but I'm happy to help in any way I can.

juney-nvidia commented 10 months ago

@jerryMeng100

Thanks for sharing the idea.

You are more than welcome to contribute to TensorRT-LLM to add typical_p support. Currently, the community contribution process is as follows (the process may be iterated on and improved based on the concrete feedback we receive):

Please let us know whether this makes sense to you.

Thanks,
June