Open 0xymoro opened 10 months ago
In particular, typical_p has proved to produce significantly more natural sequences in production environments (we have ~300k users). The Python implementation is at line 456 of https://github.com/huggingface/transformers/blob/main/src/transformers/generation/logits_process.py, and it is a fairly simple entropy calculation that filters out both high-entropy tokens (unpredictable, off the rails) and low-entropy tokens (boring, contributing nothing new).
I see that sampling is done at a much lower level here and it's quite different, but please let me know if I can help by making a PR. I'm not as familiar with CUDA programming as I am with Python, but I'm happy to help in any way I can.
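For reference, the idea behind typical_p (as implemented by `TypicalLogitsWarper` in transformers) can be sketched roughly as follows. This is a NumPy sketch, not the transformers code verbatim; the function name, default values, and the `filter_value` parameter here are illustrative:

```python
import numpy as np

def typical_filter(logits, typical_p=0.95, filter_value=-np.inf):
    """Rough sketch of typical decoding: keep the tokens whose
    surprisal is closest to the distribution's entropy until their
    cumulative probability reaches `typical_p`; mask the rest."""
    logits = np.asarray(logits, dtype=np.float64)

    # log-softmax for numerical stability
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    probs = np.exp(log_probs)

    entropy = -(probs * log_probs).sum()       # H(p)
    deviation = np.abs(-log_probs - entropy)   # |surprisal - H|

    order = np.argsort(deviation)              # most "typical" tokens first
    cum_probs = np.cumsum(probs[order])
    # smallest prefix of typical tokens whose mass reaches typical_p
    # (always keep at least one token)
    cutoff = int(np.searchsorted(cum_probs, typical_p)) + 1

    keep = order[:cutoff]
    out = np.full_like(logits, filter_value)
    out[keep] = logits[keep]
    return out
```

This is why it filters both tails: tokens much more surprising than the entropy (off the rails) and tokens much less surprising than the entropy (too predictable) both have a large deviation and get masked first.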
@jerryMeng100
Thanks for sharing the idea.
Contributions to TensorRT-LLM to add typical_p support are more than welcome. The current community contribution process is as follows (the process may be iterated on and improved based on the concrete feedback we receive):
Please let us know whether this makes sense to you.
Thanks,
June
Hi, really interesting work. We're currently using HF TGI in production and are exploring a switch to this instead. Are there plans to add sampling options like the typical_p that transformers supports? That would greatly ease the transition. Thanks!