NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How can I use no_repeat_ngram_tensor when using Inflight Batch to control the repetition problem? #1684

Open huangizi opened 1 month ago

huangizi commented 1 month ago

```python
def logits_post_processor(self, req_id: int, logits: torch.Tensor, ids: list, stream: torch.cuda.Stream):
    cuda_stream = torch.cuda.Stream(
        stream_id=stream.stream_id,
        device_index=stream.device_index,
        device_type=1,  # == kCUDA
    )
    with torch.cuda.stream(cuda_stream):
        current_input_ids = torch.tensor(ids, device=logits.device).unsqueeze(0)
        logits_processor = NoRepeatNGramLogitsProcessor(no_repeat_ngram_size=8)
        scores = logits_processor(current_input_ids, logits)
        logits.copy_(scores)

ir = tb.InferenceRequest(i, self.logits_post_processor)
```

Right now I use this custom `logits_post_processor` callback to control repetition, but a custom Python callback is certainly too slow and inefficient. What interface should I use to control n-gram repeats, rather than `repetition_penalty`? @lukeyeager @aaronp24 @Superjomn @aflat @seanprime7
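For context, the core check that a no-repeat-ngram processor performs can be sketched in plain Python. This is only an illustrative sketch of the technique (not TensorRT-LLM's implementation, and the function name `ban_repeated_ngrams` is made up here): mask any token that would complete an n-gram already present in the generated sequence.

```python
import math

def ban_repeated_ngrams(token_ids, logits, ngram_size):
    """Set to -inf the logits of tokens that would repeat a seen n-gram.

    token_ids:  list of token ids generated so far
    logits:     list of floats, one score per vocabulary token
    ngram_size: n; any n-gram may appear at most once in the output
    Returns a new logits list with banned tokens masked out.
    """
    logits = list(logits)
    if len(token_ids) < ngram_size - 1:
        return logits
    # The (n-1)-token suffix that the next token would extend into an n-gram.
    prefix = tuple(token_ids[-(ngram_size - 1):])
    # Ban every token that has previously followed this same prefix.
    for i in range(len(token_ids) - ngram_size + 1):
        if tuple(token_ids[i:i + ngram_size - 1]) == prefix:
            banned = token_ids[i + ngram_size - 1]
            logits[banned] = -math.inf
    return logits
```

For example, with `ngram_size=3` and history `[1, 2, 3, 1, 2]`, the trigram `(1, 2, 3)` has already occurred, so token `3` is masked for the next step. Doing this scan per request in Python inside the callback is exactly the overhead the question is about; a native implementation would run this on-device as part of sampling.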

byshiue commented 1 month ago

The feature is not exposed now. We will support it soon.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.