TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
```python
import torch
from transformers import NoRepeatNGramLogitsProcessor

def logits_post_processor(self, req_id: int, logits: torch.Tensor, ids: list, stream: torch.cuda.Stream):
    # Wrap the runtime's CUDA stream so the PyTorch ops run on it
    cuda_stream = torch.cuda.Stream(
        stream_id=stream.stream_id,
        device_index=stream.device_index,
        device_type=1,  # == kCUDA
    )
    with torch.cuda.stream(cuda_stream):
        current_input_ids = torch.tensor(ids, device=logits.device).unsqueeze(0)
        logits_processor = NoRepeatNGramLogitsProcessor(no_repeat_ngram_size=8)
        scores = logits_processor(current_input_ids, logits)
        logits.copy_(scores)

ir = tb.InferenceRequest(i, self.logits_post_processor)
```

Currently I use this custom `logits_post_processor` callback to control repetition, but doing it in custom Python code is far too slow and inefficient. What interface should I use to control n-gram repeats, rather than `repetition_penalty`? @lukeyeager @aaronp24 @Superjomn @aflat @seanprime7
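For context, the n-gram ban that `NoRepeatNGramLogitsProcessor` applies can be sketched in plain PyTorch. This is an illustrative re-implementation of the idea (the function name `ban_repeated_ngrams` is mine, not a library API): any token that would complete an n-gram already present in the generated sequence gets its logit set to -inf.

```python
import torch

def ban_repeated_ngrams(input_ids: torch.Tensor, logits: torch.Tensor, n: int) -> torch.Tensor:
    """Forbid tokens that would repeat an already-seen n-gram.

    input_ids: (batch, seq_len) tokens generated so far.
    logits:    (batch, vocab) scores for the next token.
    Returns a copy of logits with banned tokens set to -inf.
    """
    batch, seq_len = input_ids.shape
    if seq_len < n:
        return logits
    logits = logits.clone()
    for b in range(batch):
        ids = input_ids[b].tolist()
        # The last n-1 tokens form the prefix the next token would extend
        prefix = tuple(ids[-(n - 1):]) if n > 1 else tuple()
        # Scan history: wherever this prefix already occurred, ban the token
        # that followed it, since emitting it would repeat that n-gram.
        for i in range(seq_len - n + 1):
            if tuple(ids[i:i + n - 1]) == prefix:
                logits[b, ids[i + n - 1]] = float("-inf")
    return logits
```

For example, with the sequence `[1, 2, 3, 1, 2]` and `n=3`, the current suffix `(1, 2)` was previously followed by `3`, so token `3` is banned for the next step.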