no_repeat_ngram_size prohibits the tokens that would lead to a repeated n-gram; it does not terminate the generation directly. However, once some tokens are prohibited, the end_id may become the most likely candidate.
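For intuition, here is a minimal sketch of that banning logic in the usual HuggingFace/FairSeq style (not TRT-LLM's actual GPU implementation; the function name and shapes are illustrative):

```python
import torch

def ban_repeated_ngrams(input_ids: torch.Tensor, logits: torch.Tensor,
                        ngram_size: int) -> torch.Tensor:
    """Mask out every token that would complete an n-gram already seen in the sequence.

    input_ids: [seq_len] tokens generated so far for a single sequence.
    logits:    [vocab_size] scores for the next token (modified in place).
    """
    if ngram_size <= 0 or input_ids.numel() < ngram_size:
        return logits
    tokens = input_ids.tolist()
    # The last (ngram_size - 1) tokens are the prefix of the would-be n-gram.
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    # Collect every token that has previously followed this prefix.
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    for tok in banned:
        logits[tok] = float("-inf")
    # end_id is never banned here, so once many continuations are masked out,
    # the EOS token can easily become the most likely candidate.
    return logits
```

For example, with tokens [1, 2, 3, 1, 2] and ngram_size=3, token 3 is banned because the 3-gram 1, 2, 3 already occurred.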
Thanks for the clarification!
So, to the best of my understanding, this is meant to be used in conjunction with beam_width > 1, right?
As far as I'm aware, this is not currently supported for the Streaming + Inflight Batching configuration with the Triton Server, correct?
You can use no_repeat_ngram_size with both beam search and sampling.
Streaming + Inflight Batching does not support beam search at the moment.
@TeodorPoncu, no_repeat_ngram is a common concept in logit processing; in TRT-LLM it has the same behavior as HuggingFace's and FairSeq's definitions.
Moreover, one additional feature TRT-LLM provides here is flexible n-gram control within a batch: you can provide a [batch_size] no_repeat_ngram_size tensor so that each sequence in the batch obeys a different n-gram size. TRT-LLM processes this constraint in parallel on the GPU for the whole batch.
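As a rough illustration of that per-request control from the Python runtime (assuming your TRT-LLM version exposes no_repeat_ngram_size on SamplingConfig, as the generation.py linked in the original question below suggests; whether a [batch_size] tensor or only a single int is accepted varies by version, so treat this as a sketch rather than the confirmed API):

```python
import torch
from tensorrt_llm.runtime import SamplingConfig

# Illustrative batch of 3 requests, each with its own n-gram constraint:
# request 0 bans repeated 3-grams, request 1 bans repeated 2-grams,
# request 2 uses 0, i.e. no constraint (all values chosen for the example).
per_request_ngram = torch.tensor([3, 2, 0], dtype=torch.int32)

# end_id / pad_id are model-specific placeholders; num_beams=1 means sampling.
# Whether a per-batch tensor (vs. one int shared by the whole batch) is accepted
# here depends on the installed TRT-LLM version -- verify against your install.
sampling_config = SamplingConfig(end_id=2, pad_id=2, num_beams=1,
                                 no_repeat_ngram_size=per_request_ngram)
```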
Mark as resolved for now. Feel free to reopen if you have further questions.
> You can use no_repeat_ngram_size with both beam search and sampling. Streaming + Inflight Batching does not support beam search now.
So if I use inflight batching, can n-gram repeat control still be used? How do I use no_repeat_ngram_size with inflight batching? I use inflight batching as follows:
```python
# GptManager drives inflight batching: it pulls work through the fetch_requests
# callback and returns results through response_cb.
with tb.GptManager(self.engin_path, modelType, 4, tb.SchedulerPolicy.GUARANTEED_NO_EVICT,
                   fetch_requests, response_cb, should_stop, stats_cb, opt_params, 10000) as manager:
    # Poll until every queued request has been answered.
    while remaining_requests > 0:
        time.sleep(0.1)
    assert manager is not None
    assert memory_counters.gpu > init_gpu_mem
```
First of all, thanks for this amazing package!
Context: We're experimenting with running some rather unruly LLMs (i.e., they love repeating themselves in some cases). Due to the nature of our target generation, using a repetition penalty is a no-go.
A previous naive solution we used with a different deployment option was simply to sample two responses from the LLM and check whether the initial sample contained repetitions.
Since we're also interested in the streaming use cases integrated with the Triton Server, the naive solution we'd put in place for avoiding served repetitions is to stream back to a front-end application in a delayed fashion, keep track of exactly where in the LLM answer the repetition started, and then make a subsequent stream request if need be.
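As a purely illustrative example of that client-side check, the repetition detection on streamed token ids only takes a few lines of plain Python (the helper name and n-gram length are arbitrary):

```python
def find_first_repeated_ngram(token_ids, n=3):
    """Return the index where a previously seen n-gram starts repeating, or None."""
    seen = {}
    for i in range(len(token_ids) - n + 1):
        gram = tuple(token_ids[i:i + n])
        if gram in seen:
            return i  # repetition begins here; earlier occurrence was at seen[gram]
        seen[gram] = i
    return None

# Example: detect where a 3-gram starts repeating in a streamed answer.
tokens = [5, 9, 7, 3, 5, 9, 7, 3]
print(find_first_repeated_ngram(tokens, n=3))  # -> 4
```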
Question: I've looked through this code: https://github.com/NVIDIA/TensorRT-LLM/blob/a21e2f85178111fed9812bb88c2cc7411b25f0ba/tensorrt_llm/runtime/generation.py#L1584
I saw that no_repeat_ngram_size influences the value of should_stop, which I assume is a boolean. My question is as follows: when no_repeat_ngram_size has a set value, will the inference engine regenerate the sequence once it detects an n-gram repeat, or just treat it like a standard termination reason (i.e., MAX_SEQ_LEN or EOS token reached)?
Thanks!