NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

What does `no_repeat_ngram_size` exactly do? #492

Closed TeodorPoncu closed 11 months ago

TeodorPoncu commented 11 months ago

First of all, thanks for this amazing package!

Context: We're experimenting with running some rather unruly LLMs (i.e., they love repeating themselves in some cases). Due to the nature of our target generation, using a repetition penalty is a no-go.

A previous naive solution we used with a different deployment option was simply to sample two responses from the LLM and check whether the initial sample contained repetitions.

Since we're also interested in streaming use cases integrated with the Triton Server, the naive solution we'd have in place for avoiding serving repetitions is to stream back to the front-end application with a delay, keep track of exactly where in the LLM answer the repetition started, and then make a subsequent stream request if need be.

Question: I've looked through this code: https://github.com/NVIDIA/TensorRT-LLM/blob/a21e2f85178111fed9812bb88c2cc7411b25f0ba/tensorrt_llm/runtime/generation.py#L1584

I saw that no_repeat_ngram_size influences the value of should_stop, which I assume is a boolean. My questions are as follows:

  1. If no_repeat_ngram_size has a set value, will the inference engine regenerate the sequence once it detects an n-gram repeat, or will it just treat this like a standard termination reason (e.g., MAX_SEQ_LEN or the EOS token being reached)?
  2. If the inference engine just terminates the generation, is there any way to find out whether the termination was caused by no_repeat_ngram_size?

Thanks!

byshiue commented 11 months ago

no_repeat_ngram_size prohibits tokens that would lead to a repeated n-gram; it does not terminate the generation directly. But when it prohibits some tokens, the end_id may become the most probable candidate.
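
For intuition, here is a minimal Python sketch of the banning step described above. It is an illustration of the concept only, not TRT-LLM's actual GPU implementation, and the function names are made up:

```python
def banned_ngram_tokens(generated: list[int], ngram_size: int) -> set[int]:
    """Return token ids that, if generated next, would repeat an existing n-gram."""
    if ngram_size <= 0 or len(generated) + 1 < ngram_size:
        return set()  # no complete n-gram can be formed yet
    # The (n-1)-token suffix that the next token would extend into an n-gram.
    prefix = tuple(generated[-(ngram_size - 1):]) if ngram_size > 1 else tuple()
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # this token would complete a repeated n-gram
    return banned

def apply_no_repeat_ngram(logits: list[float], generated: list[int], ngram_size: int) -> list[float]:
    """Set the logits of banned tokens to -inf so they can never be selected."""
    out = list(logits)
    for tok in banned_ngram_tokens(generated, ngram_size):
        out[tok] = float("-inf")
    return out
```

Because banned tokens only get their logits suppressed, generation continues; but if every likely continuation is banned, end_id can end up as the most probable candidate, which is why the sequence may appear to stop early.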

TeodorPoncu commented 11 months ago

Thanks for the clarification!

So, to the best of my understanding, this is meant to be used in conjunction with beam_width > 1, right?

As far as I'm aware, this is not currently supported for the Streaming + Inflight Batching configuration with the Triton Server, correct?

byshiue commented 11 months ago

You can use no_repeat_ngram_size with both beam search and sampling.

Streaming + Inflight Batching does not support beam search at the moment.

symphonylyh commented 11 months ago

@TeodorPoncu, no_repeat_ngram is a common concept in logit processing; in TRT-LLM it has the same behavior as HuggingFace's and FairSeq's definitions.
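
For reference, the HuggingFace counterpart is the no_repeat_ngram_size argument of generate() (implemented by NoRepeatNGramLogitsProcessor); the model below is just a placeholder for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox", return_tensors="pt")
# Forbid any 3-gram from occurring more than once in the generated text.
out = model.generate(**inputs, max_new_tokens=64, no_repeat_ngram_size=3)
print(tok.decode(out[0], skip_special_tokens=True))
```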

Moreover, one additional feature TRT-LLM provides here is flexible n-gram control within a batch, i.e., you can provide a [batch_size] no_repeat_ngram_size tensor and let each sentence in the batch obey a different n-gram size. TRT-LLM processes this constraint in parallel on the GPU for the whole batch.
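
To make the per-batch behavior concrete, here is a hypothetical CPU sketch of what a [batch_size] no_repeat_ngram_size tensor means; the function name and the Python loops are illustrative only, since TRT-LLM applies this constraint in parallel in its own GPU kernels:

```python
import torch

def ban_repeated_ngrams_batched(logits: torch.Tensor,        # [batch_size, vocab_size]
                                generated: list[list[int]],  # generated tokens per sequence
                                ngram_sizes: torch.Tensor    # [batch_size] per-sequence n-gram sizes
                                ) -> torch.Tensor:
    logits = logits.clone()
    for b, (seq, n) in enumerate(zip(generated, ngram_sizes.tolist())):
        if n <= 0 or len(seq) + 1 < n:
            continue  # this row cannot form a complete n-gram yet
        prefix = tuple(seq[-(n - 1):]) if n > 1 else tuple()
        for i in range(len(seq) - n + 1):
            ngram = tuple(seq[i:i + n])
            if ngram[:-1] == prefix:
                logits[b, ngram[-1]] = float("-inf")  # ban the completing token for row b
    return logits
```

Each row simply obeys its own n-gram size, so e.g. ngram_sizes = torch.tensor([2, 4]) bans repeated bigrams in the first sequence and repeated 4-grams in the second.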

symphonylyh commented 11 months ago

Mark as resolved for now. Feel free to reopen if you have further questions.

huangizi commented 5 months ago

> You can use no_repeat_ngram_size with both beam search and sampling.
>
> Streaming + Inflight Batching does not support beam search at the moment.

So if I use in-flight batching, can no_repeat_ngram_size still be used? How can I use it with in-flight batching? I'm using in-flight batching as follows:

```python
# tb is tensorrt_llm.bindings; the callbacks (fetch_requests, response_cb, should_stop,
# stats_cb), opt_params, and the counters referenced below are defined elsewhere in my
# code, which also imports time.
with tb.GptManager(self.engin_path, modelType, 4, tb.SchedulerPolicy.GUARANTEED_NO_EVICT,
                   fetch_requests, response_cb, should_stop, stats_cb, opt_params, 10000) as manager:
    # Poll until every submitted request has been answered via response_cb.
    while remaining_requests > 0:
        time.sleep(0.1)
    assert manager is not None
    assert memory_counters.gpu > init_gpu_mem
```