NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Timeline for adding IFB support to more models? #1832

Open · AndyZZt opened this issue 3 days ago

AndyZZt commented 3 days ago

I noticed that currently only a few series of models, including Qwen, ChatGLM, and GPT, support IFB. The lack of support for other models severely limits the practicality of the TRT-LLM framework in production environments. Is there a timeline for adding IFB support to more models, e.g. the LLaMA series, or are there guidelines for users to add IFB support for specific models themselves?

### Tasks
- [ ] In-flight batching support for more models
byshiue commented 3 days ago

IFB on LLaMA series models is supported. Do you encounter any issues?

AndyZZt commented 3 days ago

> IFB on LLaMA series models is supported. Do you encounter any issues?

When comparing the performance of TRT-LLM with other inference frameworks, I found that TRT-LLM performs poorly when handling multiple requests: it still processes requests serially under multiple clients, indicating that IFB is not enabled. My testing environment is an NVIDIA A10 GPU with the image nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3, TRT-LLM version 0.10.0, in combination with Triton Server, and the LLaMA-3-8B-Instruct model. Here is our testing result (screenshot attached).

Another piece of evidence is that the support matrix in ./examples/llama/README.md/#supportmatrix does not show support for IFB; in fact, the documentation states that only the Qwen, ChatGLM, and GPT series support IFB (screenshot attached).

hijkzzz commented 3 days ago

> When comparing the performance of TRT-LLM with other inference frameworks, I found that TRT-LLM performs poorly when handling multiple requests: it still processes requests serially under multiple clients, indicating that IFB is not enabled. [...] The documentation states that only the Qwen, ChatGLM, and GPT series support IFB.

What is your `--max_batch_size` (`trtllm-build`) set to? Have you tried testing it without using Triton Server, for example, by using the following low-level API:

```python
import argparse
import logging 
import time 
from datetime import datetime, timedelta 
from pathlib import Path 
from threading import Thread 

import tensorrt_llm 
import tensorrt_llm.bindings.executor as trtllm 
from transformers import PreTrainedTokenizerFast 

logger = logging.getLogger(__name__) 

def tensorrt_llm_executor_worker_path() -> str: 
    worker_path = Path(tensorrt_llm.__file__).parent / 'bin' / 'executorWorker' 
    if not worker_path.exists(): 
        raise Exception("TensorRT-LLM executor worker not found") 
    return str(worker_path) 

def get_trt_parallel_config(): 
    world_size = 2 
    if world_size > 1: 
        executor_worker_path = tensorrt_llm_executor_worker_path() 
        orchestrator_config = trtllm.OrchestratorConfig(True, executor_worker_path) 
        return trtllm.ParallelConfig( 
            trtllm.CommunicationType.MPI, 
            trtllm.CommunicationMode.ORCHESTRATOR, 
            orchestrator_config=orchestrator_config, 
            # TODO:BIS fix device_ids 
            device_ids=[0, 1], 
        ) 
    else: 
        return trtllm.ParallelConfig(trtllm.CommunicationType.MPI, trtllm.CommunicationMode.LEADER) 

def create_executor(model_path: str) -> trtllm.Executor: 
    trt_parallel_config = get_trt_parallel_config() 
    trt_scheduler_config = trtllm.SchedulerConfig(trtllm.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT) 

    return trtllm.Executor( 
        Path(model_path), 
        trtllm.ModelType.DECODER_ONLY, 
        trtllm.ExecutorConfig( 
            1, 
            parallel_config=trt_parallel_config, 
            normalize_log_probs=False, 
            batching_type=trtllm.BatchingType.INFLIGHT, 
            scheduler_config=trt_scheduler_config, 
        ), 
    ) 

def create_request(input_ids, output_len, eos_id: int, sample_params): 
    output_config = trtllm.OutputConfig(exclude_input_from_output=True) 
    ## This seems to somewhat resolve the issue 
    # sampling_config = trtllm.SamplingConfig(beam_width=1, frequency_penalty=1.0) 
    request = trtllm.Request( 
        input_token_ids=input_ids, 
        max_new_tokens=output_len, 
        streaming=True, 
        output_config=output_config, 
        end_id=eos_id, 
        sampling_config=sample_params, 
    ) 
    return request 

trt_id = None 

def main(): 
    default_prompt = "You have been tasked with designing a solar-powered water heating system for a residential building. Describe the key components and considerations you would include in your design. Design a five-step workflow.!" 
    # default_prompt = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\nYou have been tasked with designing a solar-powered water heating system for a residential building. Describe the key components and considerations you would include in your design. Design a five-step workflow.!<|eot_id|><|start_header_id|>assistant<|end_header_id|>" 
    parser = argparse.ArgumentParser() 
    parser.add_argument("--model_path", required=False, default="./tmp/llama3-8b-tp2-engine") 
    parser.add_argument("--tokenizer_path", required=False, default="/home/scratch.trt_llm_data/llm-models/llama-models-v3/llama-v3-8b-instruct-hf/") 
    parser.add_argument("--prompt", required=False, default=default_prompt) 

    args = parser.parse_args() 

    tokenizer = PreTrainedTokenizerFast.from_pretrained(args.tokenizer_path) 
    executor = create_executor(args.model_path) 
    prompt = args.prompt 
    prompt_ids = tokenizer.encode(prompt) 
    print(prompt_ids) 

    def do_decode(sampling_config): 
        output_ids = [] 
        finished = False 
        req = create_request(prompt_ids, 150, tokenizer.eos_token_id, sampling_config) 
        _ = executor.enqueue_request(req) 
        while not finished: 
            responses = executor.await_responses(timeout=timedelta(seconds=1)) 
            for r in responses: 
                if r.has_error(): 
                    raise RuntimeError(r.error_msg) 
                result = r.result 
                output_ids.extend(result.output_token_ids[0]) 
                if result.is_final: 
                    finished = True 
        return tokenizer.decode(output_ids) 

    print(do_decode(trtllm.SamplingConfig(beam_width=1, top_k=1, random_seed=1234)))     
    print("===================================") 

    executor.shutdown() 

if __name__ == "__main__": 
    main()
```
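
As an additional sanity check, the config.json that `trtllm-build` writes next to the engine files can be inspected for the build options IFB relies on (paged KV cache, removed/packed input padding, max batch size). Below is a minimal sketch, assuming the usual engine-directory layout; exact key names differ across TRT-LLM versions, so it simply scans the whole config rather than assuming a fixed structure.

```python
# Hedged sketch: dump IFB-related build options from an engine directory's
# config.json. Key layouts vary across TensorRT-LLM versions, so this scans
# the whole file instead of assuming exact paths.
import json
from pathlib import Path

def print_ifb_build_flags(engine_dir: str) -> None:
    config = json.loads((Path(engine_dir) / "config.json").read_text())
    hints = ("paged_kv_cache", "remove_input_padding", "max_batch_size")

    def walk(node, prefix=""):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, prefix + key + ".")
        elif any(hint in prefix for hint in hints):
            print(prefix.rstrip("."), "=", node)

    walk(config)

# Example (engine path taken from the script above):
# print_ifb_build_flags("./tmp/llama3-8b-tp2-engine")
```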
AndyZZt commented 3 days ago

> What is your `--max_batch_size` (`trtllm-build`) set to? Have you tried testing it without using Triton Server, for example, by using the low-level API above?

Thank you, I'll try it and post my results here later.

hijkzzz commented 3 days ago

> `enqueue_request`

Just call `executor.enqueue_request` multiple times and get responses using `executor.await_responses`.
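
For reference, here is a minimal sketch of that pattern, reusing `create_executor` / `create_request` from the script above. It assumes each response carries a `request_id` matching the id returned by `enqueue_request`, and that the tokenizer exposes `eos_token_id` as in the example.

```python
# Hedged sketch: enqueue several requests before draining responses so the
# executor can batch them in flight. Reuses create_executor/create_request
# from the script above; assumes response.request_id identifies the request.
from datetime import timedelta

def decode_many(executor, tokenizer, prompts, sampling_config, max_new_tokens=150):
    outputs = {}  # request id -> accumulated output token ids
    for prompt in prompts:
        req = create_request(tokenizer.encode(prompt), max_new_tokens,
                             tokenizer.eos_token_id, sampling_config)
        outputs[executor.enqueue_request(req)] = []

    pending = set(outputs)
    while pending:
        for r in executor.await_responses(timeout=timedelta(seconds=1)):
            if r.has_error():
                raise RuntimeError(r.error_msg)
            result = r.result
            outputs[r.request_id].extend(result.output_token_ids[0])
            if result.is_final:
                pending.discard(r.request_id)

    return [tokenizer.decode(token_ids) for token_ids in outputs.values()]
```

With two or more prompts enqueued this way, responses should interleave across requests if in-flight batching is active, rather than the second request only starting after the first one finishes.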