AndyZZt opened 3 days ago

I noticed that currently only a few series of models, including Qwen, ChatGLM, and GPT, support IFB (in-flight batching). The lack of support for other models severely impacts the practicality of the TRT-LLM framework in production environments. I would like to ask whether there is a timeline for adding IFB support to other models, e.g. the LLaMA series, or whether there are guidelines for users to add IFB support for specific models themselves.
IFB is supported on LLaMA-series models. Are you encountering any issues?
When comparing the performance of TRT-LLM with other inference frameworks, I found that TRT-LLM performs poorly when handling multiple requests: it still processes requests serially under multiple concurrent clients, which suggests that IFB is not enabled. My testing environment is an NVIDIA A10 GPU, using the image nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3, TRT-LLM version 0.10.0 in combination with Triton Server, and the Llama-3-8B-Instruct model. Here is our testing result:
Another piece of evidence is that the support matrix in ./examples/llama/README.md/#supportmatrix does not list IFB for LLaMA. In fact, the documentation states that only the Qwen, ChatGLM, and GPT series support IFB.
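A minimal sketch of the kind of client-side check behind this observation (the endpoint URL, model name, and payload fields assume the tensorrtllm_backend generate endpoint at its default address and are illustrative, so adjust them for your deployment):

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v2/models/ensemble/generate"  # assumed Triton deployment
PAYLOAD = {"text_input": "What is machine learning?", "max_tokens": 128}

def one_request(_):
    return requests.post(URL, json=PAYLOAD).status_code

n = 8

# n requests back to back.
t0 = time.time()
for i in range(n):
    one_request(i)
serial = time.time() - t0

# n concurrent clients.
t0 = time.time()
with ThreadPoolExecutor(max_workers=n) as pool:
    list(pool.map(one_request, range(n)))
concurrent = time.time() - t0

# With IFB working, the concurrent run should finish in roughly the time of a
# single request; a time close to `serial` indicates serialized execution.
print(f"serial: {serial:.1f}s, concurrent: {concurrent:.1f}s")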
What is your --max_batch_size (trtllm-build) set to? Have you tried testing it without using Triton Server, for example, by using the following low-level API:
import argparse
import logging
from datetime import timedelta
from pathlib import Path

import tensorrt_llm
import tensorrt_llm.bindings.executor as trtllm
from transformers import PreTrainedTokenizerFast

logger = logging.getLogger(__name__)


def tensorrt_llm_executor_worker_path() -> str:
    # In orchestrator mode, this binary is spawned once per rank.
    worker_path = Path(tensorrt_llm.__file__).parent / 'bin' / 'executorWorker'
    if not worker_path.exists():
        raise Exception("TensorRT-LLM executor worker not found")
    return str(worker_path)


def get_trt_parallel_config():
    world_size = 2  # hard-coded for a 2-way tensor-parallel engine
    if world_size > 1:
        executor_worker_path = tensorrt_llm_executor_worker_path()
        orchestrator_config = trtllm.OrchestratorConfig(True, executor_worker_path)
        return trtllm.ParallelConfig(
            trtllm.CommunicationType.MPI,
            trtllm.CommunicationMode.ORCHESTRATOR,
            orchestrator_config=orchestrator_config,
            # TODO:BIS fix device_ids
            device_ids=[0, 1],
        )
    else:
        return trtllm.ParallelConfig(trtllm.CommunicationType.MPI,
                                     trtllm.CommunicationMode.LEADER)


def create_executor(model_path: str) -> trtllm.Executor:
    trt_parallel_config = get_trt_parallel_config()
    # GUARANTEED_NO_EVICT: once scheduled, a request keeps its KV-cache pages
    # until it finishes.
    trt_scheduler_config = trtllm.SchedulerConfig(trtllm.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT)
    return trtllm.Executor(
        Path(model_path),
        trtllm.ModelType.DECODER_ONLY,
        trtllm.ExecutorConfig(
            1,  # max_beam_width
            parallel_config=trt_parallel_config,
            normalize_log_probs=False,
            batching_type=trtllm.BatchingType.INFLIGHT,  # enable in-flight batching
            scheduler_config=trt_scheduler_config,
        ),
    )


def create_request(input_ids, output_len, eos_id: int, sample_params):
    output_config = trtllm.OutputConfig(exclude_input_from_output=True)
    ## This seems to somewhat resolve the issue
    # sampling_config = trtllm.SamplingConfig(beam_width=1, frequency_penalty=1.0)
    request = trtllm.Request(
        input_token_ids=input_ids,
        max_new_tokens=output_len,
        streaming=True,
        output_config=output_config,
        end_id=eos_id,
        sampling_config=sample_params,
    )
    return request


def main():
    default_prompt = "You have been tasked with designing a solar-powered water heating system for a residential building. Describe the key components and considerations you would include in your design. Design a five-step workflow.!"
    # default_prompt = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\nYou have been tasked with designing a solar-powered water heating system for a residential building. Describe the key components and considerations you would include in your design. Design a five-step workflow.!<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", required=False, default="./tmp/llama3-8b-tp2-engine")
    parser.add_argument("--tokenizer_path", required=False, default="/home/scratch.trt_llm_data/llm-models/llama-models-v3/llama-v3-8b-instruct-hf/")
    parser.add_argument("--prompt", required=False, default=default_prompt)
    args = parser.parse_args()

    tokenizer = PreTrainedTokenizerFast.from_pretrained(args.tokenizer_path)
    executor = create_executor(args.model_path)

    prompt = args.prompt
    prompt_ids = tokenizer.encode(prompt)
    print(prompt_ids)

    def do_decode(sampling_config):
        output_ids = []
        finished = False
        req = create_request(prompt_ids, 150, tokenizer.eos_token_id, sampling_config)
        _ = executor.enqueue_request(req)
        # Poll until the streaming request reports its final token.
        while not finished:
            responses = executor.await_responses(timeout=timedelta(seconds=1))
            for r in responses:
                if r.has_error():
                    raise RuntimeError(r.error_msg)
                result = r.result
                output_ids.extend(result.output_token_ids[0])
                if result.is_final:
                    finished = True
        return tokenizer.decode(output_ids)

    print(do_decode(trtllm.SamplingConfig(beam_width=1, top_k=1, random_seed=1234)))
    print("===================================")
    executor.shutdown()


if __name__ == "__main__":
    main()
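A few notes on the script: the first positional argument to ExecutorConfig is max_beam_width, and batching_type=trtllm.BatchingType.INFLIGHT is what selects in-flight batching at the runtime level. Because the ParallelConfig uses ORCHESTRATOR mode, you can launch this as a plain python process and the executor spawns the TP ranks itself through the executorWorker binary, so no mpirun is needed. Also, IFB requires an engine built with the paged KV cache; in 0.10 trtllm-build should enable --paged_kv_cache by default, but it is worth double-checking that along with --max_batch_size.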
Thank you, I'll try it and post my result here later.
Just call executor.enqueue_request multiple times and get the responses using executor.await_responses.
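A minimal sketch of that multi-request pattern, dropped into main() from the script above in place of the single do_decode call (executor, create_request, prompt_ids, and tokenizer are assumed to be in scope; the count of 8 is arbitrary). With IFB active, all streams should make progress together rather than finishing strictly one after another:

# Enqueue several identical streaming requests before reading any output.
req_ids = [
    executor.enqueue_request(
        create_request(prompt_ids, 150, tokenizer.eos_token_id,
                       trtllm.SamplingConfig(beam_width=1)))
    for _ in range(8)
]
outputs = {rid: [] for rid in req_ids}
pending = set(req_ids)
while pending:
    # Responses from different requests arrive interleaved when IFB is on.
    for r in executor.await_responses(timeout=timedelta(seconds=1)):
        if r.has_error():
            raise RuntimeError(r.error_msg)
        outputs[r.request_id].extend(r.result.output_token_ids[0])
        if r.result.is_final:
            pending.discard(r.request_id)
for rid in req_ids:
    print(rid, tokenizer.decode(outputs[rid]))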