TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
I am trying to deploy and run inference with the XLM-RoBERTa model on TRT-LLM.
I followed the example guide for BERT and built the engine: (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bert)
However, I am not sure what to do next!
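From the example I can at least run the engine locally with the Python runtime. Here is a minimal sketch adapted from examples/bert/run.py — the engine file name, the float16 output dtype, and the tensor names `input_ids`/`input_lengths` are taken from that example and may not match my XLM-RoBERTa build:

```python
# Minimal local smoke test, adapted from examples/bert/run.py.
# Assumptions: the engine path/name, float16 outputs, and the input
# tensor names 'input_ids'/'input_lengths' all follow the BERT example.
import tensorrt as trt
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import Session, TensorInfo

ENGINE_PATH = "engine_outputs/bert_float16.engine"  # whatever build.py wrote

# Load the serialized engine produced by the example's build script.
with open(ENGINE_PATH, "rb") as f:
    session = Session.from_serialized_engine(f.read())

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
enc = tokenizer("Hello world", return_tensors="pt")
input_ids = enc["input_ids"].int().cuda()
input_lengths = torch.tensor([input_ids.shape[1]], dtype=torch.int32,
                             device="cuda")

# Query the output shapes for these input shapes, allocate buffers, run.
output_info = session.infer_shapes([
    TensorInfo("input_ids", trt.DataType.INT32, tuple(input_ids.shape)),
    TensorInfo("input_lengths", trt.DataType.INT32, tuple(input_lengths.shape)),
])
outputs = {
    t.name: torch.empty(tuple(t.shape), dtype=torch.float16, device="cuda")
    for t in output_info
}
stream = torch.cuda.current_stream().cuda_stream
ok = session.run(inputs={"input_ids": input_ids,
                         "input_lengths": input_lengths},
                 outputs=outputs, stream=stream)
assert ok, "engine execution failed"
torch.cuda.synchronize()
print({name: tuple(t.shape) for name, t in outputs.items()})
```

That covers a local smoke test, but I could not find anything on actually serving the engine.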
For Llama models there is a detailed guide on how to pass inputs and run inference, but for BERT models there is no information at all.
So, after building the engine, I tried to apply the Llama instructions to the BERT model as follows:
```bash
python3 ${COMMON_DIR}tools/fill_template.py -i ${COMMON_DIR}inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_dir:${BERT_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 ${COMMON_DIR}tools/fill_template.py -i ${COMMON_DIR}inflight_batcher_llm/postprocessing/config.pbtxt \
    tokenizer_dir:${BERT_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 ${COMMON_DIR}tools/fill_template.py -i ${COMMON_DIR}inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False
python3 ${COMMON_DIR}tools/fill_template.py -i ${COMMON_DIR}inflight_batcher_llm/ensemble/config.pbtxt \
    triton_max_batch_size:64
python3 ${COMMON_DIR}tools/fill_template.py -i ${COMMON_DIR}inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_DIR},max_tokens_in_paged_kv_cache:81920,max_attention_window_size:81920,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
python3 /opt/tritonserver/scripts/launch_triton_server.py --world_size 1 \
    --model_repo=/opt/tritonserver/TensorRT_LLM_XLM_RoBERTa/ckpt/xlm-roberta-large/
```
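For what it's worth, once the server is up this is how I planned to sanity-check it — a minimal sketch using the standard tritonclient package; `localhost:8000` assumes the default Triton HTTP port:

```python
# Minimal readiness check with the standard Triton HTTP client.
# Assumes the default HTTP port 8000; the model names come from
# the model repository passed to launch_triton_server.py above.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())

# List what the server actually loaded, with per-model state,
# to see which model (if any) failed to come up.
for model in client.get_model_repository_index():
    print(model["name"], model.get("version"), model.get("state"))
```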
But it throws an error.
It would be great if somebody could guide me!