Open nikhilcms opened 1 month ago
I think you did not set LD_LIBRARY_PATH env correctly. You can refer this!
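For instance, a minimal sketch of setting it (the site-packages path below is an assumption for a Python 3.10 system-wide install; adjust it to wherever the trt-llm wheel put its libs):

```shell
# Point the dynamic loader at the trt-llm libs before starting tritonserver.
# NOTE: this path is an assumption; verify it in your own environment.
export TRT_LLM_LIB_DIR=/usr/local/lib/python3.10/dist-packages/tensorrt_llm/lib
export LD_LIBRARY_PATH=${TRT_LLM_LIB_DIR}:${LD_LIBRARY_PATH}
```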
Hi @lmcl90, actually the libtensorrt_llm.so file is missing. For me, only the three files below are present inside /opt/tritonserver/backends/tensorrtllm. Curious to know which step is responsible for creating the libtensorrt_llm.so file?

```shell
$ ls /opt/tritonserver/backends/tensorrtllm
libtriton_tensorrtllm.so  libtriton_tensorrtllm_common.so  trtllmExecutorWorker
```
@nikhilcms libtensorrt_llm.so is a product of compiling the TensorRT-LLM sources and is distributed in the whl file. You can get it by building and installing trt-llm. See lines 47 and 60 of that dockerfile.
If you use a virtualenv, you will find the .so under $VIRTUAL_ENV/lib/python3.10/site-packages/tensorrt_llm/lib after installation.
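One way to double-check where (or whether) the wheel installed the library is a small script like this; the `lib/` subdirectory layout is an assumption based on the paths discussed above:

```python
# Locate the installed tensorrt_llm package and derive its lib/ directory,
# where libtensorrt_llm.so should live after `pip install tensorrt_llm`.
import importlib.util
import os

def find_trtllm_lib_dir():
    """Return the tensorrt_llm lib/ directory, or None if not installed."""
    spec = importlib.util.find_spec("tensorrt_llm")
    if spec is None or spec.origin is None:
        return None
    return os.path.join(os.path.dirname(spec.origin), "lib")

print(find_trtllm_lib_dir())
```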
Thanks for the comment, @lmcl90.
As I understand it, I have to follow the steps below:
```shell
git clone -b v0.11.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
git lfs install
git lfs pull
docker build --no-cache -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```
Once the image is built, use it as the base image, run `examples/quantization/quantize.py` and the `trtllm-build` command inside a new container to get the engine file, and then deploy it using `tritonserver` or `tensorrtllm_backend/scripts/launch_triton_server.py`.
Later, the same base image will be used for production inference.
Please correct me if this is the right way to do it.
Hi @lmcl90, with version 0.10.0 I have created the docker image. It was around 60 GB, which is huge.
Hi @lmcl90, could you please confirm whether I will have to use this same huge image both for building the engine file and for inference?
@nikhilcms I don't know the exact size of the image because I compile and deploy trt-llm directly on the host machine. I suggest you use the pre-built docker image for the trt-llm backend; you can find the image here.
Thanks for pointing me to the correct build image, @lmcl90.
The nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 image works for me, with the following versions:

```
tensorrt               10.1.0
tensorrt-cu12          10.1.0
tensorrt-cu12-bindings 10.1.0
tensorrt-cu12-libs     10.1.0
tensorrt-llm           0.11.0
```
@lmcl90 can you provide a solution for calling a model deployed on Triton server through an OpenAI-style endpoint?
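As far as I know, the Triton versions discussed here do not ship an OpenAI-compatible route out of the box; one common workaround is to call the HTTP generate extension of the ensemble model directly. The sketch below only builds such a request; the host, port, model name (`ensemble`), and payload field names follow typical tensorrtllm_backend examples and are assumptions for your deployment:

```python
# Sketch: build a request against Triton's HTTP generate extension.
# Host/port, model name, and payload fields are assumptions; adjust as needed.
import json
import urllib.request

def build_generate_request(host, model, prompt, max_tokens=64):
    url = f"http://{host}:8000/v2/models/{model}/generate"
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("localhost", "ensemble", "Hello!")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would then return the generated text, assuming the server is up and the ensemble model is loaded.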
Hello, I want to deploy a quantized llama-3-8b model using tritonserver. I followed the steps below:
Created a container from the nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 base image.
Cloned the TensorRT-LLM repo:

```shell
git clone -b v0.10.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
pip install -r examples/llama/requirements.txt
```
Ran quantization and the engine build using the commands below:
```shell
python3 examples/quantization/quantize.py \
    --model_dir path/to/original/weights/dir \
    --output_dir path/to/store/quantized/weights/dir \
    --dtype bfloat16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --kv_cache_dtype int8 \
    --calib_size 32
```

Ran the build script:

```shell
trtllm-build --checkpoint_dir path/to/store/quantized/weights/dir \
    --output_dir path/to/store/engine/dir \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --max_batch_size 50 \
    --max_input_len 3000 \
    --max_output_len 3000
```
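As a quick sanity check after the build, the engine directory can be listed; the exact output file names are an assumption on my part (for a single-GPU v0.10 build I would expect something like rank0.engine plus config.json):

```shell
# List the trtllm-build output; file names above are an assumption, verify locally.
ENGINE_DIR=path/to/store/engine/dir
ls "$ENGINE_DIR" 2>/dev/null || true
```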
```shell
cd tensorrtllm_backend
git submodule update --init --recursive
git lfs pull
cp /path/to/store/engine/dir/* all_models/inflight_batcher_llm/tensorrt_llm/1/
HF_LLAMA_MODEL=path/to/original/weights/dir
ENGINE_PATH=path/to/store/engine/dir

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
```
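After filling the templates, I launch the server with the helper script shipped in tensorrtllm_backend; the command I use is sketched below (world_size and the model repo path are assumptions for a single-GPU setup, shown here as a string so it is easy to adapt):

```shell
# Sketch: launch command for Triton on the filled model repo.
# world_size must match the tensor-parallel size the engine was built with.
LAUNCH_CMD="python3 scripts/launch_triton_server.py --world_size 1 --model_repo all_models/inflight_batcher_llm"
echo "$LAUNCH_CMD"
```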
Please correct me where I am wrong, as I suspect that is why I am getting the error above.