Open nikhilcms opened 1 month ago
I think you did not set LD_LIBRARY_PATH env correctly. You can refer this!
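For instance, a minimal sketch of setting it (the site-packages path below is an assumption for a Python 3.10 system-wide install; adjust it to wherever the trt-llm wheel put its libs):

```shell
# Point the dynamic loader at the trt-llm libs before starting tritonserver.
# NOTE: this path is an assumption; verify it in your own environment.
export TRT_LLM_LIB_DIR=/usr/local/lib/python3.10/dist-packages/tensorrt_llm/lib
export LD_LIBRARY_PATH=${TRT_LLM_LIB_DIR}:${LD_LIBRARY_PATH}
```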
Hi @lmcl90, actually the libtensorrt_llm.so file is missing. For me, only the three files below are present inside /opt/tritonserver/backends/tensorrtllm. Curious to know which step is responsible for creating the libtensorrt_llm.so file?

```shell
$ ls /opt/tritonserver/backends/tensorrtllm
libtriton_tensorrtllm.so  libtriton_tensorrtllm_common.so  trtllmExecutorWorker
```
@nikhilcms libtensorrt_llm.so is a product of compiling the TensorRT-LLM sources and is distributed in the whl file. You can get it by building and installing trt-llm. See lines 47 and 60 of that dockerfile.
If you use a virtualenv, you will find the .so under $VIRTUAL_ENV/lib/python3.10/site-packages/tensorrt_llm/lib after installation.
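One way to double-check where (or whether) the wheel installed the library is a small script like this; the `lib/` subdirectory layout is an assumption based on the paths discussed above:

```python
# Locate the installed tensorrt_llm package and derive its lib/ directory,
# where libtensorrt_llm.so should live after `pip install tensorrt_llm`.
import importlib.util
import os

def find_trtllm_lib_dir():
    """Return the tensorrt_llm lib/ directory, or None if not installed."""
    spec = importlib.util.find_spec("tensorrt_llm")
    if spec is None or spec.origin is None:
        return None
    return os.path.join(os.path.dirname(spec.origin), "lib")

print(find_trtllm_lib_dir())
```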
Thanks for the comment, @lmcl90.
As I understand it, I have to follow the steps below:
```shell
git clone -b v0.11.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
git lfs install
git lfs pull
docker build --no-cache -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```
Once the image is built, use it as the base image, run `examples/quantization/quantize.py` and the `trtllm-build` command inside a new container to get the engine file, and then deploy it using `tritonserver` or `tensorrtllm_backend/scripts/launch_triton_server.py`.
Later, the same base image will be used for production inference.
Please correct me if this is the right way to do it.
Hi @lmcl90, with version 0.10.0 I have created the docker image. It was around 60 GB, which is huge.
Hi @lmcl90, could you please confirm whether I will have to use this same huge image both for building the engine file and for inference?
@nikhilcms I don't know the exact size of the image because I compile and deploy trt-llm directly on the host machine. I suggest you use the pre-built docker image for the trt-llm backend; you can find the image here.
Thanks for pointing me to the correct build image, @lmcl90.
The nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 image works for me, with the following versions:

```
tensorrt               10.1.0
tensorrt-cu12          10.1.0
tensorrt-cu12-bindings 10.1.0
tensorrt-cu12-libs     10.1.0
tensorrt-llm           0.11.0
```
@lmcl90 can you provide a solution for calling a model deployed on Triton server through an OpenAI-style endpoint?
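As far as I know, the Triton versions discussed here do not ship an OpenAI-compatible route out of the box; one common workaround is to call the HTTP generate extension of the ensemble model directly. The sketch below only builds such a request; the host, port, model name (`ensemble`), and payload field names follow typical tensorrtllm_backend examples and are assumptions for your deployment:

```python
# Sketch: build a request against Triton's HTTP generate extension.
# Host/port, model name, and payload fields are assumptions; adjust as needed.
import json
import urllib.request

def build_generate_request(host, model, prompt, max_tokens=64):
    url = f"http://{host}:8000/v2/models/{model}/generate"
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("localhost", "ensemble", "Hello!")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would then return the generated text, assuming the server is up and the ensemble model is loaded.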
Hello, I want to deploy a quantized llama-3-8b model using tritonserver. I followed the steps below:
Created a container from the nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 base image.
Cloned the TensorRT-LLM repo:

```shell
git clone -b v0.10.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
pip install -r examples/llama/requirements.txt
```
Ran quantization and the engine build using the commands below:
```shell
python3 examples/quantization/quantize.py \
    --model_dir path/to/original/weights/dir \
    --output_dir path/to/store/quantized/weights/dir \
    --dtype bfloat16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --kv_cache_dtype int8 \
    --calib_size 32
```

Ran the build script:

```shell
trtllm-build --checkpoint_dir path/to/store/quantized/weights/dir \
    --output_dir path/to/store/engine/dir \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --max_batch_size 50 \
    --max_input_len 3000 \
    --max_output_len 3000
```
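As a quick sanity check after the build, the engine directory can be listed; the exact output file names are an assumption on my part (for a single-GPU v0.10 build I would expect something like rank0.engine plus config.json):

```shell
# List the trtllm-build output; file names above are an assumption, verify locally.
ENGINE_DIR=path/to/store/engine/dir
ls "$ENGINE_DIR" 2>/dev/null || true
```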
```shell
cd tensorrtllm_backend
git submodule update --init --recursive
git lfs pull
cp /path/to/store/engine/dir/* all_models/inflight_batcher_llm/tensorrt_llm/1/
HF_LLAMA_MODEL=path/to/original/weights/dir
ENGINE_PATH=path/to/store/engine/dir

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
```
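After filling the templates, I launch the server with the helper script shipped in tensorrtllm_backend; the command I use is sketched below (world_size and the model repo path are assumptions for a single-GPU setup, shown here as a string so it is easy to adapt):

```shell
# Sketch: launch command for Triton on the filled model repo.
# world_size must match the tensor-parallel size the engine was built with.
LAUNCH_CMD="python3 scripts/launch_triton_server.py --world_size 1 --model_repo all_models/inflight_batcher_llm"
echo "$LAUNCH_CMD"
```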
Please correct me where I am wrong, as I suspect that is why I am getting the error above.