[phi-3-mini-128k-instruct] Triton launch error with 24.06-trtllm-python-py3: [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/executor/executorImpl.cpp:356)

Ryan-ZL-Lin commented 1 month ago

System Info

clone repo, download model and launch tritonserver container

  1. git clone -b v0.11.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
  2. cd tensorrtllm_backend
  3. pip install -r /srv/tensorrtllm_backend/tensorrt_llm/examples/phi/requirements.txt
  4. mkdir -p /srv/tensorrtllm_backend/tensorrt_llm/examples/phi/phi-3-mini-128k-instruct
  5. git clone https://huggingface.co/microsoft/Phi-3-mini-128k-instruct /srv/tensorrtllm_backend/tensorrt_llm/examples/phi/phi-3-mini-128k-instruct
  6. docker run --runtime=nvidia -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /srv:/srv nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3

model conversion

  1. HF_PHI3_MODEL=/srv/tensorrtllm_backend/tensorrt_llm/examples/phi/phi-3-mini-128k-instruct
  2. UNIFIED_CKPT_PATH=/srv/tensorrtllm_backend/tmp/ckpt/phi/phi-3-mini-128k-instruct
  3. ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/phi/phi-3-mini-128k-instruct/fp16/4-gpu
  4. CONVERT_CHKPT_SCRIPT=/srv/tensorrtllm_backend/tensorrt_llm/examples/phi/convert_checkpoint.py
  5. python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${HF_PHI3_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16

build model engine

trtllm-build \
    --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --output_dir ${ENGINE_DIR} \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_output_len 1024 \
    --tp_size 1 \
    --pp_size 1 \
    --context_fmha disable

test inference with run.py (this ran successfully)

python3 /srv/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --engine_dir=${ENGINE_DIR} \
    --max_output_len 500 \
    --tokenizer_dir ${HF_PHI3_MODEL} \
    --input_text "<|user|>\nCan you provide ways to eat combinations of bananas and dragonfruits?<|end|>\n<|assistant|>" \
    --use_py_session

modify configuration files in preprocessing, postprocessing, tensorrt_llm_bls, ensemble and tensorrt_llm

TOKENIZER_DIR=/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct
ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/phi/phi-3-mini-128k-instruct/fp16/4-gpu
TOKENIZER_TYPE=auto
DECOUPLED_MODE=true
MODEL_FOLDER=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm
MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=10000
FILL_TEMPLATE_SCRIPT=/srv/tensorrtllm_backend/tools/fill_template.py
TRITON_BACKEND=tensorrtllm

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching
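For readers unfamiliar with the template step, the sketch below illustrates the kind of substitution the commands above perform: comma-separated key:value pairs replace ${key} placeholders in a config.pbtxt template. It is a simplified stand-in written for illustration, not the actual fill_template.py; the helper names parse_pairs and fill, the sample template text, and the sample tokenizer path are all hypothetical.

```python
from string import Template


def parse_pairs(spec: str) -> dict:
    """Turn 'a:1,b:2' into {'a': '1', 'b': '2'}, splitting on the first ':' only."""
    pairs = {}
    for item in spec.split(","):
        key, _, value = item.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs


def fill(template_text: str, spec: str) -> str:
    """Replace ${key} placeholders; unknown placeholders are left untouched."""
    return Template(template_text).safe_substitute(parse_pairs(spec))


# Hypothetical two-line template, only for demonstration.
sample = 'max_batch_size: ${triton_max_batch_size}\nstring_value: "${tokenizer_dir}"\n'
filled = fill(sample, "triton_max_batch_size:4,tokenizer_dir:/srv/models/tok")
print(filled)
```

Leaving unknown placeholders untouched means a mistyped key stays visible in the generated file, which makes this class of configuration mistake easier to spot.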

launch Triton with world_size = 4

python3 /srv/tensorrtllm_backend/scripts/launch_triton_server.py --world_size=4 --model_repo=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm

Error

E0725 08:02:04.639551 3767 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/executor/executorImpl.cpp:356)

byshiue commented 1 month ago

Why do you launch tritonserver with --world_size=4? It looks like you converted the checkpoint with tp1+pp1, and you also ran it with run.py on a single GPU.

Ryan-ZL-Lin commented 1 month ago

Hi @byshiue, thanks for looking at this issue. Now I know that world_size should be tp_size x pp_size, which are the parameters used in the build-engine script.
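The rule stated here can be checked before launching the server. The sketch below is a minimal illustration, assuming the engine directory contains a config.json with the parallelism settings under pretrained_config.mapping (tp_size, pp_size). That layout is an assumption and may differ between TensorRT-LLM versions, and the function names are hypothetical.

```python
import json
from pathlib import Path


def expected_world_size(engine_config: dict) -> int:
    """Return tp_size * pp_size from an engine config dictionary.

    Assumes the settings live under pretrained_config.mapping;
    missing values default to 1.
    """
    mapping = engine_config.get("pretrained_config", {}).get("mapping", {})
    return int(mapping.get("tp_size", 1)) * int(mapping.get("pp_size", 1))


def check_world_size(engine_dir: str, world_size: int) -> None:
    """Raise ValueError if world_size does not match the engine's tp*pp."""
    config = json.loads((Path(engine_dir) / "config.json").read_text())
    expected = expected_world_size(config)
    if world_size != expected:
        raise ValueError(
            f"world_size={world_size}, but the engine was built for tp*pp={expected}"
        )


# The engine in this report was built with --tp_size 1 --pp_size 1.
single_gpu = {"pretrained_config": {"mapping": {"tp_size": 1, "pp_size": 1}}}
print(expected_world_size(single_gpu))
```

Calling check_world_size(ENGINE_DIR, 4) against an engine built with tp1+pp1 would raise before Triton starts, instead of failing later with the kLEADER assertion.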

I changed world_size to 1 and launched Triton again. The original error is gone, but I hit another error in preprocessing and postprocessing:

E0725 09:19:49.230391 740 backend_model.cc:692] "ERROR: Failed to create instance: OSError: Incorrect path_or_model_id: '/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct'. Please provide either the path to a local folder or the repo_id of a model on the Hub."

Do you have any rough idea what might be wrong here?

Here is the more complete log:

Ryan-ZL-Lin commented 1 month ago

Oops, I made a mistake in a parameter in config.pbtxt. After fixing that, I could launch Triton successfully.
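An "Incorrect path_or_model_id" OSError like the one reported in this thread typically means the tokenizer path handed to the preprocessing and postprocessing models does not exist where the server runs. A small pre-flight check along these lines can surface that before Triton starts. This is a sketch: the function name is hypothetical, and expecting a tokenizer_config.json in the directory is an assumption about typical Hugging Face checkpoints.

```python
import os


def tokenizer_dir_problems(path: str) -> list:
    """Return a list of human-readable problems with a tokenizer directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"not a directory: {path}")
    elif not os.path.isfile(os.path.join(path, "tokenizer_config.json")):
        # Assumption: a usable checkpoint ships a tokenizer_config.json.
        problems.append(f"no tokenizer_config.json in: {path}")
    return problems


print(tokenizer_dir_problems("/path/that/does/not/exist"))
```

Running the check on the value of TOKENIZER_DIR inside the container, before filling the templates, would catch a path that points at a different checkout than the one the model was downloaded into.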

Ryan-ZL-Lin commented 1 month ago

Hi @byshiue, just to confirm: what is the right way to choose world_size when multiple GPUs are used?

With the rest of the setup unchanged, I ran an experiment and built the model engine with tp4 and pp4 like this:

trtllm-build \
    --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --output_dir ${ENGINE_DIR} \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_output_len 1024 \
    --context_fmha disable \
    --tp_size 4 \
    --pp_size 4

I thought I should use world_size = 16 this time to launch the Triton server. However, I still got the error [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified

I also tried world_size 4 and 8, but they all failed with the same error. The only way I could launch the server successfully was with world_size 1.

byshiue commented 1 month ago

I guess you are using multiple nodes, since you need 16 GPUs. However, launch_triton_server.py only supports the single GPU case for now. For multi-GPU, you need to launch with mpirun -n 16 tritonserver.
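As a rough illustration of the suggestion above, the sketch below assembles an mpirun command line whose process count equals tp_size * pp_size. It only builds the argument list and does not run anything. The function name is hypothetical, and a real multi-node launch will normally need additional MPI options (host lists and so on) that are not shown here.

```python
import shlex


def build_mpirun_command(tp_size: int, pp_size: int, model_repo: str) -> list:
    """Build an mpirun argument list with one process per tp*pp rank."""
    world_size = tp_size * pp_size
    return [
        "mpirun", "-n", str(world_size),
        "tritonserver", f"--model-repository={model_repo}",
    ]


# tp4 x pp4, as in the experiment described in this thread.
cmd = build_mpirun_command(4, 4, "/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm")
print(shlex.join(cmd))
```

Deriving the process count from tp_size and pp_size, rather than typing it by hand, keeps the MPI world size consistent with the values passed to trtllm-build.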

github-actions[bot] commented 4 days ago

