NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[phi-3-mini-128k-instruct] Triton launch error with 24.06-trtllm-python-py3: [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/executor/executorImpl.cpp:356) #2021

Open Ryan-ZL-Lin opened 1 month ago

Ryan-ZL-Lin commented 1 month ago

System Info

Who can help?

QiJune @byshiue

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...) Follow official instruction HERE

Reproduction

clone repo, download model and launch tritonserver container

  1. git clone -b v0.11.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
  2. cd tensorrtllm_backend
  3. pip install -r /srv/tensorrtllm_backend/tensorrt_llm/examples/phi/requirements.txt
  4. mkdir -p /srv/tensorrtllm_backend/tensorrt_llm/examples/phi/phi-3-mini-128k-instruct
  5. git clone https://huggingface.co/microsoft/Phi-3-mini-128k-instruct /srv/tensorrtllm_backend/tensorrt_llm/examples/phi/phi-3-mini-128k-instruct
  6. docker run --runtime=nvidia -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /srv:/srv nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3

model conversion

  1. HF_PHI3_MODEL=/srv/tensorrtllm_backend/tensorrt_llm/examples/phi/phi-3-mini-128k-instruct
  2. UNIFIED_CKPT_PATH=/srv/tensorrtllm_backend/tmp/ckpt/phi/phi-3-mini-128k-instruct
  3. ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/phi/phi-3-mini-128k-instruct/fp16/4-gpu
  4. CONVERT_CHKPT_SCRIPT=/srv/tensorrtllm_backend/tensorrt_llm/examples/phi/convert_checkpoint.py
  5. python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${HF_PHI3_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16

build model engine

trtllm-build \
    --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --output_dir ${ENGINE_DIR} \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_output_len 1024 \
    --tp_size 1 \
    --pp_size 1 \
    --context_fmha disable

tested the inference result successfully by running this script

python3 /srv/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --engine_dir=${ENGINE_DIR} \
    --max_output_len 500 \
    --tokenizer_dir ${HF_PHI3_MODEL} \
    --input_text "<|user|>\nCan you provide ways to eat combinations of bananas and dragonfruits?<|end|>\n<|assistant|>" \
    --use_py_session

modify configuration files in preprocessing, postprocessing, tensorrt_llm_bls, ensemble and tensorrt_llm

TOKENIZER_DIR=/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct
ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/phi/phi-3-mini-128k-instruct/fp16/4-gpu
TOKENIZER_TYPE=auto
DECOUPLED_MODE=true
MODEL_FOLDER=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm
MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=10000
FILL_TEMPLATE_SCRIPT=/srv/tensorrtllm_backend/tools/fill_template.py
TRITON_BACKEND=tensorrtllm

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching

launch Triton with world_size = 4

python3 /srv/tensorrtllm_backend/scripts/launch_triton_server.py --world_size=4 --model_repo=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm

Error

E0725 08:02:04.639551 3767 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/executor/executorImpl.cpp:356)

byshiue commented 1 month ago

Why do you launch the tritonserver with --world_size=4? It looks like you converted the checkpoint with tp=1 and pp=1, and you also ran run.py on a single GPU.

Ryan-ZL-Lin commented 1 month ago

Hi @byshiue, thanks for looking into this issue. Now I know that world_size should be tp_size x pp_size, i.e. the parameters used in the engine-build step.
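For anyone hitting the same assertion, here is a minimal sketch of that relationship for the tp1/pp1 engine built above (the variable names are only illustrative):

# world_size passed to launch_triton_server.py must equal the tp_size * pp_size
# used at engine-build time; for the engine above that is 1 * 1 = 1
TP_SIZE=1
PP_SIZE=1
WORLD_SIZE=$((TP_SIZE * PP_SIZE))

python3 /srv/tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size=${WORLD_SIZE} \
    --model_repo=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm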

I changed world_size to 1 and launched Triton again. The original error is gone, but preprocessing and postprocessing now fail with another error: E0725 09:19:49.230391 740 backend_model.cc:692] "ERROR: Failed to create instance: OSError: Incorrect path_or_model_id: '/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct'. Please provide either the path to a local folder or the repo_id of a model on the Hub." Do you have any rough idea what might be wrong here?

Here is a more complete log:

+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0725 09:19:55.521721 740 server.cc:631]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                  |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
|             |                                                                 | e-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}}             |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
|             |                                                                 | e-capability":"6.000000","default-max-batch-size":"4"}}                                                 |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+

I0725 09:19:55.521789 740 server.cc:674]
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                                                                                                     |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing   | 1       | UNAVAILABLE: Internal: OSError: Incorrect path_or_model_id: '/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct'. Please provide either the path to a |
|                  |         |  local folder or the repo_id of a model on the Hub.                                                                                                        |
|                  |         |                                                                                                                                                            |
|                  |         | At:                                                                                                                                                        |
|                  |         |   /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py(462): cached_file                                                                      |
|                  |         |   /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(637): get_tokenizer_config                                         |
|                  |         |   /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(804): from_pretrained                                              |
|                  |         |   /srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm/postprocessing/1/model.py(81): initialize                                                            |
| preprocessing    | 1       | UNAVAILABLE: Internal: OSError: Incorrect path_or_model_id: '/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct'. Please provide either the path to a |
|                  |         |  local folder or the repo_id of a model on the Hub.                                                                                                        |
|                  |         |                                                                                                                                                            |
|                  |         | At:                                                                                                                                                        |
|                  |         |   /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py(462): cached_file                                                                      |
|                  |         |   /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(637): get_tokenizer_config                                         |
|                  |         |   /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(804): from_pretrained                                              |
|                  |         |   /srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm/preprocessing/1/model.py(81): initialize                                                             |
| tensorrt_llm     | 1       | READY                                                                                                                                                      |
| tensorrt_llm_bls | 1       | READY                                                                                                                                                      |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0725 09:19:55.655074 740 metrics.cc:877] "Collecting metrics for GPU 0: Tesla T4"
I0725 09:19:55.655104 740 metrics.cc:877] "Collecting metrics for GPU 1: Tesla T4"
I0725 09:19:55.655112 740 metrics.cc:877] "Collecting metrics for GPU 2: Tesla T4"
I0725 09:19:55.655119 740 metrics.cc:877] "Collecting metrics for GPU 3: Tesla T4"
I0725 09:19:55.681101 740 metrics.cc:770] "Collecting CPU metrics"
I0725 09:19:55.682461 740 tritonserver.cc:2579]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                               |
| server_version                   | 2.47.0                                                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_me |
|                                  | mory binary_tensor_data parameters statistics trace logging                                                                                          |
| model_repository_path[0]         | /srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm                                                                                                  |
| model_control_mode               | MODE_NONE                                                                                                                                            |
| strict_model_config              | 1                                                                                                                                                    |
| model_config_name                |                                                                                                                                                      |
| rate_limit                       | OFF                                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                             |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                             |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                             |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                             |
| min_supported_compute_capability | 6.0                                                                                                                                                  |
| strict_readiness                 | 1                                                                                                                                                    |
| exit_timeout                     | 30                                                                                                                                                   |
| cache_enabled                    | 0                                                                                                                                                    |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
Ryan-ZL-Lin commented 1 month ago

Oops, I made a mistake in one of the parameters in config.pbtxt; after fixing that, I could launch Triton successfully.
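Judging from the error above, the wrong parameter was most likely tokenizer_dir, which pointed at /srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct instead of the directory the model was actually cloned into. A sketch of the corrected calls, assuming the clone path from the reproduction steps:

# presumed fix: point tokenizer_dir at the directory the model was cloned into
TOKENIZER_DIR=/srv/tensorrtllm_backend/tensorrt_llm/examples/phi/phi-3-mini-128k-instruct

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}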

Ryan-ZL-Lin commented 1 month ago

Hi @byshiue, just to confirm: what is the right approach to decide world_size when multiple GPUs are used?

With all of the setup remaining the same as above, I ran an experiment and built the model engine with tp_size 4 and pp_size 4, like this:

trtllm-build \
    --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --output_dir ${ENGINE_DIR} \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_output_len 1024 \
    --context_fmha disable \
    --tp_size 4 \
    --pp_size 4

I thought I should use world_size = 16 this time to launch the Triton server; however, I still got the error [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified

I also tried world_size 4 and 8, but they all failed with the same error. The only configuration that launched the server successfully was world_size 1.

byshiue commented 1 month ago

I guess you are using multi-node because you need 16 GPUs. But launch_triton_server.py only supports the single-GPU case for now. For multi-GPU, you need to use mpirun -n 16 tritonserver to launch.
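A rough sketch of that kind of launch, reusing the model repository path from above; in practice each rank may also need the per-rank options that launch_triton_server.py would normally add (ports, backend config, etc.), which are omitted here:

# one MPI rank per model rank (16 = tp_size * pp_size)
mpirun -n 16 \
    tritonserver --model-repository=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm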

github-actions[bot] commented 4 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.