Ryan-ZL-Lin opened this issue 1 month ago
Why do you launch tritonserver with --world_size=4? It looks like you converted the checkpoint with tp=1 and pp=1, and you also ran it with run.py on a single GPU.
Hi @byshiue, thanks for looking at this issue. Now I understand that world_size should be tp_size x pp_size, the product of the parallelism parameters used in the engine-build script.
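Concretely, a minimal sketch of that relationship for the tp1/pp1 case, reusing the script and model-repo paths from this thread:

# world_size must equal the tp_size * pp_size the engine was built with.
TP_SIZE=1
PP_SIZE=1
WORLD_SIZE=$((TP_SIZE * PP_SIZE))   # 1 * 1 = 1

python3 /srv/tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size=${WORLD_SIZE} \
    --model_repo=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm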
I changed world_size to 1 and launched Triton again. The original error is gone, but preprocessing and postprocessing now fail with:
E0725 09:19:49.230391 740 backend_model.cc:692] "ERROR: Failed to create instance: OSError: Incorrect path_or_model_id: '/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct'. Please provide either the path to a local folder or the repo_id of a model on the Hub."
Do you have any rough idea what might be wrong here? Here is a more complete log:
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0725 09:19:55.521721 740 server.cc:631]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
I0725 09:19:55.521789 740 server.cc:674]
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | UNAVAILABLE: Internal: OSError: Incorrect path_or_model_id: '/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct'. Please provide either the path to a |
| | | local folder or the repo_id of a model on the Hub. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py(462): cached_file |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(637): get_tokenizer_config |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(804): from_pretrained |
| | | /srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm/postprocessing/1/model.py(81): initialize |
| preprocessing | 1 | UNAVAILABLE: Internal: OSError: Incorrect path_or_model_id: '/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct'. Please provide either the path to a |
| | | local folder or the repo_id of a model on the Hub. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py(462): cached_file |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(637): get_tokenizer_config |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(804): from_pretrained |
| | | /srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm/preprocessing/1/model.py(81): initialize |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0725 09:19:55.655074 740 metrics.cc:877] "Collecting metrics for GPU 0: Tesla T4"
I0725 09:19:55.655104 740 metrics.cc:877] "Collecting metrics for GPU 1: Tesla T4"
I0725 09:19:55.655112 740 metrics.cc:877] "Collecting metrics for GPU 2: Tesla T4"
I0725 09:19:55.655119 740 metrics.cc:877] "Collecting metrics for GPU 3: Tesla T4"
I0725 09:19:55.681101 740 metrics.cc:770] "Collecting CPU metrics"
I0725 09:19:55.682461 740 tritonserver.cc:2579]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.47.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_me |
| | mory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| model_config_name | |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
Oops, I made a mistake in one of the config.pbtxt parameters; after fixing that bug I could launch Triton successfully.
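For anyone hitting the same OSError: the tokenizer path lands in the preprocessing and postprocessing config.pbtxt as a parameter shaped roughly like the fragment below (the value shown is illustrative; it must point at a directory that actually contains the tokenizer files):

parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct"
  }
}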
Hi @byshiue, just to confirm: what is the right way to decide world_size when multiple GPUs are used? With all other setup unchanged from above, I experimented with building the engine with tp4 and pp4, like this:
trtllm-build \
--checkpoint_dir ${UNIFIED_CKPT_PATH} \
--output_dir ${ENGINE_DIR} \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 1024 \
--max_output_len 1024 \
--context_fmha disable \
--tp_size 4 \
--pp_size 4
I thought I should use world_size = 16 this time to launch the Triton server; however, I still got the error
[TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified
I also tried world_size 4 and 8, but they all failed with the same error. The only value that launched the server successfully was world_size 1.
I guess you are using multiple nodes because you need 16 GPUs. But launch_triton_server.py only supports the single-GPU case for now. For multi-GPU, you need to launch with mpirun -n 16 tritonserver.
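A minimal sketch of such a launch, reusing the model-repo path from this thread (--allow-run-as-root is my assumption here, typically needed when running as root inside the container):

mpirun --allow-run-as-root -n 16 \
    tritonserver --model-repository=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm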
System Info
4x Tesla T4 GPUs; Triton Server 2.47.0 (per the server log above).
Who can help?
@QiJune @byshiue
Reproduction
1. Clone the repo, download the model, and launch the tritonserver container.
2. Convert the model checkpoint.
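The conversion command itself was not included in the report; a typical invocation for the TensorRT-LLM phi example would look roughly like this (script name and flags assumed from that example, not taken from this issue):

python3 /srv/TensorRT-LLM/examples/phi/convert_checkpoint.py \
    --model_dir /srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct \
    --output_dir ${UNIFIED_CKPT_PATH} \
    --dtype float16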
3. Build the model engine:
trtllm-build \
    --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --output_dir ${ENGINE_DIR} \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_output_len 1024 \
    --tp_size 1 \
    --pp_size 1 \
    --context_fmha disable
4. Test the inference result successfully by running this script:
python3 /srv/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --engine_dir=${ENGINE_DIR} \
    --max_output_len 500 \
    --tokenizer_dir ${HF_PHI3_MODEL} \
    --input_text "<|user|>\nCan you provide ways to eat combinations of bananas and dragonfruits?<|end|>\n<|assistant|>" \
    --use_py_session
5. Modify the configuration files in preprocessing, postprocessing, tensorrt_llm_bls, ensemble, and tensorrt_llm:
TOKENIZER_DIR=/srv/TensorRT-LLM/examples/phi/phi-3-mini-128k-instruct
ENGINE_DIR=/srv/tensorrtllm_backend/tmp/engine/phi/phi-3-mini-128k-instruct/fp16/4-gpu
TOKENIZER_TYPE=auto
DECOUPLED_MODE=true
MODEL_FOLDER=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm
MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=10000
FILL_TEMPLATE_SCRIPT=/srv/tensorrtllm_backend/tools/fill_template.py
TRITON_BACKEND=tensorrtllm
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching
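Before launching, a quick sanity check (my own suggestion, using the variables above) to confirm the tokenizer path was substituted correctly and actually exists:

grep -A 3 tokenizer_dir ${MODEL_FOLDER}/preprocessing/config.pbtxt
ls ${TOKENIZER_DIR}   # should list the tokenizer files (e.g. tokenizer_config.json)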
6. Launch Triton with world_size = 4:
python3 /srv/tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size=4 \
    --model_repo=/srv/Phi3_mini_128k_Model_Repo/inflight_batcher_llm
Error:
E0725 08:02:04.639551 3767 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/executor/executorImpl.cpp:356)