kelkarn opened this issue 7 months ago
Hi, for Phi-2, please use the following commands:
python ./convert_checkpoint.py --model_dir "microsoft/phi-2" --output_dir ./phi-2-checkpoint --dtype float16
--tp_size should be set with trtllm-build
trtllm-build \
--checkpoint_dir ./phi-2-checkpoint \
--output_dir ./phi-2-engine \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 1024 \
--max_output_len 1024 \
--tp_size 2 \
--workers 2
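Once the engine is built, a quick sanity check outside Triton can help confirm it loads; a minimal sketch, assuming the standard examples/run.py script and that the two-rank engine is launched via mpirun (paths and prompt are placeholders):
mpirun -n 2 --allow-run-as-root \
python ../run.py \
--engine_dir ./phi-2-engine \
--tokenizer_dir microsoft/phi-2 \
--input_text "What is machine learning?" \
--max_output_len 64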
For historical reasons, this model has some special usage cases.
@hijkzzz - that does not work for me; I get this error:
usage: trtllm-build [-h] [--checkpoint_dir CHECKPOINT_DIR] [--model_config MODEL_CONFIG] [--build_config BUILD_CONFIG] [--model_cls_file MODEL_CLS_FILE]
[--model_cls_name MODEL_CLS_NAME] [--timing_cache TIMING_CACHE] [--log_level LOG_LEVEL] [--profiling_verbosity {layer_names_only,detailed,none}]
[--enable_debug_output] [--output_dir OUTPUT_DIR] [--workers WORKERS] [--max_batch_size MAX_BATCH_SIZE] [--max_input_len MAX_INPUT_LEN]
[--max_output_len MAX_OUTPUT_LEN] [--max_beam_width MAX_BEAM_WIDTH] [--max_num_tokens MAX_NUM_TOKENS]
[--max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE] [--use_fused_mlp] [--gather_all_token_logits] [--gather_context_logits]
[--gather_generation_logits] [--strongly_typed] [--builder_opt BUILDER_OPT] [--logits_dtype {float16,float32}] [--weight_only_precision {int8,int4}]
[--bert_attention_plugin {float16,float32,bfloat16,disable}] [--gpt_attention_plugin {float16,float32,bfloat16,disable}]
[--gemm_plugin {float16,float32,bfloat16,disable}] [--lookup_plugin {float16,float32,bfloat16,disable}] [--lora_plugin {float16,float32,bfloat16,disable}]
[--context_fmha {enable,disable}] [--context_fmha_fp32_acc {enable,disable}] [--paged_kv_cache {enable,disable}] [--remove_input_padding {enable,disable}]
[--use_custom_all_reduce {enable,disable}] [--multi_block_mode {enable,disable}] [--enable_xqa {enable,disable}]
[--attention_qk_half_accumulation {enable,disable}] [--tokens_per_block TOKENS_PER_BLOCK] [--use_paged_context_fmha {enable,disable}]
[--use_context_fmha_for_generation {enable,disable}]
trtllm-build: error: unrecognized arguments: --tp_size 2
I am using TRT-LLM v0.8.0 in a 24.02-trtllm-python-py3 Triton container.
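For reference, in other TRT-LLM model examples the tensor-parallel size is passed to the checkpoint converter rather than to trtllm-build, roughly like the sketch below; I have not confirmed that the Phi-2 convert_checkpoint.py in v0.8.0 accepts --tp_size, so treat this as an assumption:
python ./convert_checkpoint.py --model_dir "microsoft/phi-2" --output_dir ./phi-2-checkpoint --dtype float16 --tp_size 2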
Please use the latest TRT-LLM.
@hijkzzz is that compatible with Triton 24.02? The support matrix says that only version v0.8.0 of TRT-LLM is compatible with Triton 24.02.
@byshiue - can you please help me understand what the resolution here is? Are we saying that Phi-2 with TRT-LLM v0.8.0, on an A100 (160 GB), on Triton server 24.02 is not expected to work?
@kelkarn Could you please try to use the latest version of TRT-LLM?
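If upgrading outside the prebuilt container is an option, the usual pip route is roughly the following (a sketch, assuming the public NVIDIA PyPI index; note the Triton 24.02 container itself ships v0.8.0):
pip3 install --upgrade tensorrt_llm --extra-index-url https://pypi.nvidia.com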
Environment
If applicable, please include the following:
CPU architecture: x86_64
CPU/Host memory size: 440 GiB
GPU properties
GPU name: A100
GPU memory size: 160 GB
I am using the Azure offering of this GPU: Standard NC48ads A100 v4 (48 vCPUs, 440 GiB memory)
Libraries
TensorRT-LLM branch or tag: v0.8.0
Container used: 24.02-trtllm-python-py3
NVIDIA driver version: 535.161.07
OS: Ubuntu 22.04 (Jammy)
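If it helps to double-check the environment, something like the following should report the library and driver versions (a sketch; the Python import assumes tensorrt_llm is importable inside the container):
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv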
Reproduction Steps
Followed steps here: https://github.com/NVIDIA/TensorRT-LLM/tree/5955b8afbad2ddcc3156202b16c567e94c52248f/examples/phi
From within the examples/phi folder:
1. Build checkpoint with tp_size = 2, pp_size = 1
2. Build engine
3. Run in Triton (after copying to the models folder; see the launch sketch below)
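For step 3, a typical launch with the tensorrtllm_backend helper looks roughly like this (a sketch; the model repository path is a placeholder, and with a tp_size = 2 engine the world size must match, i.e. 2):
python3 scripts/launch_triton_server.py --world_size 2 --model_repo /path/to/triton_model_repo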
Expected Behavior
I expected Triton server to start normally and show the GRPC/Metrics/HTTP port numbers at the end (8001, 8002, 8000).
Actual Behavior
Triton server just hangs. I am using the 24.02-trtllm-python-py3 version. Here are the raw logs with --log-verbose=1:
After a few minutes, I tried calling the endpoint, but it seems it does not work because the server is hung:
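For context, the kind of call I mean is the standard generate request from the tensorrtllm_backend examples (a sketch; the ensemble model name and request fields are assumptions about the model repo layout):
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'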
Additional Notes
I used Phi-2 from here: https://huggingface.co/microsoft/phi-2
I wonder if this is similar to this other issue: https://github.com/triton-inference-server/tensorrtllm_backend/issues/377
It would be great if NVIDIA could repro this.