System Info
GPU: A100-PCIe-80GB
TensorRT-LLM version: 0.13.0.dev2024082000
OS: Ubuntu 22.04
Who can help?
@Tracin @n
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Set variables
INFERENCE_PRECISION = "float16"
WEIGHT_ONLY_PRECISION = "int8"
MAX_BEAM_WIDTH = 4
MAX_BATCH_SIZE = 8
checkpoint_dir = f"distil_whisper_largeweights{WEIGHT_ONLY_PRECISION}"
output_dir = f"distil_whisper_large_en{WEIGHT_ONLY_PRECISION}"
Construct the command
command = f"""python3 TensorRT-LLM/examples/whisper/convert_checkpoint.py \
    --use_weight_only \
    --weight_only_precision {WEIGHT_ONLY_PRECISION} \
    --output_dir {checkpoint_dir} \
    --model_name distil-large-v3"""
Run the command
!{command}
INFERENCE_PRECISION = "float16"
WEIGHT_ONLY_PRECISION = "int8"
MAX_BEAM_WIDTH = 4
MAX_BATCH_SIZE = 8
checkpoint_dir = f"distil_whisper_largeweights{WEIGHT_ONLY_PRECISION}"
output_dir = f"distil_whisper_large_en{WEIGHT_ONLY_PRECISION}"
build_command_encoder = f"""trtllm-build \
    --checkpoint_dir {checkpoint_dir}/encoder \
    --output_dir {output_dir}/encoder \
    --paged_kv_cache diable \
    --moe_plugin disable \
    --lora_plugin {INFERENCE_PRECISION} \
    --enable_xqa disable \
    --max_batch_size {MAX_BATCH_SIZE} \
    --max_lora_rank 8 \
    --lora_dir ./whisper-medium-argus_CT_augment_oct/checkpoint-500/adapter_model/ \
    --bert_attention_plugin {INFERENCE_PRECISION} \
    --gemm_plugin disable \
    --remove_input_padding disable \
    --max_input_len 1500 \
    --lora_target_modules "attn_q" "attn_v" """
build_command_decoder = f"""trtllm-build --checkpoint_dir {checkpoint_dir}/decoder \
    --output_dir {output_dir}/decoder \
    --paged_kv_cache diable \
    --moe_plugin disable \
    --lora_plugin {INFERENCE_PRECISION} \
    --enable_xqa disable \
    --max_beam_width {MAX_BEAM_WIDTH} \
    --max_batch_size {MAX_BATCH_SIZE} \
    --max_seq_len 114 \
    --max_input_len 14 \
    --max_encoder_input_len 1500 \
    --max_lora_rank 8 \
    --lora_dir ./whisper-medium-argus_CT_augment_oct/checkpoint-500/adapter_model/ \
    --gemm_plugin {INFERENCE_PRECISION} \
    --bert_attention_plugin {INFERENCE_PRECISION} \
    --gpt_attention_plugin {INFERENCE_PRECISION} \
    --remove_input_padding disable \
    --lora_target_modules "attn_q" "attn_v"
"""
print("\nBuilding Encoder with trtllm-build...") !{build_command_encoder}
print("\nBuilding Decoder with trtllm-build...") !{build_command_decoder}
Expected behavior
Engine created successfully.
actual behavior
Building Encoder with trtllm-build...
[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024082000
[08/23/2024-11:44:31] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set gemm_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set lookup_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set lora_plugin to float16.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set moe_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set context_fmha to True.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set remove_input_padding to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set reduce_fusion to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set enable_xqa to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set multiple_profiles to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set paged_state to True.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set streamingllm to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set use_fused_mlp to True.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set paged_kv_cache to diable.
[08/23/2024-11:44:31] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 128
[08/23/2024-11:44:31] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_audio_ctx = 1500
[08/23/2024-11:44:31] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 100
[08/23/2024-11:44:37] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/23/2024-11:44:37] [TRT-LLM] [I] Set dtype to float16.
[08/23/2024-11:44:37] [TRT-LLM] [W] Overriding paged_state to False
[08/23/2024-11:44:37] [TRT-LLM] [I] Set paged_state to False.
[08/23/2024-11:44:37] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 2048
[08/23/2024-11:44:37] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
Traceback (most recent call last):
File "/home/jalagurajah/anaconda3/envs/WTrt/bin/trtllm-build", line 8, in
sys.exit(main())
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 525, in main
parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 386, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 353, in build_and_save
engine = build_model(build_config,
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 346, in build_model
return build(model, build_config)
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1004, in build
model = optimize_model_with_config(model, build_config)
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 776, in optimize_model_with_config
model.use_lora(build_config.lora_config)
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/top_model_mixin.py", line 54, in use_lora
raise NotImplementedError("Subclass shall override this")
NotImplementedError: Subclass shall override this
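For context on this encoder failure (reconstructed only from the traceback above, not from the TensorRT-LLM source): trtllm-build routes --lora_plugin/--lora_dir through model.use_lora(), and the base method in top_model_mixin.py is a stub that each model class is expected to override, roughly:

# Hedged sketch of the stub hit by the traceback above; the Whisper encoder
# model class apparently does not override it, so LoRA cannot be attached to
# the encoder build in this version.
class TopModelMixin:
    def use_lora(self, lora_config):
        raise NotImplementedError("Subclass shall override this")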
Building Decoder with trtllm-build...
[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024082000
[08/23/2024-11:44:42] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set gemm_plugin to float16.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set lookup_plugin to None.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set lora_plugin to float16.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set moe_plugin to None.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set context_fmha to True.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set remove_input_padding to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set reduce_fusion to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set enable_xqa to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set multiple_profiles to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set paged_state to True.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set streamingllm to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set use_fused_mlp to True.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set paged_kv_cache to diable.
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.use_prompt_tuning = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_type = 0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_attention_qkvo_bias = True
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_mlp_bias = True
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_model_final_layernorm = True
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_layernorm = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_scale = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.ffn_hidden_size = 5120
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.q_scaling = 1.0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_position = 0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.relative_attention = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.max_distance = 0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_buckets = 0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.model_type = whisper
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rescale_before_lm_head = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.encoder_hidden_size = 1280
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.encoder_num_heads = 20
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.encoder_head_size = None
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.skip_cross_qkv = False
[08/23/2024-11:44:42] [TRT-LLM] [I] Set dtype to float16.
[08/23/2024-11:44:42] [TRT-LLM] [W] Overriding paged_state to False
[08/23/2024-11:44:42] [TRT-LLM] [I] Set paged_state to False.
[08/23/2024-11:44:42] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
[08/23/2024-11:44:44] [TRT] [I] [MemUsageChange] Init CUDA: CPU +16, GPU +0, now: CPU 150, GPU 39876 (MiB)
[08/23/2024-11:44:46] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1952, GPU +356, now: CPU 2257, GPU 40232 (MiB)
[08/23/2024-11:44:46] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[08/23/2024-11:44:46] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16.
[08/23/2024-11:44:46] [TRT-LLM] [I] Set nccl_plugin to None.
Traceback (most recent call last):
File "/home/jalagurajah/anaconda3/envs/WTrt/bin/trtllm-build", line 8, in
sys.exit(main())
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 525, in main
parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 386, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 353, in build_and_save
engine = build_model(build_config,
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 346, in build_model
return build(model, build_config)
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1101, in build
model(inputs)
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in call
output = self.forward(*args, *kwargs)
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/models/enc_dec/model.py", line 1142, in forward
hidden_states = decoder_layer(
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in call
output = self.forward(args, kwargs)
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/models/enc_dec/model.py", line 428, in forward
attention_output = self.self_attention(
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in call
output = self.forward(*args, *kwargs)
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/layers/attention.py", line 911, in forward
context, past_key_value = gpt_attention(
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/graph_rewriting.py", line 561, in wrapper
outs = f(*args, **kwargs)
File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 4919, in gpt_attention
"paged_kv_cache", np.array(paged_kv_cache_flag, dtype=np.int32),
ValueError: invalid literal for int() with base 10: 'diable'
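For context on this decoder failure: the ValueError appears to be triggered by the value diable (note the spelling) passed to --paged_kv_cache in the build command above; by the time it reaches gpt_attention, that string is cast into an int32 plugin field. A minimal standalone illustration of the same cast error (not TensorRT-LLM code):

import numpy as np

# Casting a non-numeric string to int32 reproduces the exact message seen in
# the traceback above.
np.array("diable", dtype=np.int32)
# ValueError: invalid literal for int() with base 10: 'diable'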
additional notes
I want to add LoRA/DoRA adapters to the Whisper model.