NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to add lora adapter to whisper models? #2149

Closed: Jeevi10 closed this issue 2 months ago

Jeevi10 commented 3 months ago

System Info

A100-PCIe-80GB TensorRT-LLM version: 0.13.0.dev2024082000 ubuntu 22.04

Who can help?

@Tracin @n

Information

Tasks

Reproduction

Set variables

INFERENCE_PRECISION = "float16"
WEIGHT_ONLY_PRECISION = "int8"
MAX_BEAM_WIDTH = 4
MAX_BATCH_SIZE = 8
checkpoint_dir = f"distil_whisper_largeweights{WEIGHT_ONLY_PRECISION}"
output_dir = f"distil_whisper_large_en{WEIGHT_ONLY_PRECISION}"

Construct the command

command = f"""
python3 TensorRT-LLM/examples/whisper/convert_checkpoint.py \
    --use_weight_only \
    --weight_only_precision {WEIGHT_ONLY_PRECISION} \
    --output_dir {checkpoint_dir} \
    --model_name distil-large-v3
"""

Run the command

!{command}

INFERENCE_PRECISION = "float16"
WEIGHT_ONLY_PRECISION = "int8"
MAX_BEAM_WIDTH = 4
MAX_BATCH_SIZE = 8
checkpoint_dir = f"distil_whisper_largeweights{WEIGHT_ONLY_PRECISION}"
output_dir = f"distil_whisper_large_en{WEIGHT_ONLY_PRECISION}"

build_command_encoder = f"""
trtllm-build \
    --checkpoint_dir {checkpoint_dir}/encoder \
    --output_dir {output_dir}/encoder \
    --paged_kv_cache diable \
    --moe_plugin disable \
    --lora_plugin {INFERENCE_PRECISION} \
    --enable_xqa disable \
    --max_batch_size {MAX_BATCH_SIZE} \
    --max_lora_rank 8 \
    --lora_dir ./whisper-medium-argus_CT_augment_oct/checkpoint-500/adapter_model/ \
    --bert_attention_plugin {INFERENCE_PRECISION} \
    --gemm_plugin disable \
    --remove_input_padding disable \
    --max_input_len 1500 \
    --lora_target_modules "attn_q" "attn_v"
"""

build_command_decoder = f"""
trtllm-build \
    --checkpoint_dir {checkpoint_dir}/decoder \
    --output_dir {output_dir}/decoder \
    --paged_kv_cache diable \
    --moe_plugin disable \
    --lora_plugin {INFERENCE_PRECISION} \
    --enable_xqa disable \
    --max_beam_width {MAX_BEAM_WIDTH} \
    --max_batch_size {MAX_BATCH_SIZE} \
    --max_seq_len 114 \
    --max_input_len 14 \
    --max_encoder_input_len 1500 \
    --max_lora_rank 8 \
    --lora_dir ./whisper-medium-argus_CT_augment_oct/checkpoint-500/adapter_model/ \
    --gemm_plugin {INFERENCE_PRECISION} \
    --bert_attention_plugin {INFERENCE_PRECISION} \
    --gpt_attention_plugin {INFERENCE_PRECISION} \
    --remove_input_padding disable \
    --lora_target_modules "attn_q" "attn_v"
"""

print("\nBuilding Encoder with trtllm-build...")
!{build_command_encoder}

print("\nBuilding Decoder with trtllm-build...")
!{build_command_decoder}
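For reference, the same builds can be launched outside a notebook, where the !{...} cell magic is unavailable. A minimal sketch, assuming build_command_encoder and build_command_decoder are the strings defined above:

import subprocess

# Plain-Python equivalent of the notebook's !{...} magic; check=True raises
# CalledProcessError if trtllm-build exits with a non-zero status.
for cmd in (build_command_encoder, build_command_decoder):
    subprocess.run(cmd, shell=True, check=True)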

Expected behavior

Engine created successfully.

actual behavior

Building Encoder with trtllm-build...
[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024082000
[08/23/2024-11:44:31] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set gemm_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set lookup_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set lora_plugin to float16.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set moe_plugin to None.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set context_fmha to True.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set remove_input_padding to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set reduce_fusion to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set enable_xqa to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set multiple_profiles to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set paged_state to True.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set streamingllm to False.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set use_fused_mlp to True.
[08/23/2024-11:44:31] [TRT-LLM] [I] Set paged_kv_cache to diable.
[08/23/2024-11:44:31] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 128
[08/23/2024-11:44:31] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_audio_ctx = 1500
[08/23/2024-11:44:31] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 100
[08/23/2024-11:44:37] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/23/2024-11:44:37] [TRT-LLM] [I] Set dtype to float16.
[08/23/2024-11:44:37] [TRT-LLM] [W] Overriding paged_state to False
[08/23/2024-11:44:37] [TRT-LLM] [I] Set paged_state to False.
[08/23/2024-11:44:37] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 2048
[08/23/2024-11:44:37] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
Traceback (most recent call last):
  File "/home/jalagurajah/anaconda3/envs/WTrt/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 525, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 386, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 353, in build_and_save
    engine = build_model(build_config,
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 346, in build_model
    return build(model, build_config)
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1004, in build
    model = optimize_model_with_config(model, build_config)
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 776, in optimize_model_with_config
    model.use_lora(build_config.lora_config)
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/top_model_mixin.py", line 54, in use_lora
    raise NotImplementedError("Subclass shall override this")
NotImplementedError: Subclass shall override this
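The encoder traceback ends in tensorrt_llm/top_model_mixin.py, which suggests the Whisper model classes do not override the base-class LoRA hook. A simplified sketch of what the quoted frame implies (not the full TensorRT-LLM source, just the shape of the failure):

# Simplified from the frame quoted above: use_lora() is only a stub on the
# base mixin, so a model that never overrides it (apparently the Whisper
# encoder here) fails as soon as --lora_plugin / --lora_dir are passed.
class TopModelMixin:
    def use_lora(self, lora_config):
        raise NotImplementedError("Subclass shall override this")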

Building Decoder with trtllm-build...
[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024082000
[08/23/2024-11:44:42] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set gemm_plugin to float16.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set lookup_plugin to None.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set lora_plugin to float16.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set moe_plugin to None.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set context_fmha to True.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set remove_input_padding to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set reduce_fusion to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set enable_xqa to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set multiple_profiles to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set paged_state to True.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set streamingllm to False.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set use_fused_mlp to True.
[08/23/2024-11:44:42] [TRT-LLM] [I] Set paged_kv_cache to diable.
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.use_prompt_tuning = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_type = 0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_attention_qkvo_bias = True
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_mlp_bias = True
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_model_final_layernorm = True
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_layernorm = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_scale = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.ffn_hidden_size = 5120
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.q_scaling = 1.0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_position = 0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.relative_attention = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.max_distance = 0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_buckets = 0
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.model_type = whisper
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rescale_before_lm_head = False
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.encoder_hidden_size = 1280
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.encoder_num_heads = 20
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.encoder_head_size = None
[08/23/2024-11:44:42] [TRT-LLM] [W] Implicitly setting PretrainedConfig.skip_cross_qkv = False
[08/23/2024-11:44:42] [TRT-LLM] [I] Set dtype to float16.
[08/23/2024-11:44:42] [TRT-LLM] [W] Overriding paged_state to False
[08/23/2024-11:44:42] [TRT-LLM] [I] Set paged_state to False.
[08/23/2024-11:44:42] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
[08/23/2024-11:44:44] [TRT] [I] [MemUsageChange] Init CUDA: CPU +16, GPU +0, now: CPU 150, GPU 39876 (MiB)
[08/23/2024-11:44:46] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1952, GPU +356, now: CPU 2257, GPU 40232 (MiB)
[08/23/2024-11:44:46] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[08/23/2024-11:44:46] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16.
[08/23/2024-11:44:46] [TRT-LLM] [I] Set nccl_plugin to None.
Traceback (most recent call last):
  File "/home/jalagurajah/anaconda3/envs/WTrt/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 525, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 386, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 353, in build_and_save
    engine = build_model(build_config,
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 346, in build_model
    return build(model, build_config)
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1101, in build
    model(**inputs)
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/models/enc_dec/model.py", line 1142, in forward
    hidden_states = decoder_layer(
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/models/enc_dec/model.py", line 428, in forward
    attention_output = self.self_attention(
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/layers/attention.py", line 911, in forward
    context, past_key_value = gpt_attention(
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/graph_rewriting.py", line 561, in wrapper
    outs = f(*args, **kwargs)
  File "/home/jalagurajah/anaconda3/envs/WTrt/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 4919, in gpt_attention
    "paged_kv_cache", np.array(paged_kv_cache_flag, dtype=np.int32),
ValueError: invalid literal for int() with base 10: 'diable'
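The decoder failure looks unrelated to LoRA: --paged_kv_cache diable (a typo for disable; the warning at the top of the log also notes the option is deprecated in favor of --kv_cache_type=paged/disabled) reaches gpt_attention as a raw string that cannot be converted to the plugin's integer flag. A tiny illustration of the final frame, which only mimics the NumPy conversion rather than the real trtllm-build code path:

import numpy as np

# 'diable' is the misspelled --paged_kv_cache value from the build command;
# converting it to an int32 flag raises the same error as in the log:
# ValueError: invalid literal for int() with base 10: 'diable'
np.array("diable", dtype=np.int32)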

additional notes

I want to add LoRA/DoRA adapters to the Whisper model.
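For context, a minimal sketch of inspecting the adapter that --lora_dir points at (assumptions: the checkpoint under ./whisper-medium-argus_CT_augment_oct/checkpoint-500/adapter_model/ was saved with Hugging Face PEFT and therefore contains an adapter_config.json, and the peft package is installed):

from peft import PeftConfig

# Load the adapter's config to compare its rank and target modules with the
# values passed to trtllm-build (--max_lora_rank, --lora_target_modules).
cfg = PeftConfig.from_pretrained(
    "./whisper-medium-argus_CT_augment_oct/checkpoint-500/adapter_model/"
)
print(cfg)  # for a LoRA adapter this includes r and target_modules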

lfr-0531 commented 2 months ago

Similar request to https://github.com/NVIDIA/TensorRT-LLM/issues/2136.