NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

TRT-LLM Support for Llama3.2 #2320

Open JoJoLev opened 1 week ago

JoJoLev commented 1 week ago

Is there support for Llama 3.2 with TensorRT-LLM? I tried an engine build but got a RoPE error. Maybe it is related to the context length? Thanks.

imihic commented 1 week ago

Are you running the TensorRT-LLM Docker container? Try running pip install --upgrade transformers in the container before converting/building the Llama 3.2 engine.
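
A minimal sketch of that step inside the container (the >= 4.45 expectation is an assumption based on the Llama 3.2 requirement discussed in the next comment):

# run inside the TensorRT-LLM container, before converting/building the checkpoint
pip install --upgrade transformers
python3 -c "import transformers; print(transformers.__version__)"   # expect >= 4.45 for Llama 3.2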

JoJoLev commented 1 week ago

TRT-LLM's highest supported transformers version is 4.42.4, but Llama 3.2's config.json wants transformers 4.45.
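
A quick way to see the mismatch (a hedged sketch: ${MODEL_PATH} is a placeholder for the downloaded checkpoint, and the transformers_version field is assumed to be present in its config.json):

python3 -c "import transformers; print('installed:', transformers.__version__)"
python3 -c "import json; print('required: ', json.load(open('${MODEL_PATH}/config.json'))['transformers_version'])"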

malikkirchner commented 1 week ago

I tried to build quantizations for the Llama 3.2 models. The following works for meta-llama/Llama-3.2-3B-Instruct:

time \
../scripts/quantize.py  --model_dir "${MODEL_PATH}" \
                        --output_dir "${QUANT_PATH}-int8-kv-int4-awq" \
                        --dtype float16 \
                        --qformat w4a8_awq \
                        --awq_block_size 128 \
                        --kv_cache_dtype int8 \
                        --calib_size 32 \
                        --tp_size 4

time \
trtllm-build    --checkpoint_dir "${QUANT_PATH}-int8-kv-int4-awq" \
                --output_dir "${ENGINE_PATH}-int8-kv-int4-awq" \
                --gemm_plugin auto \
                --weight_streaming \
                --max_batch_size 8

time \
mpirun -n 4 --allow-run-as-root ../scripts/summarize.py --test_trt_llm \
                                                        --hf_model_dir "${MODEL_PATH}" \
                                                        --tokenizer_dir "${MODEL_PATH}" \
                                                        --data_type fp16 \
                                                        --engine_dir "${ENGINE_PATH}-int8-kv-int4-awq" \
                                                        --test_hf

Using tensorrt-llm==0.13.0 with transformers==4.45.2, following: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.13.0/examples/llama#llama-v3-updates
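
For a quick smoke test of the resulting engine, something like the command below should work (a hedged sketch: it assumes run.py from the TensorRT-LLM examples has been copied into the same scripts folder as the other scripts, reuses the placeholder paths above, and uses -n 4 to match the tp_size from the quantization step):

time \
mpirun -n 4 --allow-run-as-root ../scripts/run.py  --engine_dir "${ENGINE_PATH}-int8-kv-int4-awq" \
                                                   --tokenizer_dir "${MODEL_PATH}" \
                                                   --max_output_len 64 \
                                                   --input_text "Summarize what KV-cache quantization does."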

Running the same quantization on meta-llama/Llama-3.2-11B-Vision-Instruct yields:

# trtllm-build --checkpoint_dir quantizations/meta-llama--llama-3.2-11b-vision-instruct-int8-kv-int4-awq --output_dir engines/meta-llama--llama-3.2-11b-vision-instruct-int8-kv-int4-awq --gemm_plugin auto --gpt_attention_plugin bfloat16 --weight_streaming --max_batch_size 8
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
[10/12/2024-16:47:32] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set gemm_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set nccl_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set lookup_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set lora_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set moe_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set context_fmha to True.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set remove_input_padding to True.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set reduce_fusion to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set enable_xqa to True.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set tokens_per_block to 64.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set multiple_profiles to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set paged_state to True.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set streamingllm to False.
[10/12/2024-16:47:32] [TRT-LLM] [I] Set use_fused_mlp to True.
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.15.1'}
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[10/12/2024-16:47:32] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 575, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 429, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 396, in build_and_save
    engine = build_model(build_config,
  File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 365, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
  File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 483, in from_checkpoint
    preprocess_weights(weights, config, from_pruned=is_checkpoint_pruned)
  File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 1228, in preprocess_weights
    activation_scaling_factor = weights.pop(
KeyError: 'transformer.layers.13.attention.dense.activation_scaling_factor'

Are "Vision" models supported? Or is this a support for Llama 3.2 issue? Got the same results for TensorRT-LLM 0.14.0.dev2024100800.

The plain checkpoint conversion also fails for the Vision models:

+ ../scripts/convert_checkpoint.py --model_dir /home/runner/.cache/huggingface/hub/models--meta-llama--Llama-3.2-11B-Vision/snapshots/3f2e93603aaa5dd142f27d34b06dfa2b6e97b8be --output_dir checkpoints/meta-llama--llama-3.2-11b-vision --dtype float16
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
0.13.0
Traceback (most recent call last):
  File "/workspace/models/../scripts/convert_checkpoint.py", line 505, in <module>
    main()
  File "/workspace/models/../scripts/convert_checkpoint.py", line 497, in main
    convert_and_save_hf(args)
  File "/workspace/models/../scripts/convert_checkpoint.py", line 439, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/workspace/models/../scripts/convert_checkpoint.py", line 446, in execute
    f(args, rank)
  File "/workspace/models/../scripts/convert_checkpoint.py", line 425, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 320, in from_hugging_face
    config = LLaMAConfig.from_hugging_face(hf_config_or_dir,
  File "/home/runner/.local/lib/python3.10/site-packages/tensorrt_llm/models/llama/config.py", line 120, in from_hugging_face
    hf_config.num_attention_heads)
  File "/home/runner/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 202, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'MllamaConfig' object has no attribute 'num_attention_heads'

Using tensorrt-llm==0.13.0 with transformers==4.45.2.
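
One way to see why LLaMAConfig.from_hugging_face trips up here (a hedged sketch; it assumes the Mllama attention settings live under text_config rather than at the top level of the config):

python3 -c "from transformers import AutoConfig; cfg = AutoConfig.from_pretrained('meta-llama/Llama-3.2-11B-Vision-Instruct'); print(type(cfg).__name__, cfg.text_config.num_attention_heads)"

convert_checkpoint.py reads num_attention_heads from the top-level config, which MllamaConfig does not expose, hence the AttributeError above.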

JoJoLev commented 1 week ago

Thanks for sharing, Malik, this worked for me! For some reason the checkpoint build for Llama doesn't work, though; I have to go through the quantization path.

malikkirchner commented 1 week ago

@JoJoLev could you please share the code and the relevant dependency versions that worked for you? Does the trtllm-build step work for you?

crslen commented 1 week ago

I'm seeing the same error as @malikkirchner when running the convert checkpoint command. It appears convert_checkpoint.py cannot read the config.json file from meta-llama/Llama-3.2-11B-Vision-Instruct.

JoJoLev commented 1 week ago

Mine works for Llama 3.2 3B going down the quantize.py path and building the engine from there, with the transformers version mentioned above. I cannot build an engine with the checkpoint approach in examples/llama, which is how I normally go about it.

GaneshDoosa commented 1 week ago

While converting the checkpoints for the Llama 3.2 models, I'm facing the following error after a few iterations, but the same was working fine for the Llama 3 8B model. Can anyone look into it? Does it require a change to the convert_checkpoint.py file?

117it [00:00, 649.28it/s]
Traceback (most recent call last):
  File "/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 487, in <module>
    main()
  File "/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 479, in main
    convert_and_save_hf(args)
  File "/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 421, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 428, in execute
    f(args, rank)
  File "/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 410, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 363, in from_hugging_face
    loader.generate_tllm_weights(model)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 326, in generate_tllm_weights
    tllm_weights.update(self.load(tllm_key))
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 268, in load
    v = sub_module.postprocess(tllm_key, v, postprocess_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/linear.py", line 378, in postprocess
    weights = weights.to(str_dtype_to_torch(self.dtype))
AttributeError: 'NoneType' object has no attribute 'to'

https://github.com/NVIDIA/TensorRT-LLM/issues/2339

JoJoLev commented 3 days ago

Just to report: I got a Llama 3.2 3B engine built on TensorRT-LLM 0.13 using the quantize.py setup. I am running it on Triton v24.09. Blazingly fast.
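
(For anyone following along: the exact Triton setup is not shown in this thread, but a typical deployment with the tensorrtllm_backend repo looks roughly like the sketch below; the model repository path, engine directory, and parameter values are placeholders, and the template keys should be checked against the tensorrtllm_backend README for the release in use.)

# copy the inflight-batcher template model repository and point it at the built engine
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm triton_model_repo
python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "engine_dir:/engines/llama-3.2-3b,triton_max_batch_size:8,decoupled_mode:true,batching_strategy:inflight_fused_batching,max_beam_width:1"
# the preprocessing/postprocessing/ensemble config.pbtxt files need their own fill_template passes

# launch Triton (e.g. the v24.09 container) with the filled-in model repository
python3 tensorrtllm_backend/scripts/launch_triton_server.py --world_size 1 --model_repo triton_model_repo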

ishandhanani commented 1 hour ago

Can you share how you got it up and running on Triton?